Pass@N

Pass rate

Tasks solved / tasks in scope, up to 2 attempts (strict per-set denominator).

Formula: (tasks_passed_attempt_1 + tasks_passed_attempt_2_only) / task_set_size

Includes unattempted tasks as failures. Scope-aware; reflects active filters (set, category, difficulty). Primary ranking metric.
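A minimal sketch of the formula above, assuming the three counts are already available as integers. The function name and the example numbers are hypothetical; treating an empty scope as 0.0 is an assumption made to match the "0/0" display, not something the page states.

```python
def pass_rate(passed_attempt_1: int, passed_attempt_2_only: int, task_set_size: int) -> float:
    """Pass rate over a strict denominator: every task in scope counts,
    whether or not the model attempted it."""
    if task_set_size == 0:
        # Assumption: an empty scope is reported as 0.0 rather than raising.
        return 0.0
    return (passed_attempt_1 + passed_attempt_2_only) / task_set_size


# Hypothetical scope of 10 tasks: 6 pass on the first attempt, 2 pass only on
# the second attempt, 1 fails both attempts, 1 is never attempted.
rate = pass_rate(6, 2, 10)
print(f"{rate:.1%}")  # 80.0%
```

The unattempted task lowers the result because it stays in the denominator; dividing by attempted tasks only would give 8/9 instead.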

0.0%
Tasks pass 0/0
1st: 0 2nd: 0 Failed: 0
Avg cost / task

Average LLM cost per distinct benchmark task in USD.

Formula: SUM(cost_usd) / COUNT(DISTINCT task_id) across all the model's results in scope.

Use to compare operating cost across models with similar pass rates. Does not account for quality. Combine with $/Pass for a cost-efficiency view.
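The formula can be sketched as follows, assuming each result row is a dict carrying the `task_id` and `cost_usd` fields named above. The function name and sample rows are hypothetical.

```python
def avg_cost_per_task(rows: list[dict]) -> float:
    """SUM(cost_usd) / COUNT(DISTINCT task_id) over the rows in scope."""
    task_ids = {row["task_id"] for row in rows}
    if not task_ids:
        # Assumption: no results yet is shown as $0.00.
        return 0.0
    return sum(row["cost_usd"] for row in rows) / len(task_ids)


# Hypothetical rows: task "t1" needed a retry, so it appears twice.
rows = [
    {"task_id": "t1", "cost_usd": 0.25},
    {"task_id": "t1", "cost_usd": 0.25},
    {"task_id": "t2", "cost_usd": 0.50},
]
print(avg_cost_per_task(rows))  # 0.5
```

Because the denominator counts distinct tasks, retries raise the average cost of a task instead of diluting it.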

$0.00
Latency p50

Median per-task wall time (LLM call + compile + test), in milliseconds.

Formula: 50th percentile of per-task duration_ms: LLM latency + compile time + test time.

Use p50 for a typical-case latency expectation. The median is largely unaffected by a few outlier slow tasks.
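A small illustration of why the median is a reasonable typical-case figure, using Python's standard `statistics` module. The durations are made-up values, and whether the dashboard interpolates between samples is not stated on the page.

```python
import statistics

# Hypothetical per-task duration_ms values; the last task stalled.
durations_ms = [1200, 1500, 1800, 2100, 60000]

p50 = statistics.median(durations_ms)
mean = statistics.mean(durations_ms)

print(p50)   # 1800
print(mean)  # 13320
```

The single slow task moves the mean by an order of magnitude while the median stays at the middle observation.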

0ms
Avg score

Avg attempt score

Mean per-attempt score on a 0–100 point scale (partial credit). Drill-down only.

Formula: Mean of attempt scores across all results rows: SUM(score) / COUNT(*) over the results table. Each attempt earns 0–100 points based on compile + test outcomes.

Drill-down companion to pass_at_n. Rewards partial credit but not directly comparable to pass rate; use for within-model analysis.
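A sketch of the per-attempt mean, assuming one row per attempt with a `score` field in the 0 to 100 range. The `attempt` field, the function name, and the sample rows are hypothetical.

```python
def avg_attempt_score(rows: list[dict]) -> float:
    """SUM(score) / COUNT(*) over every attempt row, not over tasks."""
    if not rows:
        return 0.0  # assumption: no attempts is displayed as 0.0 / 100
    return sum(row["score"] for row in rows) / len(rows)


# Hypothetical rows: task "a" fails its first attempt with partial credit and
# passes on retry; task "b" passes on the first attempt.
rows = [
    {"task_id": "a", "attempt": 1, "score": 50},
    {"task_id": "a", "attempt": 2, "score": 100},
    {"task_id": "b", "attempt": 1, "score": 100},
]
print(round(avg_attempt_score(rows), 1))  # 83.3
```

Both tasks end up solved, so the pass rate for this sample would be 100%, yet the attempt-level mean is lower because the failed first try is counted as its own observation.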

0.0 / 100
All-runs pass rate

Fraction of tasks the model solved in every single run (strict consistency, also written pass^n).

Formula: tasks where ALL runs produced a passing result / tasks_attempted_distinct

Measures reliability under repetition. High value means the model is unlikely to regress on a re-run, important for CI and production use. Formal name in the literature: pass^n.
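The consistency check can be sketched like this, assuming one row per (run, task) pair with a boolean final outcome. The `run_id` and `passed` field names are hypothetical; only `task_id` appears in the formulas on this page.

```python
from collections import defaultdict


def all_runs_pass_rate(rows: list[dict]) -> float:
    """Tasks that passed in every run / distinct tasks attempted."""
    outcomes = defaultdict(list)
    for row in rows:
        outcomes[row["task_id"]].append(row["passed"])
    if not outcomes:
        return 0.0  # assumption: no attempted tasks is displayed as 0.0%
    consistent = sum(1 for results in outcomes.values() if all(results))
    return consistent / len(outcomes)


# Hypothetical outcomes for three tasks over three runs.
rows = [
    {"run_id": r, "task_id": t, "passed": p}
    for t, results in {
        "t1": [True, True, True],     # passes every run
        "t2": [True, False, True],    # flaky: fails once
        "t3": [False, False, False],  # never passes
    }.items()
    for r, p in enumerate(results, start=1)
]
rate = all_runs_pass_rate(rows)
print(f"{rate:.1%}")  # 33.3%
```

Task `t2` passes in two of three runs but still counts as a failure here, which is what makes this metric stricter than a pass rate that accepts any successful run.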

0.0%
$/Pass

Total cost divided by number of distinct tasks passed. Lower is better.

Formula: SUM(cost_usd) / tasks_passed_distinct across all runs.

Best single cost-efficiency metric. Penalises expensive models that pass few tasks and rewards cheap models with high pass rates.
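A minimal sketch of the ratio, assuming each row has `task_id`, `cost_usd`, and a boolean `passed` flag (the flag name is hypothetical). Returning `None` when nothing has passed is an assumption; the page does not say how a zero denominator is handled.

```python
from typing import Optional


def cost_per_pass(rows: list[dict]) -> Optional[float]:
    """SUM(cost_usd) over all rows / number of distinct tasks passed."""
    passed_tasks = {row["task_id"] for row in rows if row["passed"]}
    if not passed_tasks:
        # Assumption: the metric is undefined when no task has passed.
        return None
    return sum(row["cost_usd"] for row in rows) / len(passed_tasks)


# Hypothetical rows: "t1" passes on its second attempt, "t2" never passes.
rows = [
    {"task_id": "t1", "cost_usd": 0.5, "passed": False},
    {"task_id": "t1", "cost_usd": 0.5, "passed": True},
    {"task_id": "t2", "cost_usd": 1.0, "passed": False},
]
print(cost_per_pass(rows))  # 2.0
```

The cost of failed attempts stays in the numerator, so money spent on tasks that never pass makes every successful task look more expensive.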

Latency p95

95th-percentile per-task wall time. Captures tail latency.

Formula: 95th percentile of per-task duration_ms across all tasks in all runs.

Use p95 to understand tail latency: 95% of tasks finish at or below this time. A low p95 means the model rarely stalls, which matters for automated pipelines with timeouts.
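One way to compute such a percentile is the nearest-rank method, sketched below. The page does not state which percentile definition the dashboard uses, so the method, the function name, and the sample durations are all assumptions.

```python
import math


def percentile_nearest_rank(values: list[float], pct: int) -> float:
    """Smallest value such that at least pct% of observations are <= it."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    # Nearest-rank: 1-based rank = ceil(pct/100 * n), clamped to at least 1.
    rank = max(math.ceil(pct * len(ordered) / 100), 1)
    return ordered[rank - 1]


# Hypothetical duration_ms values: 100, 200, ..., 2000 (20 tasks).
durations_ms = list(range(100, 2100, 100))
p95 = percentile_nearest_rank(durations_ms, 95)
print(p95)  # 1900
```

With 20 observations the 95th percentile lands on the 19th-slowest value, so the single slowest task sits above it. Interpolating definitions can give slightly different numbers on small samples.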

0ms

Overview

Qwen: Qwen3.6 35B A3B has run on 0 occasions, attempting 0 tasks with an average score of 0.0 / 100.

Settings

Generation parameters used across this model's runs. "varies" indicates the value differed between runs.

Temperature
varies
Thinking budget
varies
Avg tokens / run (input + output)
Consistency

History

No runs yet.

Cost

No cost data yet.

Shortcomings

AL concepts Qwen: Qwen3.6 35B A3B struggles with. Each row lists a description, the correct pattern, and observed error codes.

Shortcomings analysis queued

Queued for analysis. This section will populate once the run is processed.

Recent runs

Runs
Started | Run | Model | Tasks | Score | Cost | Duration | Status

Methodology

Pass rate counts unattempted tasks as failures (strict denominator). Avg score is the mean of every attempt the model produced, so a failed first try that triggered a retry still contributes one observation and pulls the mean down. See the about page for the full breakdown and unit conventions.