Rankings

Leaderboard
# Model

Solve AUC@2

Attempt-adjusted solve rate: first-try solve = 1.0, second-attempt-only = 0.5.

Formula: (pass_at_1 + pass_at_n) / 2 = (2·tasks_passed_attempt_1 + tasks_passed_attempt_2_only) / (2·task_set_size)

Primary ranking metric. Rewards first-try correctness over fail-then-repair without ignoring the two-attempt protocol. Spreads out headline scores that pass_at_n compresses toward the ceiling. Significance is assessed via paired bootstrap (tier bands), not Wilson intervals.
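To make the weighting concrete, here is a minimal Python sketch of the formula above. The function name and the example counts are hypothetical; only the formula itself comes from the metric definition.

```python
def solve_auc_at_2(passed_attempt_1: int, passed_attempt_2_only: int, task_set_size: int) -> float:
    """Attempt-adjusted solve rate: 1.0 per first-try solve, 0.5 per retry-only solve."""
    if task_set_size <= 0:
        raise ValueError("task_set_size must be positive")
    return (2 * passed_attempt_1 + passed_attempt_2_only) / (2 * task_set_size)


# Hypothetical counts: 10 tasks, 6 solved on the first try, 2 solved only on the retry.
pass_at_1 = 6 / 10
pass_at_n = (6 + 2) / 10
auc = solve_auc_at_2(6, 2, 10)

# Both forms of the formula agree.
assert abs(auc - (pass_at_1 + pass_at_n) / 2) < 1e-12
print(auc)  # 0.7
```

A model that solves everything on the first try scores 1.0, while one that needs the retry for every task scores 0.5, even though both have the same pass_at_n.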

Avg attempt score

Mean per-attempt score on a 0–100 point scale (partial credit). Drill-down only.

Formula: Mean of attempt scores across all results rows: SUM(score) / COUNT(*) over the results table. Each attempt earns 0–100 points based on compile + test outcomes.

Drill-down companion to pass_at_n. Rewards partial credit but is not directly comparable to pass rate; use for within-model analysis.
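A small Python sketch of the mean described above, assuming result rows are available as dictionaries. The row layout and the `attempt` field are illustrative assumptions, not the actual results schema.

```python
def avg_attempt_score(rows):
    """Mean of per-attempt scores: SUM(score) / COUNT(*) over result rows."""
    if not rows:
        raise ValueError("no result rows in scope")
    return sum(r["score"] for r in rows) / len(rows)


# Hypothetical rows: t1 fails, then passes on the retry; t2 passes first try.
rows = [
    {"task_id": "t1", "attempt": 1, "score": 0},
    {"task_id": "t1", "attempt": 2, "score": 100},
    {"task_id": "t2", "attempt": 1, "score": 100},
]
print(round(avg_attempt_score(rows), 1))  # 66.7
```

Both tasks end up solved here, so the pass rate would be 100%, yet the average attempt score is 66.7. This is why the two numbers should not be compared directly.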

Pass rate

Tasks solved / tasks in scope, up to 2 attempts (strict per-set denominator).

Formula: (tasks_passed_attempt_1 + tasks_passed_attempt_2_only) / task_set_size

Includes unattempted tasks as failures. Scope-aware; reflects active filters (set, category, difficulty). Final assisted solve rate with up to 2 attempts; drill-down companion to Solve AUC@2.

First-try pass rate

Tasks solved on the first attempt / tasks in scope (strict).

Formula: tasks_passed_attempt_1 / task_set_size

Measures single-shot accuracy without retry credit. Useful when comparing models where the second attempt is not available.
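The two strict rates above can be sketched in Python as follows. The data layout (a mapping from task ID to a list of per-attempt pass flags) is an assumption made for illustration. The point is that the denominator is the full task set, so a task with no attempts counts as a failure.

```python
def strict_rates(task_set, attempts):
    """Compute pass_at_1 and pass_at_n over the full task set.

    attempts maps task_id -> list of per-attempt pass flags (hypothetical layout).
    Tasks missing from attempts are unattempted and count as failures.
    """
    if not task_set:
        raise ValueError("task_set must not be empty")
    first = sum(1 for t in task_set if attempts.get(t, [])[:1] == [True])
    second_only = sum(1 for t in task_set if attempts.get(t, [])[:2] == [False, True])
    n = len(task_set)
    return {"pass_at_1": first / n, "pass_at_n": (first + second_only) / n}


task_set = ["a", "b", "c", "d"]
attempts = {
    "a": [True],          # solved on the first try
    "b": [False, True],   # solved only on the retry
    "c": [False, False],  # failed both attempts
    # "d" was never attempted
}
print(strict_rates(task_set, attempts))  # {'pass_at_1': 0.25, 'pass_at_n': 0.5}
```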

Repair

Repair rate

Share of first-attempt failures the model fixed on attempt 2.

Formula: (pass_at_n − pass_at_1) / (1 − pass_at_1); defined as 0 when pass_at_1 = 1.

Conditional recovery skill: high = good at reading compiler/test errors and patching. Profile column, not a ranking metric.
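As a rough Python illustration of the repair-rate formula, including the stated convention for a perfect first-try rate (the function name and example values are hypothetical):

```python
def repair_rate(pass_at_1: float, pass_at_n: float) -> float:
    """Share of first-attempt failures fixed on attempt 2."""
    if pass_at_1 >= 1.0:
        # Nothing failed on the first attempt, so the rate is defined as 0.
        return 0.0
    return (pass_at_n - pass_at_1) / (1.0 - pass_at_1)


print(repair_rate(0.5, 0.75))  # 0.5: half of the first-try failures were repaired
print(repair_rate(1.0, 1.0))   # 0.0: no failures to repair
print(repair_rate(0.0, 1.0))   # 1.0: every task failed first, then was fixed
```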

Confidence ±

Pass Rate 95% CI

95% Wilson confidence interval on the pass rate.

Formula: Wilson score interval: center ± half-width, where n = strict denominator (task_set_size or category/difficulty-scoped count when taskSetHash is provided; falls back to tasks_attempted_distinct for legacy callers).

Use to judge whether a lead over another model is statistically meaningful. Wide CIs indicate too few tasks to draw firm conclusions.
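The following Python sketch computes a Wilson score interval as center and half-width. It uses the standard Wilson formula with z ≈ 1.96 for 95% coverage; the function name and signature are illustrative and not taken from the leaderboard's code.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.959964):
    """Return (center, half_width) of the Wilson score interval for successes/n."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center, half_width


# One task in scope, solved: the interval is very wide.
center, half = wilson_interval(1, 1)
print(f"±{half:.1%}")  # ±39.7%
```

With a single task in scope, a 1/1 result gives a half-width of about 39.7 percentage points, which is too wide to separate models from each other.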

Avg cost / task

Average LLM cost per distinct benchmark task in USD.

Formula: SUM(cost_usd) / COUNT(DISTINCT task_id) across all the model's results in scope.

Use to compare operating cost across models with similar pass rates. Does not account for quality. Combine with $/Pass for a cost-efficiency view.

$/Pass

Total cost divided by number of distinct tasks passed. Lower is better.

Formula: SUM(cost_usd) / tasks_passed_distinct across all runs.

Best single cost-efficiency metric. Penalises expensive models that pass few tasks and rewards cheap models with high pass rates.
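The two cost formulas above differ only in the denominator, which a short Python sketch can show side by side. The row layout and the `passed` flag are assumptions for illustration, as is returning `None` when no task was passed (the source does not say how that case is handled).

```python
def cost_metrics(rows):
    """Return (avg_cost_per_task, cost_per_pass) in USD for the rows in scope."""
    total = sum(r["cost_usd"] for r in rows)
    tasks = {r["task_id"] for r in rows}
    passed = {r["task_id"] for r in rows if r["passed"]}
    avg_cost_per_task = total / len(tasks) if tasks else None
    # Assumption: $/Pass is undefined (None) when nothing was passed.
    cost_per_pass = total / len(passed) if passed else None
    return avg_cost_per_task, cost_per_pass


# Hypothetical rows: t1 is solved on the retry, t2 fails both attempts.
rows = [
    {"task_id": "t1", "cost_usd": 0.02, "passed": False},
    {"task_id": "t1", "cost_usd": 0.03, "passed": True},
    {"task_id": "t2", "cost_usd": 0.05, "passed": False},
    {"task_id": "t2", "cost_usd": 0.05, "passed": False},
]
avg_cost, per_pass = cost_metrics(rows)
print(f"${avg_cost:.4f} per task, ${per_pass:.4f} per pass")
# $0.0750 per task, $0.1500 per pass
```

Failed attempts still add to the numerator, so $/Pass rises whenever money is spent on tasks that never get solved.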

Latency p95

95th-percentile per-task wall time. Captures tail latency.

Formula: 95th percentile of per-task duration_ms across all tasks in all runs.

Use p95 to understand worst-case latency. A low p95 means the model rarely stalls, which matters for automated pipelines with timeouts.
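A minimal Python sketch of a 95th-percentile computation over per-task durations. The source does not say which percentile method is used; the nearest-rank method here is an assumption, and interpolating methods can give slightly different values.

```python
def p95_nearest_rank(durations_ms):
    """95th percentile by the nearest-rank method (assumed, not confirmed by the source)."""
    if not durations_ms:
        raise ValueError("no durations in scope")
    ordered = sorted(durations_ms)
    # Integer form of ceil(0.95 * n), avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Hypothetical durations: 20 tasks taking 1s, 2s, ..., 20s.
durations_ms = [1000 * i for i in range(1, 21)]
print(p95_nearest_rank(durations_ms))  # 19000
```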

Last seen
1 | Claude Opus 4.8 claude claude-opus-4-8 | 100.0 | 100.0 / 100 | 100.0 1/1 | 0.0% | ±39.7% | $0.08 | $0.0759 | 18.3s | 17h ago
Tier 2
2 | Gemini 3.5 Flash gemini gemini-3.5-flash | 100.0 | 100.0 / 100 | 100.0 1/1 | 0.0% | ±39.7% | $0.02 | $0.0181 | 33.1s | 17h ago
3 | Gemini 3.1 Pro Preview gemini gemini-3.1-pro-preview (66K) | 100.0 | 100.0 / 100 | 100.0 1/1 | 0.0% | ±39.7% | $0.01 | $0.0120 | 22.3s | 17h ago
4 | Claude Sonnet 4 6 claude claude-sonnet-4-6 | 100.0 | 100.0 / 100 | 100.0 1/1 | 0.0% | ±39.7% | $0.03 | $0.0344 | 18.4s | 1d ago
5 | GPT-5.5 gpt gpt-5.5 | 100.0 | 100.0 / 100 | 100.0 1/1 | 0.0% | ±39.7% | $0.07 | $0.0712 | 24.9s | 14d ago
6 | Claude Opus 4.6 claude claude-opus-4-6 | 100.0 | 100.0 / 100 | 100.0 1/1 | 0.0% | ±39.7% | $0.03 | $0.0282 | 18.8s | 14d ago
7 | Claude Opus 4.7 claude claude-opus-4-7 | 100.0 | 100.0 / 100 | 100.0 1/1 | 0.0% | ±39.7% | $0.08 | $0.0765 | 18.9s | 17h ago
Tier 3
8 | Claude Haiku 4 5 20251001 claude claude-haiku-4-5-20251001 | 100.0 | 50.0 / 100 | 100.0 1/1 | 100.0% | ±39.7% | $0.01 | $0.0126 | 16.7s | 1d ago