Rankings

Leaderboard
#Model

Solve AUC@2

Skill score: full credit for a first-try solve, half for a retry solve. Not the solve rate.

Formula: AUC@2 = (pass@1 + solve@2) / 2

Use as the headline ranking metric. It rewards first-try correctness over fail-then-repair without ignoring the two-attempt protocol, and it spreads out scores that pass_at_n compresses near the ceiling. Assess significance with the paired bootstrap (shown as tier bands), not the Wilson interval.
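A minimal sketch of the AUC@2 formula above. It assumes solve@2 is the fraction of tasks solved within two attempts (first-try solves plus retry solves), which matches the "full credit, half credit" description. The function and parameter names are hypothetical, not the leaderboard's own code.

```python
def auc_at_2(first_try_solved: int, retry_solved: int, total_tasks: int) -> float:
    """AUC@2 = (pass@1 + solve@2) / 2, returned as a fraction in [0, 1].

    Assumption: solve@2 counts tasks solved on the first try or on the retry.
    """
    if total_tasks <= 0:
        raise ValueError("total_tasks must be positive")
    pass_at_1 = first_try_solved / total_tasks
    solve_at_2 = (first_try_solved + retry_solved) / total_tasks
    return (pass_at_1 + solve_at_2) / 2


# Hypothetical model: 16 tasks, 10 solved first try, 4 more solved on retry.
score = auc_at_2(first_try_solved=10, retry_solved=4, total_tasks=16)
print(f"{score:.4f}")  # 0.7500
```

Under that assumption the result equals (10 + 0.5 × 4) / 16: each first-try solve earns full credit and each retry solve earns half.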

CI

Pass Rate 95% CI

95% Wilson confidence interval on the pass rate.

Formula: Wilson score interval: center ± half-width, where n is the strict denominator: task_set_size, or the category/difficulty-scoped count, when taskSetHash is provided; otherwise tasks_attempted_distinct (legacy callers).

Use to judge whether a lead over another model is statistically meaningful. Wide CIs indicate too few tasks to draw firm conclusions.
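For reference, a sketch of the standard Wilson score interval in center ± half-width form. The function name and the fixed z = 1.96 (two-sided 95%) are assumptions for illustration; the inputs in the example are hypothetical and are not meant to reproduce the ± values shown on the leaderboard.

```python
import math


def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Return (center, half_width) of the Wilson score interval for passes / n."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = passes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center, half_width


# Hypothetical example: 13 passes out of a strict denominator of 16 tasks.
center, half_width = wilson_interval(13, 16)
print(f"{center:.3f} ± {half_width:.3f}")
```

Note that the center is pulled toward 0.5 relative to the raw pass rate, so the interval stays inside [0, 1] even at 0% or 100%.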

Avg cost / task

Average LLM cost per distinct benchmark task in USD.

Formula: SUM(cost_usd) / COUNT(DISTINCT task_id) across all the model's results in scope.

Use to compare operating cost across models with similar pass rates. Does not account for quality. Combine with $/Pass for a cost-efficiency view.
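The formula above, sketched over in-memory result rows. The row shape (dicts with task_id and cost_usd keys) is an assumption based on the field names in the formula; the sample data is made up.

```python
def avg_cost_per_task(results: list[dict]) -> float:
    """SUM(cost_usd) / COUNT(DISTINCT task_id) over the given result rows."""
    distinct_tasks = {row["task_id"] for row in results}
    if not distinct_tasks:
        raise ValueError("no results in scope")
    total_cost = sum(row["cost_usd"] for row in results)
    return total_cost / len(distinct_tasks)


# Hypothetical rows: task "a" was attempted twice, task "b" once.
rows = [
    {"task_id": "a", "cost_usd": 0.10},
    {"task_id": "a", "cost_usd": 0.05},
    {"task_id": "b", "cost_usd": 0.15},
]
print(f"${avg_cost_per_task(rows):.2f}")  # $0.15
```

Because the denominator counts distinct tasks, a retry on the same task raises the total cost without raising the task count.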

Latency p95

95th-percentile per-task wall time. Captures tail latency.

Formula: 95th percentile of per-task duration_ms across all tasks in all runs.

Use p95 to gauge near-worst-case latency: 95% of tasks finish at or under this time. A low p95 means the model rarely stalls, which matters for automated pipelines with timeouts.
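A sketch of a p95 computation over per-task duration_ms values. The source does not say which percentile method is used; this example assumes the nearest-rank method, and the function name is hypothetical.

```python
def p95_nearest_rank(durations_ms: list[float]) -> float:
    """95th percentile by the nearest-rank method (an assumed choice)."""
    if not durations_ms:
        raise ValueError("durations_ms must not be empty")
    ordered = sorted(durations_ms)
    n = len(ordered)
    # ceil(0.95 * n) using integer arithmetic to avoid float rounding.
    rank = (95 * n + 99) // 100
    return ordered[rank - 1]


# Hypothetical durations: 20 tasks taking 1 s, 2 s, ..., 20 s.
durations = list(range(1000, 21000, 1000))
print(p95_nearest_rank(durations))  # 19000
```

Interpolating methods give slightly different values on small samples, so compare p95 figures only when they were computed the same way.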

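| # | Tier | Model | Family | Model ID | Solve AUC@2 | CI | Avg cost / task | Latency p95 |
|---|------|-------|--------|----------|-------------|----|-----------------|-------------|
| 1 | 1 | Claude Fable 5 | claude | claude-fable-5 | 100.0 | ±16.2 | $0.20 | 41.6s |
| 2 | 1 | GPT-5.5 | gpt | gpt-5.5 | 100.0 | ±16.2 | $0.25 | 139.6s |
| 3 | 1 | Gemini 3.1 Pro Preview (66K) | gemini | gemini-3.1-pro-preview | 93.8 | ±16.2 | $0.01* | 526.5s |
| 4 | 1 | Claude Opus 4.7 | claude | claude-opus-4-7 | 87.5 | ±16.2 | $0.11 | 24.9s |
| 5 | 1 | Claude Opus 4.8 | claude | claude-opus-4-8 | 81.3 | ±22.4 | $0.12 | 22.4s |
| 6 | 1 | Claude Opus 4.6 | claude | claude-opus-4-6 | 81.3 | ±22.4 | $0.05 | 27.4s |
| 7 | 2 | Gemini 3.5 Flash | gemini | gemini-3.5-flash | 81.3 | ±16.2 | $0.03* | 436.8s |
| 8 | 2 | Claude Sonnet 4 6 | claude | claude-sonnet-4-6 | 81.3 | ±16.2 | $0.08 | 27.4s |
| 9 | 3 | Claude Haiku 4 5 20251001 | claude | claude-haiku-4-5-20251001 | 18.8 | ±27.9 | $0.02 | 16.0s |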