serialization · Categories

Rankings

Leaderboard
Column definitions

Solve AUC@2: Attempt-adjusted solve rate: first-try solve = 1.0, second-attempt-only = 0.5. Formula: `(pass_at_1 + pass_at_n) / 2 = (2·tasks_passed_attempt_1 + tasks_passed_attempt_2_only) / (2·task_set_size)`. Primary ranking metric. Rewards first-try correctness over fail-then-repair without ignoring the two-attempt protocol. De-saturates the headline that pass_at_n compresses. Significance via paired bootstrap (tier bands), not Wilson.

Avg attempt score: Mean per-attempt score on a 0–100 point scale (partial credit). Drill-down only. Formula: `Mean of attempt scores across all results rows: SUM(score) / COUNT() over the results table. Each attempt earns 0–100 points based on compile + test outcomes.` Drill-down companion to pass_at_n. Rewards partial credit but is not directly comparable to pass rate; use for within-model analysis.

Pass rate: Tasks solved / tasks in scope, up to 2 attempts (strict per-set denominator). Formula: `(tasks_passed_attempt_1 + tasks_passed_attempt_2_only) / task_set_size`. Includes unattempted tasks as failures. Scope-aware; reflects active filters (set, category, difficulty). Final assisted solve rate with up to 2 attempts; drill-down companion to Solve AUC@2. Reference: HumanEval paper (Chen et al., 2021).

First-try pass rate: Tasks solved on the first attempt / tasks in scope (strict). Formula: `tasks_passed_attempt_1 / task_set_size`. Measures single-shot accuracy without retry credit. Useful when comparing models where the second attempt is not available.

Repair (Repair rate): Share of first-attempt failures the model fixed on attempt 2. Formula: `(pass_at_n − pass_at_1) / (1 − pass_at_1); defined as 0 when pass_at_1 = 1.` Conditional recovery skill: high = good at reading compiler/test errors and patching. Profile column, not a ranking metric.

Confidence ± (Pass Rate 95% CI): 95% Wilson confidence interval on the pass rate. Formula: `Wilson score interval: center ± half-width, where n = strict denominator (task_set_size or category/difficulty-scoped count when taskSetHash is provided; falls back to tasks_attempted_distinct for legacy callers).` Use to judge whether a lead over another model is statistically meaningful. Wide CIs indicate too few tasks to draw firm conclusions. Reference: Wilson score interval (Wikipedia).

Avg cost / task: Average LLM cost per distinct benchmark task in USD. Formula: `SUM(cost_usd) / COUNT(DISTINCT task_id) across all the model's results in scope.` Use to compare operating cost across models with similar pass rates. Does not account for quality. Combine with $/Pass for a cost-efficiency view.

$/Pass: Total cost divided by number of distinct tasks passed. Lower is better. Formula: `SUM(cost_usd) / tasks_passed_distinct across all runs.` Best single cost-efficiency metric. Penalises expensive models that pass few tasks and rewards cheap models with high pass rates.

Latency p95: 95th-percentile per-task wall time. Captures tail latency. Formula: `95th percentile of per-task duration_ms across all tasks in all runs.` Use p95 to understand worst-case latency. A low p95 means the model rarely stalls, relevant for automated pipelines with timeouts.

#	Model	Solve AUC@2	Avg attempt score	Pass rate	First-try pass rate	Repair	Confidence ±	Avg cost / task	$/Pass	Latency p95	Last seen
1	Claude Opus 4.8 claude claude-opus-4-8	100.0	100.0 / 100	100.0	1/1	0.0%	±39.7%	$0.10	$0.0962	18.1s	17h ago
Tier 2
2	Gemini 3.5 Flash gemini gemini-3.5-flash	100.0	85.7 / 100	100.0	1/1	0.0%	±39.7%	$0.03	$0.0293	81.4s	17h ago
3	Gemini 3.1 Pro Preview gemini gemini-3.1-pro-preview (66K)	100.0	100.0 / 100	100.0	1/1	0.0%	±39.7%	$0.01	$0.0145	76.0s	17h ago
Tier 3
4	Claude Haiku 4 5 20251001 claude claude-haiku-4-5-20251001	100.0	100.0 / 100	100.0	1/1	0.0%	±39.7%	$0.007	$0.0068	14.9s	1d ago
5	Claude Sonnet 4 6 claude claude-sonnet-4-6	100.0	100.0 / 100	100.0	1/1	0.0%	±39.7%	$0.04	$0.0408	19.8s	1d ago
6	GPT-5.5 gpt gpt-5.5	100.0	100.0 / 100	100.0	1/1	0.0%	±39.7%	$0.11	$0.1096	33.7s	14d ago
7	Claude Opus 4.6 claude claude-opus-4-6	100.0	100.0 / 100	100.0	1/1	0.0%	±39.7%	$0.03	$0.0341	21.8s	14d ago
8	Claude Opus 4.7 claude claude-opus-4-7	100.0	100.0 / 100	100.0	1/1	0.0%	±39.7%	$0.10	$0.0968	20.4s	17h ago
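
The column formulas for Solve AUC@2, pass rate, first-try pass rate, repair rate, and the Wilson interval can be reproduced with a short script. The following is a minimal sketch, not the leaderboard's own implementation: the function names `solve_metrics` and `wilson_interval` are hypothetical, the inputs are plain task counts, and the 95% interval assumes the usual normal quantile z ≈ 1.96. Results are returned as fractions, whereas the table above displays them as percentages.

```python
import math


def solve_metrics(first_try: int, second_only: int, task_set_size: int) -> dict:
    """Compute the solve-rate columns from task counts (hypothetical helper).

    first_try      -- tasks passed on attempt 1
    second_only    -- tasks passed only on attempt 2
    task_set_size  -- strict denominator (unattempted tasks count as failures)
    """
    if task_set_size <= 0:
        raise ValueError("task_set_size must be positive")
    pass_at_1 = first_try / task_set_size
    pass_at_n = (first_try + second_only) / task_set_size
    # Attempt-adjusted solve rate: first-try solve = 1.0, second-attempt-only = 0.5.
    solve_auc_2 = (pass_at_1 + pass_at_n) / 2
    # Repair rate is defined as 0 when every task passed on the first try.
    if first_try == task_set_size:
        repair = 0.0
    else:
        repair = (pass_at_n - pass_at_1) / (1 - pass_at_1)
    return {
        "solve_auc_2": solve_auc_2,
        "pass_rate": pass_at_n,
        "first_try_pass_rate": pass_at_1,
        "repair_rate": repair,
    }


def wilson_interval(passed: int, n: int, z: float = 1.959964) -> tuple:
    """Return (center, half_width) of the Wilson score interval.

    z defaults to the 97.5th percentile of the standard normal (assumed
    here for a two-sided 95% interval).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = passed / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center, half_width


if __name__ == "__main__":
    # One task in scope, solved on the first try, as in every row above.
    print(solve_metrics(first_try=1, second_only=0, task_set_size=1))
    center, half = wilson_interval(passed=1, n=1)
    print(f"±{half * 100:.1f}%")
```

With one task in scope and one first-try pass, the half-width comes out at about 39.7%, which agrees with the ± column in every row. An interval that wide on a single task means the rows are not statistically distinguishable by pass rate alone.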