serialization
Rankings
| # | Model | Solve AUC@2 | Avg attempt score | Pass rate | First-try pass rate | Repair rate | Pass rate 95% CI | Avg cost / task | $/Pass | Latency p95 | Last seen |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | | 100.0 | 100.0 / 100 | 100.0 | 1/1 | 0.0% | ±39.7% | $0.10 | $0.0962 | 18.1s | 17h ago |
| Tier 2 | | | | | | | | | | | |
| 2 | | 100.0 | 85.7 / 100 | 100.0 | 1/1 | 0.0% | ±39.7% | $0.03 | $0.0293 | 81.4s | 17h ago |
| 3 | (66K) | 100.0 | 100.0 / 100 | 100.0 | 1/1 | 0.0% | ±39.7% | $0.01 | $0.0145 | 76.0s | 17h ago |
| Tier 3 | | | | | | | | | | | |
| 4 | | 100.0 | 100.0 / 100 | 100.0 | 1/1 | 0.0% | ±39.7% | $0.007 | $0.0068 | 14.9s | 1d ago |
| 5 | | 100.0 | 100.0 / 100 | 100.0 | 1/1 | 0.0% | ±39.7% | $0.04 | $0.0408 | 19.8s | 1d ago |
| 6 | | 100.0 | 100.0 / 100 | 100.0 | 1/1 | 0.0% | ±39.7% | $0.11 | $0.1096 | 33.7s | 14d ago |
| 7 | | 100.0 | 100.0 / 100 | 100.0 | 1/1 | 0.0% | ±39.7% | $0.03 | $0.0341 | 21.8s | 14d ago |
| 8 | | 100.0 | 100.0 / 100 | 100.0 | 1/1 | 0.0% | ±39.7% | $0.10 | $0.0968 | 20.4s | 17h ago |

Column definitions:

| Column | Definition | Usage notes |
|---|---|---|
| Solve AUC@2 | Attempt-adjusted solve rate: a first-try solve scores 1.0, a solve on the second attempt only scores 0.5. | Primary ranking metric. Rewards first-try correctness over fail-then-repair without ignoring the two-attempt protocol. De-saturates the headline that pass_at_n compresses. Significance is assessed via paired bootstrap (tier bands), not Wilson intervals. |
| Avg attempt score | Mean per-attempt score on a 0–100 point scale (partial credit). | Drill-down companion to pass_at_n. Rewards partial credit but is not directly comparable to pass rate; use for within-model analysis. |
| Pass rate | Tasks solved / tasks in scope, with up to 2 attempts (strict per-set denominator). | Counts unattempted tasks as failures. Scope-aware: reflects active filters (set, category, difficulty). Final assisted solve rate with up to 2 attempts; drill-down companion to Solve AUC@2. |
| First-try pass rate | Tasks solved on the first attempt / tasks in scope (strict). | Measures single-shot accuracy without retry credit. Useful when comparing models where the second attempt is not available. |
| Repair rate | Share of first-attempt failures the model fixed on attempt 2. | Conditional recovery skill: a high value means the model is good at reading compiler/test errors and patching. Profile column, not a ranking metric. |
| Pass rate 95% CI | 95% Wilson confidence interval on the pass rate. | Use to judge whether a lead over another model is statistically meaningful. Wide CIs indicate too few tasks to draw firm conclusions. |
| Avg cost / task | Average LLM cost per distinct benchmark task, in USD. | Use to compare operating cost across models with similar pass rates. Does not account for quality. Combine with $/Pass for a cost-efficiency view. |
| $/Pass | Total cost divided by the number of distinct tasks passed. Lower is better. | Best single cost-efficiency metric. Penalises expensive models that pass few tasks and rewards cheap models with high pass rates. |
| Latency p95 | 95th-percentile per-task wall time; captures tail latency. | Use p95 to understand worst-case latency. A low p95 means the model rarely stalls, which matters for automated pipelines with timeouts. |
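The column definitions can be made concrete with a small sketch. The page does not show its formulas, so everything below is an assumption reconstructed from the wording of the definitions: the `TaskResult` record and all function names are hypothetical, rates are scaled to 0–100 to match the table, unattempted tasks are treated as first-attempt failures, and the repair rate is taken to be 0 when there are no first-attempt failures (consistent with the `0.0%` shown next to `1/1`). The interval uses the standard Wilson score formula with z = 1.96.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class TaskResult:
    """Hypothetical per-task record; field names are illustrative only."""
    solved_on_attempt: Optional[int]  # 1, 2, or None (unsolved or unattempted)
    cost_usd: float


def solve_auc_at_2(results: List[TaskResult]) -> float:
    # First-try solve earns 1.0, second-attempt-only solve earns 0.5, else 0.
    credit = {1: 1.0, 2: 0.5}
    total = sum(credit.get(r.solved_on_attempt, 0.0) for r in results)
    return 100.0 * total / len(results)


def pass_rate(results: List[TaskResult]) -> float:
    # Strict denominator: every task in scope counts, solved or not.
    solved = sum(r.solved_on_attempt in (1, 2) for r in results)
    return 100.0 * solved / len(results)


def first_try_pass_rate(results: List[TaskResult]) -> float:
    solved = sum(r.solved_on_attempt == 1 for r in results)
    return 100.0 * solved / len(results)


def repair_rate(results: List[TaskResult]) -> float:
    # Conditional on failing attempt 1 (assumption: unattempted tasks count as failures).
    failures = [r for r in results if r.solved_on_attempt != 1]
    if not failures:
        return 0.0  # assumed convention when there is nothing to repair
    return 100.0 * sum(r.solved_on_attempt == 2 for r in failures) / len(failures)


def wilson_interval(passes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    # Standard Wilson score interval for a binomial proportion.
    p = passes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def cost_per_pass(results: List[TaskResult]) -> Optional[float]:
    # Total spend divided by tasks passed; undefined when nothing passed.
    passes = sum(r.solved_on_attempt in (1, 2) for r in results)
    if passes == 0:
        return None
    return sum(r.cost_usd for r in results) / passes


if __name__ == "__main__":
    single = [TaskResult(solved_on_attempt=1, cost_usd=0.0962)]
    low, high = wilson_interval(passes=1, n=1)
    print(solve_auc_at_2(single), pass_rate(single), repair_rate(single))
    print(f"Wilson half-width: ±{100 * (high - low) / 2:.1f}%")
```

With a single task solved on the first try, the Wilson half-width comes out at about 39.7 percentage points, which matches the `±39.7%` shown on every row. With only one task in scope, the pass-rate intervals of all eight entries overlap almost entirely, so pass rate alone cannot separate them.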