CentralGauge

Benchmark for LLMs on Microsoft Dynamics 365 Business Central AL code.

110 tasks · 2 attempts/model · 95% paired-bootstrap CI · Solve AUC@2 = (pass@1 + solve@2) / 2
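As a minimal sketch of the Solve AUC@2 formula above: the function name and the per-task encoding (1 = solved on the first attempt, 2 = solved on the retry, None = never solved) are illustrative assumptions, not CentralGauge's actual data model.

```python
from typing import Optional, Sequence


def solve_auc_at_2(solved_on: Sequence[Optional[int]]) -> float:
    """Return (pass@1 + solve@2) / 2 for one model.

    solved_on[i] is a hypothetical per-task encoding:
    1 = solved on the first attempt, 2 = solved on the retry, None = never solved.
    """
    n = len(solved_on)
    if n == 0:
        raise ValueError("need at least one task")
    # pass@1: share of tasks solved on the first attempt
    pass_at_1 = sum(1 for a in solved_on if a == 1) / n
    # solve@2: share of tasks solved within two attempts
    solve_at_2 = sum(1 for a in solved_on if a in (1, 2)) / n
    return (pass_at_1 + solve_at_2) / 2


# Made-up example: 10 tasks, 6 first-try solves, 2 retry solves, 2 failures.
outcomes = [1] * 6 + [2] * 2 + [None] * 2
print(f"{solve_auc_at_2(outcomes):.1%}")  # prints 70.0%
```

Here pass@1 is 60% and solve@2 is 80%, so the score is 70%: each first-try solve earns full credit and each retry solve earns half.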

Best overall

Claude Fable 5 claude claude-fable-5

87.3 AUC

Tier 1 · tied with Gemini 3.1 Pro Preview

Best value · AUC ≥ 75

Claude Opus 4.6 claude claude-opus-4-6

79.1 AUC · $0.13/solved

Fastest ≥ 75 AUC

Claude Opus 4.8 claude claude-opus-4-8

p95 150.1s · 81.4 AUC

Best open-weight

—

Leaderboard
Solve AUC@2: Skill score: full credit for a first-try solve, half for a retry solve. Not the solve rate. Formula: `AUC@2 = (pass@1 + solve@2) / 2`. Use as the headline ranking metric. Rewards first-try correctness over fail-then-repair without ignoring the two-attempt protocol. De-saturates the headline that pass_at_n compresses. Significance via paired bootstrap (tier bands), not Wilson.

Pass Rate 95% CI: 95% Wilson confidence interval on the pass rate. Formula: `Wilson score interval: center ± half-width, where n = strict denominator (task_set_size or category/difficulty-scoped count when taskSetHash is provided; falls back to tasks_attempted_distinct for legacy callers).` Use to judge whether a lead over another model is statistically meaningful. Wide CIs indicate too few tasks to draw firm conclusions.

Avg cost / task: Average LLM cost per distinct benchmark task in USD. Formula: `SUM(cost_usd) / COUNT(DISTINCT task_id) across all the model's results in scope.` Use to compare operating cost across models with similar pass rates. Does not account for quality. Combine with $/Pass for a cost-efficiency view.

Latency p95: 95th-percentile per-task wall time. Captures tail latency. Formula: `95th percentile of per-task duration_ms across all tasks in all runs.` Use p95 to understand worst-case latency. A low p95 means the model rarely stalls, which is relevant for automated pipelines with timeouts.

#	Model	Solve AUC@2	Pass Rate 95% CI	Avg cost / task	Latency p95
Tier 1
1	Claude Fable 5 claude claude-fable-5	87.3	±5.0	$0.34	175.9s
2	Gemini 3.1 Pro Preview gemini gemini-3.1-pro-preview (66K)	83.6	±5.2	$0.03*	570.9s
Tier 2
3	Claude Opus 4.8 claude claude-opus-4-8	81.4	±6.1	$0.23	150.1s
4	Claude Opus 4.7 claude claude-opus-4-7	80.5	±6.1	$0.22	157.5s
5	Claude Opus 4.6 claude claude-opus-4-6	79.1	±6.3	$0.11	166.6s
6	Gemini 3.5 Flash gemini gemini-3.5-flash	78.6	±6.1	$0.06*	363.0s
7	GPT-5.5 gpt gpt-5.5	76.4	±6.7	$0.38	221.4s
Tier 3
8	Claude Sonnet 4.6 claude claude-sonnet-4-6	75.0	±6.3	$0.17	165.5s
Tier 4
9	Claude Haiku 4.5 (20251001) claude claude-haiku-4-5-20251001	53.6	±8.8	$0.02	140.2s

Showing 9 of 9
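The Pass Rate 95% CI column is described as a Wilson score interval. The following is a minimal sketch of the standard Wilson formula, assuming z = 1.96 for a 95% interval; the function name and arguments are illustrative, and this is not claimed to be CentralGauge's own implementation or to reproduce the ± values in the table.

```python
import math
from typing import Tuple


def wilson_interval(passes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval (lower, upper) for a pass rate of passes / n.

    n is the denominator (for example, the task-set size); z = 1.96 is the
    normal quantile assumed here for a two-sided 95% interval.
    """
    if n <= 0 or not 0 <= passes <= n:
        raise ValueError("require n > 0 and 0 <= passes <= n")
    p = passes / n
    denom = 1 + z * z / n
    # The centre is pulled from p toward 0.5, more strongly for small n.
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Made-up example: 55 passes out of 110 tasks.
low, high = wilson_interval(55, 110)
print(f"{low:.3f} to {high:.3f}")
```

With 110 tasks and a pass rate near 50%, the half-width comes out at roughly 9 percentage points, which illustrates why small leads between neighbouring models may not be statistically meaningful.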