About CentralGauge

CentralGauge is an open-source benchmark for evaluating large language models on AL code generation, debugging, and refactoring for Microsoft Dynamics 365 Business Central.

Status

The site is in beta (P5). Detailed methodology, scoring formulas, tier definitions, and transparency documentation will land before public launch.

Scoring metrics

CentralGauge surfaces two distinct metrics. They measure different things and may diverge for the same model.

avg_score (per-attempt)

The leaderboard's Score column is the mean score over every attempt row in the results table (each task contributes 2 rows: attempt 1 and attempt 2). This captures partial credit: a task scoring 0.5 on attempt 1 and 1.0 on attempt 2 contributes 0.75 to avg_score.
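
As a minimal sketch of that computation (the row shape here is an assumption, not the actual results schema):

  // Hypothetical attempt-row shape; the real results schema may differ.
  interface AttemptRow { task_id: string; attempt: number; score: number }

  // avg_score: mean score over every attempt row, so partial credit counts.
  function avgScore(rows: AttemptRow[]): number {
    if (rows.length === 0) return 0;
    return rows.reduce((sum, r) => sum + r.score, 0) / rows.length;
  }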

pass_at_n (per-task, "best across runs")

The Pass@N metric is the fraction of distinct tasks the model eventually solved (in any attempt, in any run). With multi-run data, the rule is "best across runs per task":

  • Pass@1: distinct tasks where SOME run had attempt-1 succeed.
  • Pass@2-only: distinct tasks where SOME run had attempt-2 succeed AND no run had attempt-1 succeed.
  • Pass@N = (Pass@1 + Pass@2-only) / tasks_attempted_distinct.

Concrete example: a model runs T1 twice. Run 1 succeeds on attempt 1; Run 2 succeeds only on attempt 2. T1 counts toward Pass@1 (the model demonstrated first-try capability somewhere), NOT Pass@2-only. The invariant Pass@1 + Pass@2-only ≤ tasks_attempted_distinct always holds, with no double-counting across runs.
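
A sketch of the "best across runs" rule, under an assumed row shape (run_id, task_id, attempt, passed):

  // One row per (run, task, attempt); `passed` marks a successful attempt.
  interface RunAttempt { run_id: string; task_id: string; attempt: 1 | 2; passed: boolean }

  function passAtN(rows: RunAttempt[]) {
    const tasks = new Set(rows.map(r => r.task_id));   // tasks_attempted_distinct
    if (tasks.size === 0) return { passAt1: 0, passAt2Only: 0, passAtN: 0 };
    let passAt1 = 0, passAt2Only = 0;
    for (const task of tasks) {
      const taskRows = rows.filter(r => r.task_id === task);
      const attempt1Ok = taskRows.some(r => r.attempt === 1 && r.passed);  // SOME run, attempt 1
      const attempt2Ok = taskRows.some(r => r.attempt === 2 && r.passed);  // SOME run, attempt 2
      if (attempt1Ok) passAt1++;             // counts toward Pass@1 only
      else if (attempt2Ok) passAt2Only++;    // retry-recovery, no double-counting
    }
    return { passAt1, passAt2Only, passAtN: (passAt1 + passAt2Only) / tasks.size };
  }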

tasks_attempted vs tasks_attempted_distinct

The API exposes both: tasks_attempted (per-attempt; COUNT(*) over rows in results) and tasks_attempted_distinct (per-task; COUNT(DISTINCT task_id)). Pass@N's denominator is tasks_attempted_distinct; tasks_attempted is preserved for backward compatibility with consumers built before the per-task split. The two can differ: for a model with 4 tasks attempted twice each, tasks_attempted is 8 and tasks_attempted_distinct is 4.
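
In code, against the same assumed attempt rows, the two counts are simply:

  // tasks_attempted counts rows; tasks_attempted_distinct counts unique tasks.
  function taskCounts(rows: { task_id: string }[]) {
    return {
      tasks_attempted: rows.length,                                      // COUNT(*)
      tasks_attempted_distinct: new Set(rows.map(r => r.task_id)).size,  // COUNT(DISTINCT task_id)
    };
  }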

Why both?

avg_score rewards models that get close on tricky tasks. pass_at_n rewards models that actually finish them. The leaderboard sort toggle (?sort=avg_score, ?sort=pass_at_n, ?sort=pass_at_1) lets you rank by whichever matters for your use case.

The Pass@1 / Pass@2 stacked bar on each leaderboard row visualizes the per-task breakdown: green for first-try success, amber for retry-recovery, red for unsolved.

Metrics glossary

All metrics shown on the leaderboard and model detail pages are defined here. Each definition includes the formula used to compute it and guidance on when each metric is most useful.

Pass Rate

Fraction of distinct tasks solved in any attempt across all runs.

Formula: (tasks_passed_attempt_1 + tasks_passed_attempt_2_only) / tasks_attempted_distinct

Primary ranking metric. Compare models here first: it directly measures how often the model delivers working code.

Pass Rate 95% CI

95% Wilson confidence interval on the pass rate.

Formula: Wilson score interval at 95% (z ≈ 1.96), reported as center ± half-width, with n = tasks_attempted_distinct.

Use to judge whether a lead over another model is statistically meaningful. Wide CIs indicate too few tasks to draw firm conclusions.
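
A sketch of the interval itself, using the standard Wilson score formula with z = 1.96 for 95% (the site's exact rounding and clamping may differ):

  // Wilson score interval for p̂ = passes / n at z = 1.96 (95%).
  function wilsonInterval(passes: number, n: number, z = 1.96): [number, number] {
    if (n === 0) return [0, 0];
    const p = passes / n;
    const denom = 1 + (z * z) / n;
    const center = (p + (z * z) / (2 * n)) / denom;
    const half = (z / denom) * Math.sqrt((p * (1 - p)) / n + (z * z) / (4 * n * n));
    return [Math.max(0, center - half), Math.min(1, center + half)];
  }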

pass^n (strict)

Fraction of tasks the model solved in every single run (strict consistency).

Formula: tasks where ALL runs produced a passing result / tasks_attempted_distinct

Measures reliability under repetition. High pass^n means the model is unlikely to regress on a re-run, which matters for CI and production use.
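
A sketch, assuming one pass/fail outcome per (run, task) pair (illustrative names, not the actual schema):

  // pass^n: fraction of distinct tasks that passed in EVERY run that attempted them.
  function passStrict(outcomes: { run_id: string; task_id: string; passed: boolean }[]): number {
    const byTask = new Map<string, boolean[]>();
    for (const o of outcomes) {
      const list = byTask.get(o.task_id) ?? [];
      list.push(o.passed);
      byTask.set(o.task_id, list);
    }
    if (byTask.size === 0) return 0;
    let strict = 0;
    for (const runs of byTask.values()) {
      if (runs.every(Boolean)) strict++;
    }
    return strict / byTask.size;
  }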

Score

Average score per attempt row (0–1). Rewards partial credit.

Formula: Mean of all attempt scores across all results rows: SUM(score) / COUNT(*) over the results table.

Ranks models that make consistent partial progress on hard tasks. A model that scores 0.5 on every task matches one that passes half the tasks and scores zero on the rest, even though their pass rates differ sharply.

Cost / run

Average total LLM cost per benchmark run in USD.

Formula: SUM(cost_usd) / run_count across all runs for this model.

Use to compare operating cost across models with similar pass rates. Does not account for quality. Combine with $/Pass for a cost-efficiency view.

$/Pass

Total cost divided by number of distinct tasks passed. Lower is better.

Formula: SUM(cost_usd) / tasks_passed_distinct across all runs.

Best single cost-efficiency metric. Penalises expensive models that pass few tasks and rewards cheap models with high pass rates.

Latency p50

Median per-task wall time (LLM call + compile + test), in milliseconds.

Formula: 50th percentile of per-task duration_ms (LLM latency + compile time + test time).

Use p50 for a typical-case latency expectation. Unaffected by outlier slow tasks.

Latency p95

95th-percentile per-task wall time. Captures tail latency.

Formula: 95th percentile of per-task duration_ms across all tasks in all runs.

Use p95 to understand worst-case latency. A low p95 means the model rarely stalls, which is relevant for automated pipelines with timeouts.
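
Both latency figures are percentiles over the same per-task duration_ms values; a nearest-rank sketch (the production aggregation may interpolate differently):

  // Nearest-rank percentile over per-task durations in milliseconds.
  function percentileMs(durationsMs: number[], p: number): number {
    if (durationsMs.length === 0) return 0;
    const sorted = [...durationsMs].sort((a, b) => a - b);
    const rank = Math.ceil((p / 100) * sorted.length) - 1;
    return sorted[Math.max(0, rank)];
  }

  // p50 = percentileMs(durations, 50); p95 = percentileMs(durations, 95)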

Consistency

Percentage of tasks with the same outcome (all pass or all fail) across every run.

Formula: tasks where all runs agree (all pass OR all fail) / tasks_attempted_distinct × 100.

High consistency means the model behaves predictably. Low consistency flags flaky tasks or a model sensitive to stochastic generation noise.
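
This groups outcomes per task the same way pass^n does, but counts agreement in either direction; a sketch with the same assumed shape:

  // Consistency: % of tasks whose runs all agree (all pass OR all fail).
  function consistencyPct(outcomes: { task_id: string; passed: boolean }[]): number {
    const byTask = new Map<string, boolean[]>();
    for (const o of outcomes) {
      const list = byTask.get(o.task_id) ?? [];
      list.push(o.passed);
      byTask.set(o.task_id, list);
    }
    if (byTask.size === 0) return 0;
    let agree = 0;
    for (const runs of byTask.values()) {
      if (runs.every(p => p) || runs.every(p => !p)) agree++;
    }
    return (agree / byTask.size) * 100;
  }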

Tasks attempted

Count of distinct tasks the model has attempted at least once.

Formula: COUNT(DISTINCT task_id) over the results table for this model.

Denominator for pass_at_n and consistency. A model with fewer tasks attempted has a narrower coverage sample.

Tasks passed

Distinct tasks solved in any attempt across all runs.

Formula: COUNT(DISTINCT task_id) where best outcome = pass.

Absolute count version of pass_at_n. Useful when comparing models that have attempted different task counts.

Runs

Total number of benchmark runs recorded for this model.

Formula: COUNT(DISTINCT run_id) for this model.

More runs = more data, tighter confidence intervals, and more reliable pass^n / consistency metrics.

Verified runs

Runs signed by the operator and verified by an independent verifier machine.

Formula: COUNT of runs where the Ed25519 signature was verified by the worker at ingest.

Verified runs have a stronger integrity guarantee (Ed25519 signature verified at ingest).

Transparency

Every benchmark run is signed with an Ed25519 keypair held by the operator's machine and verified by the worker before ingest. The signed payload, public key, and verification record are linked from each run's detail page.
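
A minimal sketch of that verification step using Node's built-in crypto; the function name, payload framing, and key encoding here are assumptions, not the worker's actual code:

  import { createPublicKey, verify } from "node:crypto";

  // True if `signatureB64` is a valid Ed25519 signature over `payload`
  // for the submitted public key (PEM-encoded here for illustration).
  function verifyRunSignature(payload: Buffer, signatureB64: string, publicKeyPem: string): boolean {
    const key = createPublicKey(publicKeyPem);
    // For Ed25519, Node's verify() takes null as the digest algorithm.
    return verify(null, payload, key, Buffer.from(signatureB64, "base64"));
  }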

Source code is available on GitHub.