About CentralGauge

CentralGauge is an open-source benchmark for evaluating large language models on AL code generation, debugging, and refactoring for Microsoft Dynamics 365 Business Central.

Status

The site is in beta (P5). Detailed methodology, scoring formulas, tier definitions, and transparency documentation will land before public launch.

Units & conventions

Every metric on this site falls into one of six unit categories. The glossary entry for each metric tells you which category it belongs to; the table below tells you how to read the displayed value.

| Unit | Stored as | Shown as | Examples |
|---|---|---|---|
| rate | Fraction in [0, 1] | 78.1% (×100, with percent sign) | Pass rate, First-try pass rate, All-runs pass rate |
| pct | Already on a 0–100 scale | 73.4% (no scaling) | Consistency |
| score | Points in [0, 100] (partial credit allowed) | 71.0 / 100 | Avg attempt score |
| usd | US dollars | $0.12 | Avg cost / task, Cost / pass |
| count | Integer | 12,345 (locale grouping) | Tasks attempted, Runs |
| duration_ms | Milliseconds | 1.5s / 2m 48s | Latency p50, Latency p95 |

Rates vs. scores. A rate is a fraction (0–1) of tasks the model solved; we show it as a percent. A score is a per-attempt grade out of 100 that can include partial credit (for example, an attempt that compiled but failed half the tests). We render scores as X.X / 100 so they cannot be confused with a percent.
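As a rough illustration of the display rules above, the sketch below maps a stored value to its shown form for each of the six units. The function name `format_metric`, the fixed en-US style digit grouping, and the 60-second cutoff between the `1.5s` and `2m 48s` forms are assumptions made for this example, not the site's actual rendering code.

```python
def format_metric(value: float, unit: str) -> str:
    """Render a stored metric value according to its unit category."""
    if unit == "rate":
        # Stored as a fraction in [0, 1]; scale to a percent.
        return f"{value * 100:.1f}%"
    if unit == "pct":
        # Already on a 0-100 scale; no scaling.
        return f"{value:.1f}%"
    if unit == "score":
        # Shown as X.X / 100 so it cannot be mistaken for a percent.
        return f"{value:.1f} / 100"
    if unit == "usd":
        return f"${value:.2f}"
    if unit == "count":
        # Assumption: comma grouping stands in for locale grouping.
        return f"{int(value):,}"
    if unit == "duration_ms":
        # Assumption: below one minute show seconds, otherwise minutes + seconds.
        if value < 60_000:
            return f"{value / 1000:.1f}s"
        minutes, seconds = divmod(round(value / 1000), 60)
        return f"{minutes}m {seconds}s"
    raise ValueError(f"unknown unit: {unit}")


if __name__ == "__main__":
    for stored, unit in [(0.781, "rate"), (73.4, "pct"), (71.0, "score"),
                         (0.12, "usd"), (12345, "count"), (168_000, "duration_ms")]:
        print(unit, "->", format_metric(stored, unit))
```

Note that `rate` and `pct` produce the same visual form; only the stored scale differs, which is why the unit category has to be known before formatting.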

Scoring metrics

CentralGauge surfaces three metrics. They measure different things and may diverge for the same model.

Pass rate (pass_at_n)

The fraction of tasks in scope the model eventually solves, with up to 2 attempts. Strict denominator: tasks the model did not attempt count as failures.

Worked example. The task set has 50 tasks. A model attempts 4 tasks and passes all 4, so its pass rate is 4/50 = 8%, not 4/4 = 100%. The strict per-set denominator punishes incomplete coverage so a partial-bench run cannot look better than a thorough one.
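The worked example can be reproduced with a few lines of Python. This is a minimal sketch: the function name and the input shape (a dict from task id to a list of per-attempt pass flags) are hypothetical and chosen only for illustration.

```python
def strict_pass_rates(task_set_size: int, attempts_by_task: dict) -> dict:
    """Compute strict pass rates.

    attempts_by_task maps task_id -> list of booleans, one per attempt
    (at most 2). Tasks missing from the dict were never attempted and
    count as failures because the denominator is the full task set.
    """
    first_try = sum(1 for a in attempts_by_task.values() if a and a[0])
    retry_only = sum(
        1 for a in attempts_by_task.values()
        if len(a) > 1 and not a[0] and a[1]
    )
    return {
        "pass_at_1": first_try / task_set_size,
        "pass_at_n": (first_try + retry_only) / task_set_size,
        "unsolved": task_set_size - first_try - retry_only,
    }


if __name__ == "__main__":
    # 50-task set; the model attempted 4 tasks and passed all 4 first try.
    attempted = {f"task-{i}": [True] for i in range(4)}
    print(strict_pass_rates(50, attempted))
```

With these inputs `pass_at_n` is 0.08 (8%), and the 46 unattempted tasks land in the `unsolved` count.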

The Pass@1 / Pass@2 stacked bar on each leaderboard row visualizes the per-task breakdown: green for first-try success, amber for retry-recovery, red for unsolved.

First-try pass rate (pass_at_1)

Same strict denominator; numerator counts only attempt-1 successes. Used as the leaderboard's tiebreaker when two models share the same pass_at_n.
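The tiebreaker rule amounts to a two-key sort. The sketch below assumes each leaderboard row is a dict with `pass_at_n` and `pass_at_1` fields; the model names are placeholders, not real results.

```python
def rank_models(rows: list) -> list:
    """Sort by pass_at_n descending, then by pass_at_1 descending."""
    return sorted(rows, key=lambda r: (-r["pass_at_n"], -r["pass_at_1"]))


if __name__ == "__main__":
    rows = [
        {"model": "model-a", "pass_at_n": 0.80, "pass_at_1": 0.60},
        {"model": "model-b", "pass_at_n": 0.80, "pass_at_1": 0.70},
        {"model": "model-c", "pass_at_n": 0.90, "pass_at_1": 0.50},
    ]
    for row in rank_models(rows):
        print(row["model"], row["pass_at_n"], row["pass_at_1"])
```

Here `model-c` ranks first on `pass_at_n`, and `model-b` edges out `model-a` only because of its higher first-try rate.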

Avg attempt score (avg_score)

Mean of results.score across every attempt (each task contributes up to 2 rows). Typically lower than pass_at_n expressed as a percent, because failed attempts pull the mean down. Useful as a drill-down to see partial credit, but never used as the default rank.
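To see how the average attempt score can diverge from the pass rate, consider the small sketch below. The row shape and the score of 50 for a failed first attempt are illustrative assumptions; the actual scoring formula is not documented here.

```python
def avg_attempt_score(results: list):
    """Mean of score over every attempt row; None when there are no rows."""
    scores = [row["score"] for row in results]
    return sum(scores) / len(scores) if scores else None


if __name__ == "__main__":
    results = [
        {"task_id": "t1", "attempt": 1, "score": 100.0},  # passed first try
        {"task_id": "t2", "attempt": 1, "score": 50.0},   # hypothetical partial credit
        {"task_id": "t2", "attempt": 2, "score": 100.0},  # recovered on retry
    ]
    print(round(avg_attempt_score(results), 1))
```

Both tasks are eventually solved, so on a two-task set the pass rate would be 100%, while the failed first attempt on `t2` holds the average attempt score to about 83.3 / 100.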

Set selection

The filter set=current and a specific 64-character task-set hash are both honored. set=all is rejected with HTTP 400, because strict pass rate is undefined across multiple sets with different denominators.
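The selection rule might be checked along the following lines. This is only a sketch: the function name is hypothetical, the hash is assumed to be lowercase hexadecimal, and the handling of any other value (rejected here) is an assumption the text above does not cover.

```python
import re

# Assumption: the 64-character hash is lowercase hex (for example, a SHA-256 digest).
_HASH_RE = re.compile(r"[0-9a-f]{64}")


def status_for_set_param(value: str) -> int:
    """Return the HTTP status a request with this `set` filter would get."""
    if value == "current":
        return 200
    if value == "all":
        # Strict pass rate has no single denominator across multiple sets.
        return 400
    if _HASH_RE.fullmatch(value):
        return 200
    # Assumption: anything else is treated as a bad request.
    return 400


if __name__ == "__main__":
    print(status_for_set_param("current"))
    print(status_for_set_param("all"))
    print(status_for_set_param("ab" * 32))
```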

Metrics glossary

All metrics shown on the leaderboard and model detail pages are defined here. Each definition includes the formula used to compute it and guidance on when each metric is most useful.

Pass rate (unit: rate)

Tasks solved / tasks in scope, up to 2 attempts (strict per-set denominator).

Formula: (tasks_passed_attempt_1 + tasks_passed_attempt_2_only) / task_set_size

Includes unattempted tasks as failures. Scope-aware; reflects active filters (set, category, difficulty). Primary ranking metric.

First-try pass rate (unit: rate)

Tasks solved on the first attempt / tasks in scope (strict).

Formula: tasks_passed_attempt_1 / task_set_size

Measures single-shot accuracy without retry credit. Useful when comparing models where the second attempt is not available.

Pass rate 95% CI (unit: rate)

95% Wilson confidence interval on the pass rate.

Formula: Wilson score interval: center ± half-width, where n = strict denominator (task_set_size or category/difficulty-scoped count when taskSetHash is provided; falls back to tasks_attempted_distinct for legacy callers).

Use to judge whether a lead over another model is statistically meaningful. Wide CIs indicate too few tasks to draw firm conclusions.

Run success rate (unit: rate)

Tasks the run solved on its last attempt / tasks attempted in this run.

Formula: COUNT(distinct tasks where last attempt passed) / COUNT(distinct tasks attempted in this run)

Per-run metric for the model's "final answer" on each task. Differs from leaderboard pass_at_n: this denominator is the run's own attempted-task count, not the task set size, so partial runs are not penalised for unattempted tasks.
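A minimal sketch of this per-run calculation is shown below. The tuple layout `(task_id, attempt_number, passed)` is a hypothetical stand-in for whatever the results table actually stores.

```python
def run_success_rate(rows):
    """Share of attempted tasks whose LAST attempt in this run passed.

    rows: iterable of (task_id, attempt_number, passed) for a single run.
    Returns None when the run attempted no tasks.
    """
    last = {}
    for task_id, attempt, passed in rows:
        # Keep only the highest-numbered attempt per task.
        if task_id not in last or attempt > last[task_id][0]:
            last[task_id] = (attempt, passed)
    if not last:
        return None
    return sum(1 for _, passed in last.values() if passed) / len(last)


if __name__ == "__main__":
    rows = [
        ("t1", 1, True),
        ("t2", 1, False), ("t2", 2, True),
        ("t3", 1, False), ("t3", 2, False),
    ]
    print(run_success_rate(rows))
```

The denominator here is 3 (the tasks this run touched), regardless of how many tasks the full set contains.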

All-runs pass rate (unit: rate)

Fraction of tasks the model solved in every single run (strict consistency, also written pass^n).

Formula: tasks where ALL runs produced a passing result / tasks_attempted_distinct

Measures reliability under repetition. A high value means the model is unlikely to regress on a re-run, which matters for CI and production use.

Avg attempt score (unit: score)

Mean per-attempt score on a 0–100 point scale (partial credit). Drill-down only.

Formula: Mean of attempt scores across all results rows: SUM(score) / COUNT(*) over the results table. Each attempt earns 0–100 points based on compile + test outcomes.

Drill-down companion to pass_at_n. Rewards partial credit but not directly comparable to pass rate; use for within-model analysis.

Avg cost / task (unit: usd)

Average LLM cost per distinct benchmark task in USD.

Formula: SUM(cost_usd) / COUNT(DISTINCT task_id) across all the model's results in scope.

Use to compare operating cost across models with similar pass rates. Does not account for quality. Combine with $/Pass for a cost-efficiency view.

$/Pass (unit: usd)

Total cost divided by number of distinct tasks passed. Lower is better.

Formula: SUM(cost_usd) / tasks_passed_distinct across all runs.

Best single cost-efficiency metric. Penalises expensive models that pass few tasks and rewards cheap models with high pass rates.
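The two cost formulas share a numerator and differ only in the denominator, as the sketch below illustrates. The row shape is hypothetical, and returning `None` when no task has passed is an assumption; the text does not say how $/Pass is reported in that case.

```python
def cost_metrics(results: list):
    """Return (avg_cost_per_task, cost_per_pass) in USD.

    results: list of dicts with task_id, cost_usd, and passed (bool),
    one row per attempt.
    """
    total = sum(r["cost_usd"] for r in results)
    tasks = {r["task_id"] for r in results}
    passed = {r["task_id"] for r in results if r["passed"]}
    avg_cost_per_task = total / len(tasks) if tasks else None
    # Assumption: undefined (None) when nothing passed.
    cost_per_pass = total / len(passed) if passed else None
    return avg_cost_per_task, cost_per_pass


if __name__ == "__main__":
    results = [
        {"task_id": "t1", "cost_usd": 0.10, "passed": True},
        {"task_id": "t2", "cost_usd": 0.20, "passed": False},
        {"task_id": "t2", "cost_usd": 0.30, "passed": False},
    ]
    avg_cost, per_pass = cost_metrics(results)
    print(f"${avg_cost:.2f} per task, ${per_pass:.2f} per pass")
```

In this example the failed retries on `t2` still count toward total spend, so $/Pass ($0.60) is twice the average cost per task ($0.30).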

Latency p50 (unit: duration_ms)

Median per-task wall time (LLM call + compile + test), in milliseconds.

Formula: 50th percentile of per-task duration_ms: LLM latency + compile time + test time.

Use p50 for a typical-case latency expectation. Unaffected by outlier slow tasks.

Latency p95 (unit: duration_ms)

95th-percentile per-task wall time. Captures tail latency.

Formula: 95th percentile of per-task duration_ms across all tasks in all runs.

Use p95 to understand worst-case latency. A low p95 means the model rarely stalls, relevant for automated pipelines with timeouts.
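The difference between p50 and p95 is easiest to see on a small sample with one slow outlier. The sketch below uses the nearest-rank percentile definition, which is an assumption; the site may use an interpolating method instead, and the durations are made up.

```python
import math


def percentile(values, q: float):
    """Nearest-rank percentile for q in (0, 100]."""
    if not values:
        raise ValueError("no durations recorded")
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


if __name__ == "__main__":
    # Per-task wall times in milliseconds; one task stalled for 90 seconds.
    durations_ms = [1200, 1500, 1400, 1300, 90_000]
    print("p50:", percentile(durations_ms, 50))
    print("p95:", percentile(durations_ms, 95))
```

The median stays at 1400 ms despite the stalled task, while p95 jumps to 90000 ms and exposes it.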

Consistency (unit: pct)

Percentage (0–100) of tasks with the same outcome across every run.

Formula: 100 × tasks where all runs agree (all pass OR all fail) / tasks_attempted_distinct.

High consistency means the model behaves predictably. Low consistency flags flaky tasks or a model sensitive to stochastic generation noise.
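Consistency and the all-runs pass rate defined earlier use the same denominator and differ only in whether unanimous failures count. The sketch below computes both from a hypothetical mapping of task id to per-run outcomes; it assumes every listed task has at least one recorded run.

```python
def repetition_metrics(outcomes_by_task: dict):
    """Compute all-runs pass rate (0-1) and consistency (0-100).

    outcomes_by_task maps task_id -> list of booleans, one per run
    (True means that run solved the task). Each list must be non-empty.
    """
    n = len(outcomes_by_task)
    if n == 0:
        return None
    all_pass = sum(1 for o in outcomes_by_task.values() if all(o))
    all_fail = sum(1 for o in outcomes_by_task.values() if not any(o))
    return {
        "all_runs_pass_rate": all_pass / n,
        "consistency": 100 * (all_pass + all_fail) / n,
    }


if __name__ == "__main__":
    outcomes = {
        "t1": [True, True, True],
        "t2": [False, False, False],
        "t3": [True, False, True],   # flaky: runs disagree
        "t4": [True, True, True],
    }
    print(repetition_metrics(outcomes))
```

Only `t3` breaks agreement, so consistency is 75.0 while the all-runs pass rate is 0.5: the reliably failing `t2` counts as consistent but not as passing.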

Tasks attempted (unit: count)

Count of distinct tasks the model has attempted at least once.

Formula: COUNT(DISTINCT task_id) over the results table for this model.

Coverage indicator. Strict pass_at_n still counts unattempted tasks as failures; use this to see how much of the active task set the model covered.

Tasks passed (unit: count)

Distinct tasks solved in any attempt across all runs.

Formula: COUNT(DISTINCT task_id) where best outcome = pass.

Absolute count version of pass_at_n. Useful when comparing models that have attempted different task counts.

Runs (unit: count)

Total number of benchmark runs recorded for this model.

Formula: COUNT(DISTINCT run_id) for this model.

More runs = more data, tighter confidence intervals, and more reliable pass^n / consistency metrics.

Verified runs (unit: count)

Runs signed and verified by an independent verifier machine.

Formula: COUNT of runs where the Ed25519 signature was verified by the worker at ingest.

Verified runs have a stronger integrity guarantee (Ed25519 signature verified at ingest).

Transparency

Every benchmark run is signed with an Ed25519 keypair held by the operator's machine and verified by the worker before ingest. The signed payload, public key, and verification record are linked from each run's detail page.

Source code is available on GitHub.