About CentralGauge

CentralGauge is an open-source benchmark for evaluating large language models on AL code generation, debugging, and refactoring for Microsoft Dynamics 365 Business Central.

Status

The site is in beta (P5). Detailed methodology, scoring formulas, tier definitions, and transparency documentation will land before public launch.

Units & conventions

Every metric on this site falls into one of six unit categories. The glossary entry for each metric tells you which category it belongs to; the table below tells you how to read the displayed value.

Unit        | Stored as                                    | Shown as                     | Examples
----------- | -------------------------------------------- | ---------------------------- | --------------------------------------------------
rate        | Fraction in [0, 1]                           | 78.1% (×100, then percent)   | Pass rate, First-try pass rate, All-runs pass rate
pct         | Already on a 0–100 scale                     | 73.4% (no scaling)           | Consistency
score       | Points in [0, 100] (partial credit allowed)  | 71.0 / 100                   | Avg attempt score
usd         | US dollars                                   | $0.12                        | Avg cost / task, Cost / pass
count       | Integer                                      | 12,345 (locale grouping)     | Tasks attempted, Runs
duration_ms | Milliseconds                                 | 1.5s / 2m 48s                | Latency p50, Latency p95

Rates vs. scores. A rate is a fraction (0–1) of tasks the model solved; we show it as a percent. A score is a per-attempt grade out of 100 that can include partial credit (for example, an attempt that compiled but failed half the tests). We render scores as X.X / 100 so they cannot be confused with a percent.
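As an illustration of the conventions above, here is a minimal sketch of a formatter that maps a stored value to its displayed form. The function name format_metric, the one-decimal rounding, the en-US digit grouping, and the 60-second cutoff for switching from seconds to minutes are assumptions made for this example, not the site's actual rendering code.

```python
def format_metric(value: float, unit: str) -> str:
    """Render a stored metric value according to its unit category (hypothetical helper)."""
    if unit == "rate":
        return f"{value * 100:.1f}%"        # fraction in [0, 1] -> percent
    if unit == "pct":
        return f"{value:.1f}%"              # already on a 0-100 scale, no scaling
    if unit == "score":
        return f"{value:.1f} / 100"         # never shown with a percent sign
    if unit == "usd":
        return f"${value:.2f}"
    if unit == "count":
        return f"{int(value):,}"            # en-US grouping assumed
    if unit == "duration_ms":
        seconds = value / 1000
        if seconds < 60:                     # assumed cutoff for the short form
            return f"{seconds:.1f}s"
        minutes, rest = divmod(round(seconds), 60)
        return f"{minutes}m {rest}s"
    raise ValueError(f"unknown unit: {unit}")


print(format_metric(0.781, "rate"))          # 78.1%
print(format_metric(71.0, "score"))          # 71.0 / 100
print(format_metric(168000, "duration_ms"))  # 2m 48s
```

Keeping the unit next to the value means a rate and a pct can never be scaled twice by mistake.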

Scoring metrics

CentralGauge surfaces five metrics. They measure different aspects of model performance and may diverge for the same model.

Solve AUC@2 (auc_2) — headline ranking metric

The primary ranking metric. Defined as (pass_at_1 + pass_at_n) / 2, which equals the mean per-task credit over the tasks in scope: a task solved on the first try contributes 1.0, a task solved only on the second attempt contributes 0.5, and an unsolved task contributes 0.
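The two readings of the definition, the closed-form average of the two pass rates and the per-task credit, give the same number. The sketch below checks this on a small made-up outcome list; the function names and the list-of-booleans input shape are assumptions for illustration only.

```python
def solve_auc_2(first_try: list, second_try: list) -> float:
    """Mean per-task credit: 1.0 for a first-try solve, 0.5 for a retry solve, 0 otherwise.

    An unattempted task is represented as (False, False) so it stays in the denominator.
    """
    credits = [1.0 if p1 else 0.5 if p2 else 0.0
               for p1, p2 in zip(first_try, second_try)]
    return sum(credits) / len(credits)


def auc_2_from_rates(pass_at_1: float, pass_at_n: float) -> float:
    """Closed form from the definition: (pass_at_1 + pass_at_n) / 2."""
    return (pass_at_1 + pass_at_n) / 2


# Four tasks: two solved first try, one recovered on retry, one unsolved.
first = [True, True, False, False]
second = [False, False, True, False]   # only consulted when the first try failed

print(solve_auc_2(first, second))      # 0.625
print(auc_2_from_rates(0.5, 0.75))     # 0.625
```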

AUC@2 replaced pass_at_n as the headline because, at the current task-set size, top models were compressed into overlapping confidence intervals on pass_at_n, which made their rankings statistically indistinguishable. AUC@2 de-saturates the top of the leaderboard by rewarding first-try solves more than retry recoveries.

First-try pass rate (pass_at_1)

Fraction of tasks in scope the model solves on the first attempt. Strict denominator: tasks the model did not attempt count as failures.

Best-of-2 pass rate (pass_at_n)

Fraction of tasks in scope the model eventually solves with up to 2 attempts. Same strict denominator as pass_at_1. Retained as a profile column so you can see how much a model benefits from retries.

Worked example. The task set has 50 tasks. A model attempted 4 tasks and passed all 4, so its pass rate is 4/50 = 8%, not 4/4 = 100%. The strict per-set denominator punishes incomplete coverage so a partial-bench run cannot look better than a thorough one.
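The worked example can be reproduced in a few lines. This is a sketch only; strict_pass_rate and the task IDs are hypothetical names chosen for the illustration.

```python
def strict_pass_rate(passed_task_ids: set, task_set_size: int) -> float:
    """Passed tasks over the full task-set size; unattempted tasks count as failures."""
    return len(passed_task_ids) / task_set_size


passed = {"task-01", "task-02", "task-03", "task-04"}   # hypothetical task IDs
attempted = 4

print(f"strict:  {strict_pass_rate(passed, 50):.0%}")   # strict:  8%
print(f"lenient: {len(passed) / attempted:.0%}")        # lenient: 100% (not used for ranking)
```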

The Pass@1 / Pass@2 stacked bar on each leaderboard row visualizes the per-task breakdown: green for first-try success, amber for retry-recovery, red for unsolved.

Repair rate (repair_rate)

Conditional on failing the first attempt: the fraction of first-try failures the model recovers on the second attempt. Formula: (pass_at_n - pass_at_1) / (1 - pass_at_1). Zero when the model passes everything on the first try (nothing left to repair).
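A direct transcription of the formula, including the zero case, might look like the following sketch (the function name is an assumption for this example):

```python
def repair_rate(pass_at_1: float, pass_at_n: float) -> float:
    """Share of first-try failures recovered on the second attempt."""
    if pass_at_1 >= 1.0:
        return 0.0                      # nothing failed, so nothing to repair
    return (pass_at_n - pass_at_1) / (1.0 - pass_at_1)


# Half the tasks fail on the first try; retries recover half of those failures.
print(repair_rate(0.5, 0.75))   # 0.5
print(repair_rate(1.0, 1.0))    # 0.0
```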

Avg attempt score (avg_score)

Mean of results.score across every attempt (each task contributes up to 2 rows). Usually lower than pass_at_n expressed as a percent, because failed attempts pull the mean down. Useful as a drill-down to see partial credit, but not the ranking metric.
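To see how failed attempts pull the mean down, consider the small made-up results table below. The row shape (task_id, attempt, score) is an assumption for this sketch; only the score column is named in the text.

```python
def avg_attempt_score(rows: list) -> float:
    """Mean score over every attempt row, on the 0-100 scale."""
    return sum(r["score"] for r in rows) / len(rows)


rows = [
    {"task_id": "A", "attempt": 1, "score": 100},  # solved first try, one row only
    {"task_id": "B", "attempt": 1, "score": 50},   # partial credit, then...
    {"task_id": "B", "attempt": 2, "score": 100},  # ...solved on retry
    {"task_id": "C", "attempt": 1, "score": 0},
    {"task_id": "C", "attempt": 2, "score": 0},    # never solved
]

print(avg_attempt_score(rows))   # 50.0
```

In this example two of three tasks are eventually solved (pass_at_n of about 66.7%), while the average attempt score is 50.0 / 100.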

Statistical tier bands

The leaderboard groups models into statistical tiers based on paired bootstrap resampling over the shared task set. Models in the same tier are not statistically distinguishable at the current sample size — the apparent ranking difference within a tier may be noise. Tier boundaries are shown as divider rows in the table.
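For readers unfamiliar with the technique, the sketch below shows one common form of paired bootstrap: resample the shared tasks with replacement, recompute the difference in mean per-task credit between two models on each resample, and check whether the resulting percentile interval excludes zero. The resample count, the 95% level, the percentile method, and the function names are all assumptions for this illustration; the exact procedure CentralGauge uses to draw tier boundaries is not specified here.

```python
import random


def paired_bootstrap_diff_ci(a, b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile CI for mean(a) - mean(b), resampling task indices jointly.

    a[i] and b[i] must be the two models' credits on the same task i.
    """
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty, equally long per-task lists")
    rng = random.Random(seed)
    n = len(a)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]       # same tasks for both models
        diffs.append(sum(a[i] - b[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_resamples)]
    hi = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def distinguishable(a, b, **kwargs) -> bool:
    """True when the bootstrap interval for the difference excludes zero."""
    lo, hi = paired_bootstrap_diff_ci(a, b, **kwargs)
    return lo > 0 or hi < 0


model_a = [1.0, 1.0, 0.5, 0.0, 1.0, 0.5]   # made-up per-task AUC@2 credits
model_b = [1.0, 1.0, 0.5, 0.0, 1.0, 0.5]
print(distinguishable(model_a, model_b))    # False: identical results, same tier
```

Pairing matters because both models face the same tasks: resampling tasks jointly removes task difficulty as a source of noise in the comparison.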

Set selection

The set filter accepts set=current or a specific 64-character hash. set=all is rejected with HTTP 400 because cross-set aggregation has no well-defined denominator for the strict per-set ranking metric.
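The rule above could be enforced with a check along these lines. This is a hypothetical sketch, not the server's code: the function name, the lowercase-hex pattern for the hash, the error messages, and the choice to reject any other unrecognized value with 400 are assumptions.

```python
import re

# Assumption: the 64-character hash is lowercase hexadecimal.
_HASH_RE = re.compile(r"[0-9a-f]{64}")


def resolve_set_param(value: str):
    """Return (http_status, detail) for a `set` query parameter value."""
    if value == "current":
        return 200, "current"
    if _HASH_RE.fullmatch(value):
        return 200, value
    if value == "all":
        return 400, "set=all is not supported: no well-defined strict denominator"
    return 400, "set must be 'current' or a 64-character hash"


print(resolve_set_param("current"))   # (200, 'current')
print(resolve_set_param("all")[0])    # 400
```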

Metrics glossary

All metrics shown on the leaderboard and model detail pages are defined here. Each definition includes the formula used to compute it and guidance on when each metric is most useful.

Pass rate (unit: rate)

Tasks solved / tasks in scope, up to 2 attempts (strict per-set denominator).

Formula: (tasks_passed_attempt_1 + tasks_passed_attempt_2_only) / task_set_size

Counts unattempted tasks as failures. Scope-aware: reflects the active filters (set, category, difficulty). This is the final assisted solve rate with up to 2 attempts and the drill-down companion to Solve AUC@2.

First-try pass rate (unit: rate)

Tasks solved on the first attempt / tasks in scope (strict).

Formula: tasks_passed_attempt_1 / task_set_size

Measures single-shot accuracy without retry credit. Useful when comparing models where the second attempt is not available.

Solve AUC@2 (unit: rate)

Skill score: full credit for a first-try solve, half for a retry solve. Not the solve rate.

Formula: AUC@2 = (pass_at_1 + pass_at_n) / 2

Use as the headline ranking metric. Rewards first-try correctness over fail-then-repair without ignoring the two-attempt protocol, and spreads out the top of the leaderboard where pass_at_n is saturated. Significance is assessed via paired bootstrap (tier bands), not Wilson intervals.

Repair rate (unit: rate)

Share of first-attempt failures the model fixed on attempt 2.

Formula: (pass_at_n − pass_at_1) / (1 − pass_at_1); defined as 0 when pass_at_1 = 1.

Conditional recovery skill: high = good at reading compiler/test errors and patching. Profile column, not a ranking metric.

Pass Rate 95% CI (unit: rate)

95% Wilson confidence interval on the pass rate.

Formula: Wilson score interval, center ± half-width, where n is the strict denominator. When taskSetHash is provided, n is task_set_size or the category/difficulty-scoped task count; legacy callers that omit it fall back to tasks_attempted_distinct.

Use to judge whether a lead over another model is statistically meaningful. Wide CIs indicate too few tasks to draw firm conclusions.
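For reference, the standard Wilson score interval can be computed as in the sketch below. The function name, the default z value for a 95% interval, and the handling of n = 0 are choices made for this example.

```python
import math


def wilson_interval(passed: int, n: int, z: float = 1.959963984540054):
    """Wilson score interval (low, high) for `passed` successes out of `n` trials."""
    if n == 0:
        return 0.0, 1.0                 # no data: the interval is uninformative
    p = passed / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# 4 passes against a strict denominator of 50 tasks.
low, high = wilson_interval(4, 50)
print(f"{low:.3f} to {high:.3f}")
```

With 4 passes out of 50 the interval runs from roughly 3% to 19%, which shows how wide the uncertainty is at small task counts.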

Run success rate (unit: rate)

Tasks the run solved on its last attempt / tasks attempted in this run.

Formula: COUNT(distinct tasks where last attempt passed) / COUNT(distinct tasks attempted in this run)

Per-run metric for the model's "final answer" on each task. Differs from leaderboard pass_at_n: this denominator is the run's own attempted-task count, not the task set size, so partial runs are not penalised for unattempted tasks.
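The difference in denominator is easiest to see on a small made-up run. In this sketch the row fields (task_id, attempt, passed) and the function name are assumptions for illustration.

```python
def run_success_rate(rows: list) -> float:
    """Tasks whose last attempt passed, over distinct tasks attempted in this run."""
    last = {}
    for r in rows:
        prev = last.get(r["task_id"])
        if prev is None or r["attempt"] > prev["attempt"]:
            last[r["task_id"]] = r           # keep only the latest attempt per task
    return sum(1 for r in last.values() if r["passed"]) / len(last)


run = [
    {"task_id": "A", "attempt": 1, "passed": True},
    {"task_id": "B", "attempt": 1, "passed": False},
    {"task_id": "B", "attempt": 2, "passed": True},
    {"task_id": "C", "attempt": 1, "passed": False},
    {"task_id": "C", "attempt": 2, "passed": False},
]

rate = run_success_rate(run)
print(f"{rate:.1%}")   # 66.7%, regardless of how large the task set is
```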

All-runs pass rate (unit: rate)

Fraction of tasks the model solved in every single run (strict consistency, also written pass^n).

Formula: tasks where ALL runs produced a passing result / tasks_attempted_distinct

Measures reliability under repetition. High value means the model is unlikely to regress on a re-run, important for CI and production use. Formal name in the literature: pass^n.
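A minimal sketch of the formula, assuming per-task outcomes are available as a mapping from task ID to one pass/fail result per run (that data shape and the function name are assumptions for this example):

```python
def all_runs_pass_rate(outcomes: dict) -> float:
    """Fraction of attempted tasks that passed in every run (pass^n)."""
    return sum(1 for runs in outcomes.values() if all(runs)) / len(outcomes)


outcomes = {
    "A": [True, True, True],
    "B": [True, False, True],    # one regression disqualifies the task
    "C": [False, False, False],
    "D": [True, True, True],
}

print(all_runs_pass_rate(outcomes))   # 0.5
```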

Avg attempt score (unit: score)

Mean per-attempt score on a 0–100 point scale (partial credit). Drill-down only.

Formula: Mean of attempt scores across all results rows: SUM(score) / COUNT(*) over the results table. Each attempt earns 0–100 points based on compile + test outcomes.

Drill-down companion to pass_at_n. Rewards partial credit but not directly comparable to pass rate; use for within-model analysis.

Avg cost / task (unit: usd)

Average LLM cost per distinct benchmark task in USD.

Formula: SUM(cost_usd) / COUNT(DISTINCT task_id) across all the model's results in scope.

Use to compare operating cost across models with similar pass rates. Does not account for quality. Combine with $/Pass for a cost-efficiency view.

$/Pass (unit: usd)

Average USD cost per solved task (any-attempt pass).

Formula: SUM(cost_usd) / tasks_passed_distinct across all runs.

Best single cost-efficiency metric. Penalises expensive models that pass few tasks and rewards cheap models with high pass rates.
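The two cost formulas share a numerator and differ only in the denominator. The sketch below computes both from a made-up results list; the row fields, the function names, and returning None when no task passed are assumptions for illustration.

```python
def avg_cost_per_task(rows: list) -> float:
    """SUM(cost_usd) / COUNT(DISTINCT task_id)."""
    return sum(r["cost_usd"] for r in rows) / len({r["task_id"] for r in rows})


def cost_per_pass(rows: list):
    """SUM(cost_usd) / distinct passed tasks; None when nothing passed (assumed)."""
    passed = {r["task_id"] for r in rows if r["passed"]}
    if not passed:
        return None
    return sum(r["cost_usd"] for r in rows) / len(passed)


rows = [
    {"task_id": "A", "cost_usd": 0.10, "passed": True},
    {"task_id": "B", "cost_usd": 0.10, "passed": False},
    {"task_id": "B", "cost_usd": 0.20, "passed": True},
    {"task_id": "C", "cost_usd": 0.10, "passed": False},
    {"task_id": "C", "cost_usd": 0.10, "passed": False},
]

print(f"${avg_cost_per_task(rows):.2f} per task")   # $0.20 per task
print(f"${cost_per_pass(rows):.2f} per pass")       # $0.30 per pass
```

The spend on task C, which never passed, still lands in the numerator of $/Pass, which is why failures make a model look more expensive on that metric.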

Latency p50 (unit: duration_ms)

Median per-task wall time (LLM call + compile + test), in milliseconds.

Formula: 50th percentile of per-task duration_ms: LLM latency + compile time + test time.

Use p50 for a typical-case latency expectation. Unaffected by outlier slow tasks.

Latency p95 (unit: duration_ms)

95th-percentile per-task wall time. Captures tail latency.

Formula: 95th percentile of per-task duration_ms across all tasks in all runs.

Use p95 to understand worst-case latency. A low p95 means the model rarely stalls, relevant for automated pipelines with timeouts.
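The contrast between p50 and p95 shows up clearly when one task stalls. This sketch uses the nearest-rank percentile definition, which is an assumption; the site's exact interpolation method is not stated, and the durations are made up.

```python
import math


def percentile_nearest_rank(values: list, p: float) -> float:
    """Nearest-rank percentile: the smallest value with at least p% of data at or below it."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


durations_ms = [1000, 1200, 1500, 1800, 48000]   # one task stalled

print(percentile_nearest_rank(durations_ms, 50))  # 1500
print(percentile_nearest_rank(durations_ms, 95))  # 48000
```

The median stays at 1.5s despite the stalled task, while p95 surfaces it.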

Consistency (unit: pct)

Percentage (0–100) of tasks with the same outcome across every run.

Formula: 100 × tasks where all runs agree (all pass OR all fail) / tasks_attempted_distinct.

High consistency means the model behaves predictably. Low consistency flags flaky tasks or a model sensitive to stochastic generation noise.
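Note that a task that fails in every run counts as consistent. A minimal sketch, assuming outcomes are stored as a mapping from task ID to one pass/fail result per run (the data shape and function name are assumptions):

```python
def consistency_pct(outcomes: dict) -> float:
    """Percentage (0-100) of attempted tasks whose runs all agree."""
    agree = sum(1 for runs in outcomes.values() if all(runs) or not any(runs))
    return 100 * agree / len(outcomes)


outcomes = {
    "A": [True, True, True],      # all pass: consistent
    "B": [True, False, True],     # mixed: flaky
    "C": [False, False, False],   # all fail: also consistent
    "D": [True, True, True],
}

print(consistency_pct(outcomes))   # 75.0
```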

Tasks attempted (unit: count)

Count of distinct tasks the model has attempted at least once.

Formula: COUNT(DISTINCT task_id) over the results table for this model.

Coverage indicator. Strict pass_at_n still counts unattempted tasks as failures; use this to see how much of the active task set the model covered.

Tasks passed (unit: count)

Distinct tasks solved in any attempt across all runs.

Formula: COUNT(DISTINCT task_id) where best outcome = pass.

Absolute count version of pass_at_n. Useful when comparing models that have attempted different task counts.

Runs (unit: count)

Total number of benchmark runs recorded for this model.

Formula: COUNT(DISTINCT run_id) for this model.

More runs = more data, tighter confidence intervals, and more reliable pass^n / consistency metrics.

Verified runs (unit: count)

Runs signed and verified by an independent verifier machine.

Formula: COUNT of runs where the Ed25519 signature was verified by the worker at ingest.

Verified runs carry a stronger integrity guarantee than unverified ones.

Transparency

Every benchmark run is signed with an Ed25519 keypair held by the operator's machine and verified by the worker before ingest. The signed payload, public key, and verification record are linked from each run's detail page.

Source code is available on GitHub.