Pass rate
Tasks solved / tasks in scope, up to 2 attempts (strict per-set denominator).
Formula: (tasks_passed_attempt_1 + tasks_passed_attempt_2_only) / task_set_size
Includes unattempted tasks as failures. Scope-aware; reflects active filters (set, category, difficulty). Final assisted solve rate with up to 2 attempts; drill-down companion to Solve AUC@2.
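As a minimal sketch of the formula above, the function below divides tasks solved within two attempts by the full task-set size, so unattempted tasks count as failures. The function name and the sample counts are hypothetical and are not taken from the benchmark's code or data.

```python
def pass_rate(passed_attempt_1: int, passed_attempt_2_only: int, task_set_size: int) -> float:
    """Strict pass rate: tasks solved within 2 attempts over all tasks in scope."""
    if task_set_size <= 0:
        raise ValueError("task_set_size must be positive")
    # Unattempted tasks stay in the denominator, so they count as failures.
    return (passed_attempt_1 + passed_attempt_2_only) / task_set_size


# Hypothetical counts: 60 first-try passes, 15 passes on retry only, 110 tasks in scope.
example_rate = pass_rate(60, 15, 110)
print(f"{example_rate:.1%}")
```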
Avg cost / task
Average LLM cost per distinct benchmark task in USD.
Formula: SUM(cost_usd) / COUNT(DISTINCT task_id) across all the model's results in scope.
Use to compare operating cost across models with similar pass rates. Does not account for quality. Combine with $/Pass for a cost-efficiency view.
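The cost formula above could be computed roughly as follows. This is a sketch that assumes each result row is a `(task_id, cost_usd)` pair; the row shape and function name are illustrative, not the dashboard's actual schema.

```python
def avg_cost_per_task(rows):
    """SUM(cost_usd) / COUNT(DISTINCT task_id) over (task_id, cost_usd) rows."""
    rows = list(rows)
    distinct_tasks = {task_id for task_id, _ in rows}
    if not distinct_tasks:
        return 0.0  # assumption: no results in scope means zero cost
    total_cost = sum(cost for _, cost in rows)
    return total_cost / len(distinct_tasks)


# Hypothetical rows: task "t1" was attempted twice, so both costs count toward one task.
sample_rows = [("t1", 0.01), ("t1", 0.02), ("t2", 0.03)]
print(avg_cost_per_task(sample_rows))
```

Because retries add cost without adding distinct tasks, a model that often needs a second attempt shows a higher average here.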
Latency p50
Median per-task wall time (LLM call + compile + test), in milliseconds.
Formula: 50th percentile of per-task duration_ms, where duration_ms = LLM latency + compile time + test time.
Use p50 for a typical-case latency expectation. Unaffected by outlier slow tasks.
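To illustrate why the median resists slow outliers, the sketch below sums the three latency components per task and takes the median with Python's standard `statistics.median`. The component values are made up for illustration.

```python
from statistics import median


def duration_ms(llm_ms: int, compile_ms: int, test_ms: int) -> int:
    """Per-task wall time: LLM latency + compile time + test time."""
    return llm_ms + compile_ms + test_ms


# Hypothetical per-task components (LLM, compile, test); the last task is a slow outlier.
components = [(1200, 300, 500), (900, 250, 450), (60000, 400, 600)]
durations = [duration_ms(*parts) for parts in components]

p50 = median(durations)
print(p50)
```

The outlier of 61,000 ms does not move the median, which stays at the middle value of the three tasks.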
Avg attempt score
Mean per-attempt score on a 0–100 point scale (partial credit). Drill-down only.
Formula: Mean of attempt scores across all results rows: SUM(score) / COUNT(*) over the results table. Each attempt earns 0–100 points based on compile + test outcomes.
Drill-down companion to pass_at_n. Rewards partial credit but not directly comparable to pass rate; use for within-model analysis.
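A small sketch of the per-attempt mean described above, using hypothetical scores. It shows how a failed first try that triggered a retry adds its own observation and lowers the mean even when the task is eventually solved.

```python
def avg_attempt_score(scores):
    """SUM(score) / COUNT(*) over every attempt row, each scored 0-100."""
    scores = list(scores)
    if not scores:
        raise ValueError("no attempts to average")
    return sum(scores) / len(scores)


# Hypothetical attempts: task A scores 100 on its first try;
# task B scores 40 on its first try, then 100 on the retry.
attempt_scores = [100, 40, 100]
print(avg_attempt_score(attempt_scores))
```

Both tasks end up solved, yet the per-attempt mean is below 100 because the failed first attempt is still counted.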
All-runs pass rate
Fraction of tasks the model solved in every single run (strict consistency, also written pass^n).
Formula: tasks where ALL runs produced a passing result / tasks_attempted_distinct
Measures reliability under repetition. High value means the model is unlikely to regress on a re-run, important for CI and production use. Formal name in the literature: pass^n.
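The all-runs formula can be sketched as a group-by over tasks. The `(run_id, task_id, passed)` row shape is an assumption, as is the choice to judge a task only on the runs in which it appears.

```python
from collections import defaultdict


def all_runs_pass_rate(results):
    """Fraction of distinct attempted tasks that passed in every run they appear in."""
    outcomes_by_task = defaultdict(list)
    for _run_id, task_id, passed in results:
        outcomes_by_task[task_id].append(passed)
    if not outcomes_by_task:
        return 0.0  # assumption: nothing attempted means a rate of zero
    consistent = sum(1 for outcomes in outcomes_by_task.values() if all(outcomes))
    return consistent / len(outcomes_by_task)


# Hypothetical results: task "a" passes in both runs, task "b" regresses in run "r2".
sample_results = [
    ("r1", "a", True), ("r2", "a", True),
    ("r1", "b", True), ("r2", "b", False),
]
print(all_runs_pass_rate(sample_results))
```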
$/Pass
Total cost divided by number of distinct tasks passed. Lower is better.
Formula: SUM(cost_usd) / tasks_passed_distinct across all runs.
Best single cost-efficiency metric. Penalises expensive models that pass few tasks and rewards cheap models with high pass rates.
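A rough sketch of the $/Pass formula, assuming `(task_id, cost_usd, passed)` rows. Returning infinity when nothing passed is a choice made here for illustration; the source does not say how the dashboard handles that case.

```python
def cost_per_pass(rows):
    """SUM(cost_usd) over all rows / number of distinct tasks with at least one pass."""
    rows = list(rows)
    total_cost = sum(cost for _, cost, _ in rows)
    passed_tasks = {task_id for task_id, _, ok in rows if ok}
    if not passed_tasks:
        return float("inf")  # assumption: no passes means unbounded cost per pass
    return total_cost / len(passed_tasks)


# Hypothetical rows: task "a" passes in two runs, task "b" never passes.
sample_rows = [("a", 0.5, True), ("a", 0.25, True), ("b", 0.25, False)]
print(cost_per_pass(sample_rows))
```

Money spent on failed tasks stays in the numerator, which is what penalises expensive models that pass few tasks.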
Latency p95
95th-percentile per-task wall time. Captures tail latency.
Formula: 95th percentile of per-task duration_ms across all tasks in all runs.
Use p95 to understand tail latency: 95% of tasks finish at or below this value. A low p95 means the model rarely stalls, which matters for automated pipelines with timeouts.
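The source does not state which percentile method the dashboard uses. The sketch below assumes the nearest-rank definition, which always returns an observed duration; other methods interpolate between values and can give slightly different results.

```python
import math


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    # Multiply before dividing so integer percentiles avoid float rounding surprises.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical per-task durations in milliseconds across all runs.
sample_durations_ms = [1600, 2000, 2100, 2300, 2500, 2600, 2800, 3000, 3200, 61000]
print(percentile_nearest_rank(sample_durations_ms, 95))
```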
Overview
Gemini 3.1 Pro Preview has run on 3 occasions, attempting 110 tasks with an average score of 73.9 / 100.
Settings
Generation parameters used across this model's runs. "varies" indicates the value differed between runs.
- Temperature: varies
- Thinking budget: varies
- Avg tokens / run (input + output): 182,458
- Consistency: 86.4%
History
Cost
Failure modes
- AL0132 116 'Record Customer' does not contain a definition for 'Preferred Contact Method'
- AL0104 102 Syntax error, ')' expected
- AL0000 96 App generation failed
- AL0111 51 Semicolon expected. Add a semicolon (;) to terminate the statement.
- AL0360 36 Text literal was not properly terminated. Use the character ' to terminate the literal.
- AL0198 32 Expected one of the application object keywords (table, tableextension, page, pageextension, pagecustomization, profile, profileextension, codeunit, report, reportextension, xmlport, query, controladdin, dotnet, enum, enumextension, interface, permissionset, permissionsetextension, entitlement)
- AL0169 16 The option value 'Integration' is not valid. Check the enum definition for valid values.
- AL0133 15 Argument 2: cannot convert from 'Text' to 'SecretText'
- AL0122 13 Cannot implicitly convert type 'None' to 'Text'. Use an explicit conversion or change the type.
- AL0126 12 No overload for method 'Clear' takes 1 arguments. Candidates: 'Clear()' defined in Codeunit 'CG H054 Cache' by the extension CentralGauge_CG-AL-H054_1 by CentralGauge (1.0.0.0)
Shortcomings
AL concepts Gemini 3.1 Pro Preview struggles with. Click a row for description, correct pattern, and observed error codes.
Shortcomings analysis queued. This section will populate once the run is processed.
Recent runs
| Started | Run | Tasks | Score | Cost | Duration | Status |
|---|---|---|---|---|---|---|
| 18h ago | cd36fbe5-aaf… | 94/110 | 73.4 / 100 | $1.11 | 5h 11m | completed |
| 22h ago | a6a3e726-aca… | 98/110 | 73.3 / 100 | $1.11 | 6h 25m | completed |
| 1d ago | 97865fb5-e26… | 97/110 | 75.1 / 100 | $1.02 | 6h 23m | completed |
Methodology
Pass rate counts unattempted tasks as failures (strict denominator). Avg attempt score is the mean over every attempt the model produced; a failed first try that triggered a retry contributes its own observation, which pulls the mean down. See the about page for the full breakdown and unit conventions.