Claude Opus 4.7 · CentralGauge

Pass@N

Pass rate

Tasks solved / tasks in scope, up to 2 attempts (strict per-set denominator).

Formula: (tasks_passed_attempt_1 + tasks_passed_attempt_2_only) / task_set_size

Includes unattempted tasks as failures. Scope-aware; reflects active filters (set, category, difficulty). Final assisted solve rate with up to 2 attempts; drill-down companion to Solve AUC@2.

HumanEval paper (Chen et al., 2021)

88.2%

Tasks pass 97/110

1st: 80 2nd: 17 Failed: 13
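As a quick check, the pass-rate formula can be applied to the counts shown on this card. This is a minimal Python sketch; the function name `pass_rate` is illustrative and not part of CentralGauge.

```python
def pass_rate(passed_attempt_1: int, passed_attempt_2_only: int, task_set_size: int) -> float:
    """Strict pass rate: unattempted tasks stay in the denominator as failures."""
    if task_set_size <= 0:
        raise ValueError("task_set_size must be positive")
    return (passed_attempt_1 + passed_attempt_2_only) / task_set_size


# Counts taken from this card: 80 first-attempt passes, 17 second-attempt-only passes, 110 tasks.
rate = pass_rate(80, 17, 110)
print(f"{rate:.1%}")  # 88.2%
```

The 13 failed tasks appear only in the denominator, which is why the three counts sum to the task set size.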

Avg cost / task

Average LLM cost per distinct benchmark task in USD.

Formula: SUM(cost_usd) / COUNT(DISTINCT task_id) across all the model's results in scope.

Use to compare operating cost across models with similar pass rates. Does not account for quality. Combine with $/Pass for a cost-efficiency view.

$0.22
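The effect of COUNT(DISTINCT task_id) is easiest to see on a few rows. The sketch below uses hypothetical result rows (the task ids and costs are invented for illustration): every attempt adds to the numerator, but a retried task is counted once in the denominator.

```python
def avg_cost_per_task(rows):
    """rows: iterable of (task_id, cost_usd) result rows in scope."""
    rows = list(rows)
    distinct_tasks = {task_id for task_id, _ in rows}
    if not distinct_tasks:
        return 0.0
    total_cost = sum(cost for _, cost in rows)
    return total_cost / len(distinct_tasks)


# Hypothetical data: task "T1" needed two attempts, task "T2" needed one.
rows = [("T1", 0.10), ("T1", 0.14), ("T2", 0.20)]
avg = avg_cost_per_task(rows)
print(f"${avg:.2f}")  # $0.22
```

Dividing by the number of rows instead (three here) would understate the per-task cost of models that retry often.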

Latency p50

Median per-task wall time (LLM call + compile + test). Recorded in milliseconds; displayed in seconds.

Formula: 50th percentile of per-task duration_ms: LLM latency + compile time + test time.

Use p50 for a typical-case latency expectation. The median is robust to a few outlier slow tasks.

19.9s
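The robustness of the median can be illustrated with Python's standard `statistics` module. The durations below are hypothetical, chosen only so the median matches the displayed value; they are not CentralGauge data.

```python
import statistics

# Hypothetical per-task durations in milliseconds (LLM + compile + test).
durations_ms = [15_000, 18_500, 19_900, 21_000, 24_000]
p50 = statistics.median(durations_ms)

# Replace the slowest task with a 10-minute outlier: the median does not move.
with_outlier = durations_ms[:-1] + [600_000]
p50_with_outlier = statistics.median(with_outlier)

print(f"{p50 / 1000:.1f}s")  # 19.9s
```

The mean of the same data would jump from roughly 20 s to over 130 s, which is why the median is the better typical-case figure.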

Avg score

Avg attempt score

Mean per-attempt score on a 0–100 point scale (partial credit). Drill-down only.

Formula: Mean of attempt scores across all results rows: SUM(score) / COUNT(*) over the results table. Each attempt earns 0–100 points based on compile + test outcomes.

Drill-down companion to pass_at_n. Rewards partial credit but is not directly comparable to pass rate; use for within-model analysis.

70.0 / 100
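Because every attempt is its own row, a task that fails once and then passes contributes two scores. The sketch below uses hypothetical attempt rows and invented score values (the real scoring rule is not shown on this page) to contrast the per-attempt mean with a per-task best score.

```python
from statistics import mean

# Hypothetical rows: (task_id, attempt_number, score on a 0-100 scale).
attempts = [
    ("T1", 1, 100.0),  # passed on the first attempt
    ("T2", 1, 40.0),   # failed first attempt, partial credit
    ("T2", 2, 100.0),  # passed on the retry
]

# Avg attempt score: SUM(score) / COUNT(*) over all rows.
avg_attempt_score = mean(score for _, _, score in attempts)

# For contrast only: the best score per task, which ignores the failed first try.
best_per_task = {}
for task_id, _, score in attempts:
    best_per_task[task_id] = max(score, best_per_task.get(task_id, 0.0))
avg_best_score = mean(best_per_task.values())

print(avg_attempt_score, avg_best_score)  # 80.0 100.0
```

Both tasks end up solved, yet the per-attempt mean is 80 rather than 100, so the metric can sit well below the pass rate.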

All-runs pass rate

Fraction of tasks the model solved in every single run (strict consistency, also written pass^n).

Formula: tasks where ALL runs produced a passing result / tasks_attempted_distinct

Measures reliability under repetition. High value means the model is unlikely to regress on a re-run, important for CI and production use. Formal name in the literature: pass^n.

76.4%
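A minimal sketch of the all-runs (pass^n) computation, assuming results are available as a mapping from task id to one pass/fail flag per run. The data layout and the function name are assumptions for illustration, not the CentralGauge schema.

```python
def all_runs_pass_rate(results):
    """results: dict mapping task_id -> list of booleans, one per run."""
    attempted = {task: runs for task, runs in results.items() if runs}
    if not attempted:
        return 0.0
    consistent = sum(1 for runs in attempted.values() if all(runs))
    return consistent / len(attempted)


# Hypothetical outcomes for four tasks across two runs.
results = {
    "T1": [True, True],    # solved in every run: counts
    "T2": [True, False],   # solved once: does not count
    "T3": [False, False],  # never solved
    "T4": [True, True],    # solved in every run: counts
}
rate = all_runs_pass_rate(results)
print(f"{rate:.1%}")  # 50.0%
```

Task T2 would count toward an any-run pass rate but not toward this one, which is why the all-runs figure is the lower of the two.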

$/Pass

Average USD cost per solved task (any-attempt pass).

Formula: SUM(cost_usd) / tasks_passed_distinct across all runs.

Best single cost-efficiency metric. Penalises expensive models that pass few tasks and rewards cheap models with high pass rates.

$0.2455
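The displayed figure can be reproduced from other numbers on this page, under two assumptions: that SUM(cost_usd) equals the sum of the per-run Cost column in the Runs table, and that tasks_passed_distinct is the 97 solved tasks from the pass-rate card. A short Python check:

```python
# Per-run costs in USD, copied from the Runs table on this page.
run_costs_usd = [3.65, 3.83, 3.91, 4.22, 4.17, 4.03]
tasks_passed_distinct = 97  # assumption: solved tasks from the pass-rate card
tasks_in_scope = 110

total_cost = sum(run_costs_usd)
cost_per_pass = total_cost / tasks_passed_distinct
cost_per_task = total_cost / tasks_in_scope

print(f"${cost_per_pass:.4f}")  # $0.2455
print(f"${cost_per_task:.2f}")  # $0.22
```

The same total divided by all 110 tasks gives the Avg cost / task figure; the gap between the two widens as the pass rate falls.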

Latency p95

95th-percentile per-task wall time. Captures tail latency.

Formula: 95th percentile of per-task duration_ms across all tasks in all runs.

Use p95 to understand tail latency: 95% of tasks finish within this time. A low p95 means the model rarely stalls, which matters for automated pipelines with timeouts.

2m 37s
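The page does not state which percentile method is used. The sketch below assumes the nearest-rank definition and uses hypothetical durations; the function name is illustrative.

```python
def percentile_nearest_rank(values, pct: int):
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct/100 * n
    return ordered[rank - 1]


# Hypothetical durations in ms: 18 ordinary tasks plus two slow ones.
durations_ms = [10_000 + 1_000 * i for i in range(18)] + [150_000, 157_000]

p50 = percentile_nearest_rank(durations_ms, 50)
p95 = percentile_nearest_rank(durations_ms, 95)
print(p50, p95)  # 19000 150000
```

With only two slow tasks out of twenty, p50 stays at an ordinary value while p95 lands on a slow one, which is the behavior a timeout budget needs to account for.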

Overview

Claude Opus 4.7 has been run 6 times against 110 tasks, with an average score of 70.0 / 100.

Settings

Generation parameters used across this model's runs. "varies" indicates the value differed between runs.

Temperature: varies
Thinking budget: varies
Avg tokens / run (input + output): 307,980
Consistency: 81.8%

History

6 runs · oldest 1mo ago · latest 20d ago

Cost

Failure modes

Code	Count	Message
AL0104	258	Syntax error, '=' expected
AL0000	222	App generation failed
AL0107	99	Syntax error, identifier expected. Provide a valid name (letters, digits, and underscores only).
AL0111	85	Semicolon expected. Add a semicolon (;) to terminate the statement.
AL0105	78	Syntax error, identifier expected; 'key' is a keyword
AL0132	62	'FieldRef' does not contain a definition for 'CreateInStream'
AL0169	48	The option value 'Masked' is not valid. Check the enum definition for valid values.
AL0126	29	No overload for method 'Clear' takes 1 arguments. Candidates: 'Clear()' defined in Codeunit 'CG H054 Cache' by the extension CentralGauge_CG-AL-H054_2 by CentralGauge (1.0.0.0)
AL0198	24	Expected one of the application object keywords (table, tableextension, page, pageextension, pagecustomization, profile, profileextension, codeunit, report, reportextension, xmlport, query, controladdin, dotnet, enum, enumextension, interface, permissionset, permissionsetextension, entitlement)
AL0224	22	Expression expected. Provide a valid expression (variable, constant, calculation, or method call).

Shortcomings

AL concepts Claude Opus 4.7 struggles with. Click a row for description, correct pattern, and observed error codes.

Queued for analysis. This section will populate once the run is processed.

Recent runs

Runs
Started	Run	Model	Tasks	Score	Cost	Duration	Status
20d ago	68e1635c-6be…	Claude Opus 4.7 claude	91/110	69.8 / 100	$3.65	1h 9m	completed
21d ago	eae41fb1-f1a…	Claude Opus 4.7 claude	91/110	71.1 / 100	$3.83	1h 11m	completed
21d ago	f10ed110-b8a…	Claude Opus 4.7 claude	93/110	70.5 / 100	$3.91	1h 19m	completed
1mo ago	e5dcc595-a28…	Claude Opus 4.7 claude	90/110	68.9 / 100	$4.22	2h 6m	completed
1mo ago	7d67b00d-60e…	Claude Opus 4.7 claude	90/110	68.6 / 100	$4.17	2h 12m	completed
1mo ago	47caf9d8-139…	Claude Opus 4.7 claude	90/110	71.4 / 100	$4.03	1h 32m	completed

Methodology

Pass rate counts unattempted tasks as failures (strict denominator). Avg score is the mean over every attempt the model produced: a failed first try that triggered a retry still counts as its own observation, pulling the mean down. See the about page for the full breakdown and unit conventions.