Pass@N

Pass rate

Tasks solved / tasks in scope, up to 2 attempts (strict per-set denominator).

Formula: (tasks_passed_attempt_1 + tasks_passed_attempt_2_only) / task_set_size

Includes unattempted tasks as failures. Scope-aware; reflects active filters (set, category, difficulty). Final assisted solve rate with up to 2 attempts; drill-down companion to Solve AUC@2.

92.7%
Tasks pass 102/110
1st: 90 2nd: 12 Failed: 8
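
As a quick check on the formula above, the following minimal sketch recomputes the displayed figure from the first-attempt and second-attempt counts on this card. The function and parameter names are illustrative assumptions, not the dashboard's actual code.

```python
def pass_rate(passed_first: int, passed_second_only: int, task_set_size: int) -> float:
    """Fraction of in-scope tasks solved within two attempts.

    Unattempted tasks stay in the denominator, so they count as failures.
    """
    if task_set_size <= 0:
        raise ValueError("task_set_size must be positive")
    return (passed_first + passed_second_only) / task_set_size


# Counts taken from the card: 1st: 90, 2nd: 12, set size 110.
rate = pass_rate(passed_first=90, passed_second_only=12, task_set_size=110)
print(f"{rate:.1%}")  # 92.7%
```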
Avg cost / task

Average LLM cost per distinct benchmark task in USD.

Formula: SUM(cost_usd) / COUNT(DISTINCT task_id) across all the model's results in scope.

Use to compare operating cost across models with similar pass rates. Does not account for quality. Combine with $/Pass for a cost-efficiency view.

$0.34 ↓ -59%
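
The point of `COUNT(DISTINCT task_id)` is that retries and repeated runs add to the cost total without adding to the task count. A small sketch with made-up rows (the task IDs and costs below are hypothetical, chosen only to show the arithmetic):

```python
def avg_cost_per_task(rows: list[tuple[str, float]]) -> float:
    """SUM(cost_usd) / COUNT(DISTINCT task_id) over (task_id, cost_usd) rows."""
    distinct_tasks = {task_id for task_id, _ in rows}
    if not distinct_tasks:
        raise ValueError("no results in scope")
    return sum(cost for _, cost in rows) / len(distinct_tasks)


# Hypothetical results: task "t1" needed a retry, so it has two rows.
rows = [("t1", 0.10), ("t1", 0.05), ("t2", 0.30)]
print(round(avg_cost_per_task(rows), 3))  # 0.225
```

Three result rows but only two distinct tasks, so the retry raises the average rather than diluting it.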
Latency p50

Median per-task wall time (LLM call + compile + test), in milliseconds.

Formula: 50th percentile of per-task duration_ms: LLM latency + compile time + test time.

Use p50 for a typical-case latency expectation. Robust to a small number of outlier slow tasks.

25.2s
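
To illustrate why the median is a stable typical-case figure, here is a minimal sketch using Python's standard library. The durations are hypothetical values in milliseconds, not data from these runs.

```python
import statistics

# Hypothetical per-task durations in ms; the last task stalls badly.
durations_ms = [21_000, 24_000, 25_200, 27_500, 175_000]
p50 = statistics.median(durations_ms)
mean = statistics.mean(durations_ms)
print(p50)   # 25200
print(mean)  # 54540

# Making the slow task ten times slower moves the mean but not the median.
worse = durations_ms[:-1] + [1_750_000]
print(statistics.median(worse))  # 25200
```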
Avg score

Avg attempt score

Mean per-attempt score on a 0–100 point scale (partial credit). Drill-down only.

Formula: Mean of attempt scores across all results rows: SUM(score) / COUNT(*) over the results table. Each attempt earns 0–100 points based on compile + test outcomes.

Drill-down companion to pass_at_n. Rewards partial credit but not directly comparable to pass rate; use for within-model analysis.

79.8 / 100 ↑ +9.8 pts
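
The gap between this metric and pass rate comes from counting every attempt as one observation. A minimal sketch, assuming a results table with one row per attempt (the rows and field names below are hypothetical):

```python
# Hypothetical results rows: task "a" passes first try,
# task "b" fails its first attempt and passes on the retry.
results = [
    {"task_id": "a", "attempt": 1, "score": 100, "passed": True},
    {"task_id": "b", "attempt": 1, "score": 40, "passed": False},
    {"task_id": "b", "attempt": 2, "score": 100, "passed": True},
]

# SUM(score) / COUNT(*): every attempt row is one observation.
avg_attempt_score = sum(r["score"] for r in results) / len(results)

# Pass rate looks only at whether each distinct task was eventually solved.
tasks = {r["task_id"] for r in results}
solved = {r["task_id"] for r in results if r["passed"]}
pass_rate = len(solved) / len(tasks)

print(avg_attempt_score)  # 80.0
print(pass_rate)          # 1.0
```

Both tasks are solved, yet the failed first attempt on task "b" holds the attempt-level mean at 80, which is why the two numbers are not directly comparable.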
All-runs pass rate

Fraction of tasks the model solved in every single run (strict consistency, also written pass^n).

Formula: tasks where ALL runs produced a passing result / tasks_attempted_distinct

Measures reliability under repetition. A high value means the model is unlikely to regress on a re-run, which matters for CI and production use.

90.9%
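
One way to read the formula above: a task counts only if every run solved it. The sketch below assumes each run is a mapping from task ID to a pass/fail flag and that a task missing from a run counts as not passed in that run; both are assumptions for illustration, and the data is hypothetical.

```python
def all_runs_pass_rate(runs: list[dict[str, bool]]) -> float:
    """Tasks passed in ALL runs / distinct tasks attempted in any run."""
    attempted = set().union(*(run.keys() for run in runs))
    if not attempted:
        raise ValueError("no attempted tasks")
    # Assumption: a task absent from a run is treated as a failure there.
    consistent = {t for t in attempted if all(run.get(t, False) for run in runs)}
    return len(consistent) / len(attempted)


# Hypothetical outcomes for three runs over three tasks.
runs = [
    {"a": True, "b": True, "c": False},
    {"a": True, "b": False, "c": False},
    {"a": True, "b": True, "c": True},
]
print(round(all_runs_pass_rate(runs), 3))  # 0.333
```

Only task "a" passes in all three runs, so the rate is 1/3 even though every task passed at least once.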
$/Pass

Average USD cost per solved task (any-attempt pass).

Formula: SUM(cost_usd) / tasks_passed_distinct across all runs.

Best single cost-efficiency metric. Penalises expensive models that pass few tasks and rewards cheap models with high pass rates.

$0.3719
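
The displayed value appears consistent with the per-run costs listed under Recent runs ($12.37, $12.41, $13.15) divided by the 102 distinct tasks passed. A minimal sketch of that arithmetic follows; treating those three costs as the full `SUM(cost_usd)` is an assumption, and the variable names are illustrative.

```python
# Per-run costs in USD as listed under Recent runs (assumed to be the full total).
run_costs_usd = [12.37, 12.41, 13.15]
tasks_passed_distinct = 102  # from the Pass rate card
tasks_in_scope = 110

total_cost = sum(run_costs_usd)
cost_per_pass = total_cost / tasks_passed_distinct
avg_cost_per_task = total_cost / tasks_in_scope

print(f"${cost_per_pass:.4f}")      # $0.3719
print(f"${avg_cost_per_task:.2f}")  # $0.34
```

The two metrics share a numerator; $/Pass is higher because failed tasks are excluded from its denominator.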
Latency p95

95th-percentile per-task wall time. Captures tail latency.

Formula: 95th percentile of per-task duration_ms across all tasks in all runs.

Use p95 to understand near-worst-case latency: 95% of tasks finish within this time. A low p95 means the model rarely stalls, which is relevant for automated pipelines with timeouts.

2m 55s
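
A percentile can be computed in several ways. The sketch below uses the nearest-rank method, which is an assumption here, since the page does not say which interpolation the dashboard uses. The durations are hypothetical.

```python
import math


def percentile_nearest_rank(values: list[float], p: int) -> float:
    """Smallest value such that at least p% of the data is <= it (nearest-rank)."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical durations in ms: 19 ordinary tasks plus one that stalls.
durations_ms = [20_000 + 500 * i for i in range(19)] + [175_000]
print(percentile_nearest_rank(durations_ms, 95))   # 29000
print(percentile_nearest_rank(durations_ms, 100))  # 175000
```

With 20 samples, p95 lands on the 19th value, so a single stalled task shows up in the maximum but not in p95. A pipeline timeout sized from p95 should therefore still leave some headroom.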

Overview

Claude Fable 5 has been run 3 times, attempting 110 tasks with an average score of 79.8 / 100.

Settings

Generation parameters used across this model's runs. "varies" indicates the value differed between runs.

Temperature
varies
Thinking budget
varies
Avg tokens / run (input + output)
410,323
Consistency
90%

History

3 runs · oldest 10d ago · latest 9d ago

Cost

Chart: cost per run (series: mean, p95)

Failure modes

  • AL0104 · 114 · Syntax error, ')' expected
  • AL0000 · 68 · App generation failed
  • AL0111 · 27 · Semicolon expected. Add a semicolon (;) to terminate the statement.
  • AL0360 · 17 · Text literal was not properly terminated. Use the character ' to terminate the literal.
  • AL0107 · 8 · Syntax error, identifier expected. Provide a valid name (letters, digits, and underscores only).
  • AL0169 · 8 · The option value 'ReadOnly' is not valid. Check the enum definition for valid values.
  • AL0761 · 8 · An incorrect value was used for the category. One of the values of the enum 2000000001 EventCategory is expected which is available in platform version 22.0.0.0 and higher.
  • AL0118 · 6 · The name 'CreateSequentialGuid' does not exist in the current context.
  • AL0198 · 6 · Expected one of the application object keywords (table, tableextension, page, pageextension, pagecustomization, profile, profileextension, codeunit, report, reportextension, xmlport, query, controladdin, dotnet, enum, enumextension, interface, permissionset, permissionsetextension, entitlement)
  • AL0132 · 5 · 'Record Product' does not contain a definition for 'Product Code'

Shortcomings

AL concepts Claude Fable 5 struggles with. Click a row for description, correct pattern, and observed error codes.

Shortcomings analysis queued. This section will populate once the run is processed.

Recent runs

Runs
Started   Run             Model   Tasks     Score        Cost     Duration   Status
9d ago    021a60c0-85d…           101/110   81.4 / 100   $12.37   1h 18m     completed
10d ago   05dc356f-e4b…           101/110   78.7 / 100   $12.41   1h 43m     completed
10d ago   9ab3a769-521…           101/110   79.4 / 100   $13.15   2h 3m      completed

Methodology

Pass rate counts unattempted tasks as failures (strict denominator). Avg score is the mean of every attempt the model produced: a failed first try that triggered a retry contributes its own observation alongside the retry, which pulls the mean down. See the about page for the full breakdown and unit conventions.