Pass rate
Tasks solved / tasks in scope, up to 2 attempts (strict per-set denominator).
Formula: (tasks_passed_attempt_1 + tasks_passed_attempt_2_only) / task_set_size
Unattempted tasks count as failures. Scope-aware: the value reflects the active filters (set, category, difficulty). This is the final assisted solve rate after up to 2 attempts, and the drill-down companion to Solve AUC@2.
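As a minimal sketch of the formula above, assuming each task's attempt outcomes are available as a list of booleans (the `results` mapping and the `pass_rate` helper are hypothetical names, not part of the dashboard):

```python
from typing import Dict, List, Sequence


def pass_rate(task_set: Sequence[str], results: Dict[str, List[bool]]) -> float:
    """Fraction of tasks in scope solved within the first 2 attempts.

    Tasks with no entry in `results` were never attempted; they still count
    in the denominator, so they are treated as failures.
    """
    if not task_set:
        raise ValueError("task_set must not be empty")
    solved = sum(1 for task_id in task_set if any(results.get(task_id, [])[:2]))
    return solved / len(task_set)


# Hypothetical data: t1 passes on the first try, t2 passes on the retry,
# t3 fails both attempts, t4 was never attempted.
task_set = ["t1", "t2", "t3", "t4"]
results = {"t1": [True], "t2": [False, True], "t3": [False, False]}
rate = pass_rate(task_set, results)  # 2 solved / 4 in scope = 0.5
```

Dividing by the attempted tasks instead (3 here) would give a higher figure, which is why the strict denominator matters when comparing models.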
Avg cost / task
Average LLM cost per distinct benchmark task in USD.
Formula: SUM(cost_usd) / COUNT(DISTINCT task_id) across all the model's results in scope.
Use to compare operating cost across models with similar pass rates. Does not account for quality. Combine with $/Pass for a cost-efficiency view.
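The following is a small sketch of that aggregation, assuming result rows are available as `(task_id, cost_usd)` pairs (a hypothetical shape chosen for illustration):

```python
from typing import List, Tuple


def avg_cost_per_task(rows: List[Tuple[str, float]]) -> float:
    """SUM(cost_usd) / COUNT(DISTINCT task_id) over the rows in scope."""
    if not rows:
        raise ValueError("no result rows in scope")
    total_cost = sum(cost for _, cost in rows)
    distinct_tasks = {task_id for task_id, _ in rows}
    return total_cost / len(distinct_tasks)


# Hypothetical rows: t1 needed two attempts, so both of its costs are summed
# but the task is counted once in the denominator.
rows = [("t1", 0.25), ("t1", 0.25), ("t2", 0.5)]
cost = avg_cost_per_task(rows)  # 1.0 USD / 2 tasks = 0.5
```

Because retries add cost without adding tasks, a model that often needs a second attempt shows a higher average under this definition.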
Latency p50
Median per-task wall time (LLM call + compile + test), in milliseconds.
Formula: 50th percentile of per-task duration_ms, where duration_ms = LLM latency + compile time + test time.
Use p50 for a typical-case latency expectation. It is largely unaffected by a few slow outlier tasks.
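To illustrate why the median resists outliers, here is a short sketch using Python's standard `statistics` module on made-up durations (the values are hypothetical, not taken from any run):

```python
import statistics

# Hypothetical per-task durations in milliseconds; the last task stalled.
durations_ms = [1200, 1500, 1800, 2100, 60000]

p50 = statistics.median(durations_ms)   # middle value of the sorted list
mean = statistics.mean(durations_ms)    # pulled up by the single slow task
```

Here `p50` is 1800 ms while the mean is 13320 ms, so one stalled task dominates the mean but leaves the median unchanged.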
Avg attempt score
Mean per-attempt score on a 0–100 point scale (partial credit). Drill-down only.
Formula: Mean of attempt scores across all results rows: SUM(score) / COUNT(*) over the results table. Each attempt earns 0–100 points based on compile + test outcomes.
Drill-down companion to pass_at_n. Rewards partial credit but not directly comparable to pass rate; use for within-model analysis.
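A minimal sketch of the per-attempt mean, assuming each results row carries a `score` between 0 and 100 (the row layout below is hypothetical):

```python
from typing import List, Tuple


def avg_attempt_score(rows: List[Tuple[str, int, float]]) -> float:
    """Mean score over all attempt rows: SUM(score) / COUNT(*)."""
    if not rows:
        raise ValueError("no attempts in scope")
    return sum(score for _, _, score in rows) / len(rows)


# Hypothetical rows of (task_id, attempt_number, score).
# t1 failed its first attempt with partial credit, then passed on the retry.
rows = [("t1", 1, 40.0), ("t1", 2, 100.0), ("t2", 1, 100.0)]
score = avg_attempt_score(rows)  # (40 + 100 + 100) / 3 = 80.0
```

Both tasks end up solved in this example, yet the average is 80 rather than 100 because the failed first attempt is its own row. This is one reason the metric is not directly comparable to pass rate.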
All-runs pass rate
Fraction of tasks the model solved in every single run (strict consistency, also written pass^n).
Formula: tasks where ALL runs produced a passing result / tasks_attempted_distinct
Measures reliability under repetition. High value means the model is unlikely to regress on a re-run, important for CI and production use. Formal name in the literature: pass^n.
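The sketch below shows one way to compute this, under the assumption that a task counts as consistent when every recorded result for it, across all runs, is a pass. How the dashboard treats tasks missing from some runs is not stated, so that choice and the row layout are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def all_runs_pass_rate(rows: List[Tuple[str, str, bool]]) -> float:
    """Tasks passed in every recorded run / distinct tasks attempted."""
    outcomes: Dict[str, List[bool]] = defaultdict(list)
    for _run_id, task_id, passed in rows:
        outcomes[task_id].append(passed)
    if not outcomes:
        raise ValueError("no attempted tasks in scope")
    consistent = sum(1 for results in outcomes.values() if all(results))
    return consistent / len(outcomes)


# Hypothetical rows of (run_id, task_id, passed).
rows = [
    ("r1", "t1", True), ("r2", "t1", True),    # passes in every run
    ("r1", "t2", True), ("r2", "t2", False),   # flaky: fails once
    ("r1", "t3", False), ("r2", "t3", False),  # never passes
]
consistency = all_runs_pass_rate(rows)  # 1 consistent task / 3 attempted
```

The flaky task t2 would count as solved under an any-run pass metric, but it counts against the model here.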
$/Pass
Average USD cost per solved task (any-attempt pass).
Formula: SUM(cost_usd) / tasks_passed_distinct across all runs.
Best single cost-efficiency metric. Penalises expensive models that pass few tasks and rewards cheap models with high pass rates.
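As a rough sketch of the calculation, assuming rows of `(task_id, cost_usd, passed)` (a hypothetical layout), and returning `None` when nothing passed, which is an assumption since the dashboard's behaviour in that case is not described:

```python
from typing import List, Optional, Tuple


def cost_per_pass(rows: List[Tuple[str, float, bool]]) -> Optional[float]:
    """SUM(cost_usd) over all rows / number of distinct tasks with any pass."""
    total_cost = sum(cost for _, cost, _ in rows)
    passed_tasks = {task_id for task_id, _, passed in rows if passed}
    if not passed_tasks:
        return None  # assumption: undefined when no task was solved
    return total_cost / len(passed_tasks)


# Hypothetical rows: t1 fails once then passes, t2 fails.
rows = [("t1", 0.5, False), ("t1", 0.5, True), ("t2", 1.0, False)]
usd_per_pass = cost_per_pass(rows)  # 2.0 USD spent / 1 solved task = 2.0
```

The same rows give an average cost per task of 1.0 USD (2.0 USD over 2 distinct tasks), so the gap between the two figures reflects money spent on tasks that were never solved.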
Latency p95
95th-percentile per-task wall time. Captures tail latency.
Formula: 95th percentile of per-task duration_ms across all tasks in all runs.
Use p95 to understand tail (near-worst-case) latency. A low p95 means the model rarely stalls, which matters for automated pipelines with timeouts.
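Percentile definitions differ between tools, and the page does not say which one is used. The sketch below uses the nearest-rank method as an assumption; the helper name and sample data are hypothetical.

```python
from typing import Sequence


def percentile_nearest_rank(values: Sequence[float], p: int) -> float:
    """Nearest-rank percentile: the value at rank ceil(p * n / 100)."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range 1..100")
    ordered = sorted(values)
    # Integer ceiling division avoids floating-point rounding in the rank.
    rank = (p * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Hypothetical durations: 100 ms, 200 ms, ..., 2000 ms (20 tasks).
durations_ms = list(range(100, 2100, 100))
p95 = percentile_nearest_rank(durations_ms, 95)  # rank 19 -> 1900
p50 = percentile_nearest_rank(durations_ms, 50)  # rank 10 -> 1000
```

An interpolating method would give slightly different values on small samples, so exact agreement with the dashboard should not be expected from this sketch.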
Overview
Claude Fable 5 has completed 3 runs, attempting 110 tasks with an average score of 79.8 / 100.
Settings
Generation parameters used across this model's runs. "varies" indicates the value differed between runs.
- Temperature
- varies
- Thinking budget
- varies
- Avg tokens / run (input + output)
- 410,323
- Consistency
- 90%
History
Cost
Failure modes
- AL0104 (114): Syntax error, ')' expected
- AL0000 (68): App generation failed
- AL0111 (27): Semicolon expected. Add a semicolon (;) to terminate the statement.
- AL0360 (17): Text literal was not properly terminated. Use the character ' to terminate the literal.
- AL0107 (8): Syntax error, identifier expected. Provide a valid name (letters, digits, and underscores only).
- AL0169 (8): The option value 'ReadOnly' is not valid. Check the enum definition for valid values.
- AL0761 (8): An incorrect value was used for the category. One of the values of the enum 2000000001 EventCategory is expected which is available in platform version 22.0.0.0 and higher.
- AL0118 (6): The name 'CreateSequentialGuid' does not exist in the current context.
- AL0198 (6): Expected one of the application object keywords (table, tableextension, page, pageextension, pagecustomization, profile, profileextension, codeunit, report, reportextension, xmlport, query, controladdin, dotnet, enum, enumextension, interface, permissionset, permissionsetextension, entitlement)
- AL0132 (5): 'Record Product' does not contain a definition for 'Product Code'
Shortcomings
AL concepts Claude Fable 5 struggles with. Click a row for description, correct pattern, and observed error codes.
Shortcomings analysis queued
Queued for analysis. This section will populate once the run is processed.
Recent runs
| Started | Run | Tasks | Score | Cost | Duration | Status |
|---|---|---|---|---|---|---|
| 9d ago | 021a60c0-85d… | 101/110 | 81.4 / 100 | $12.37 | 1h 18m | completed |
| 10d ago | 05dc356f-e4b… | 101/110 | 78.7 / 100 | $12.41 | 1h 43m | completed |
| 10d ago | 9ab3a769-521… | 101/110 | 79.4 / 100 | $13.15 | 2h 3m | completed |
Methodology
Pass rate counts unattempted tasks as failures (strict denominator). Avg score is the mean over every attempt the model produced: a failed first try that triggered a retry still contributes its own observation, which pulls the mean down. See the about page for the full breakdown and unit conventions.