Score

Average score per attempt row (0–1). Rewards partial credit.

Formula: Mean of all attempt scores across all results rows: SUM(score) / COUNT(*) over the results table.

Rewards models that make consistent partial progress on hard tasks. Note that on this metric a model scoring 0.5 on every task ties with one that passes half outright and fails the rest; both average 0.5. The advantage shows up on tasks too hard for any full pass.

0.00 (tasks passed: 0/0; passed on 1st attempt: 0, on 2nd attempt: 0, failed: 0)
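
As a concrete illustration (not the leaderboard's actual implementation), here is a minimal Python sketch of the average, assuming each results row is a record with a numeric score field:

    # Mean attempt score: SUM(score) / COUNT(*) over all results rows.
    # `rows` is a hypothetical list of attempt records.
    def mean_score(rows):
        if not rows:
            return 0.0  # the page shows 0.00 when no data exists
        return sum(r["score"] for r in rows) / len(rows)

    print(mean_score([{"score": 1.0}, {"score": 0.5}, {"score": 0.0}]))  # 0.5
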
Cost / run

Average total LLM cost per benchmark run in USD.

Formula: SUM(cost_usd) / run_count across all runs for this model.

Use to compare operating cost across models with similar pass rates. Does not account for quality. Combine with $/Pass for a cost-efficiency view.

$0.00
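
The same average as a Python sketch, assuming one record per run carrying its cost_usd total (field name taken from the formula above):

    # Average LLM cost per run: SUM(cost_usd) / run_count.
    def cost_per_run(runs):
        if not runs:
            return 0.0
        return sum(r["cost_usd"] for r in runs) / len(runs)

    print(cost_per_run([{"cost_usd": 1.20}, {"cost_usd": 0.80}]))  # 1.0
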
Latency p50

Median per-task wall time (LLM call + compile + test), in milliseconds.

Formula: 50th percentile of per-task duration_ms: LLM latency + compile time + test time.

Use p50 for a typical-case latency expectation. Unaffected by outlier slow tasks.

0ms
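
A minimal sketch, assuming per-task durations are collected in milliseconds; Python's statistics.median computes the 50th percentile directly:

    import statistics

    # p50 of per-task duration_ms (LLM latency + compile + test).
    def latency_p50(durations_ms):
        return statistics.median(durations_ms) if durations_ms else 0

    print(latency_p50([90, 120, 450, 3000]))  # 285.0
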
Pass Rate

Fraction of distinct tasks solved in any attempt across all runs.

Formula: (tasks_passed_attempt_1 + tasks_passed_attempt_2_only) / tasks_attempted_distinct

Primary ranking metric. Compare models here first: it directly measures how often the model delivers working code.

0.0% (95% CI: 0.0–100.0%)
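
A sketch of both the rate and its interval, assuming attempt records carry a task_id and a passed flag, and assuming the CI is a Wilson score interval (the page does not say which method it uses):

    import math

    # Distinct tasks passed in any attempt / distinct tasks attempted.
    def pass_rate(attempts):
        attempted = {a["task_id"] for a in attempts}
        passed = {a["task_id"] for a in attempts if a["passed"]}
        return len(passed) / len(attempted) if attempted else 0.0

    # Wilson 95% score interval (assumed method; z = 1.96).
    def wilson_ci(successes, n, z=1.96):
        if n == 0:
            return (0.0, 1.0)  # matches the [0.0-100.0]% shown with no data
        p = successes / n
        denom = 1 + z * z / n
        center = (p + z * z / (2 * n)) / denom
        half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
        return (max(0.0, center - half), min(1.0, center + half))
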
pass^n (strict)

Fraction of tasks the model solved in every single run (strict consistency).

Formula: tasks where ALL runs produced a passing result / tasks_attempted_distinct

Measures reliability under repetition. A high pass^n means the model is unlikely to regress on a re-run, which matters for CI and production use.

0.0%
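
A sketch of the strict-consistency check, assuming one hypothetical (task_id, passed) record per task per run:

    from collections import defaultdict

    # Fraction of attempted tasks that passed in every run.
    def pass_n_strict(records):
        outcomes = defaultdict(list)
        for r in records:
            outcomes[r["task_id"]].append(r["passed"])
        if not outcomes:
            return 0.0
        always = sum(1 for results in outcomes.values() if all(results))
        return always / len(outcomes)
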
$/Pass

Total cost divided by number of distinct tasks passed. Lower is better.

Formula: SUM(cost_usd) / tasks_passed_distinct across all runs.

The best single cost-efficiency metric: it penalizes expensive models that pass few tasks and rewards cheap models with high pass rates.

N/A (no tasks passed yet)
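
A sketch of the ratio, including the undefined case when nothing has passed yet:

    # Total cost / distinct tasks passed. Lower is better.
    def cost_per_pass(total_cost_usd, tasks_passed_distinct):
        if tasks_passed_distinct == 0:
            return float("nan")  # rendered as N/A on the page
        return total_cost_usd / tasks_passed_distinct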

Latency p95

95th-percentile per-task wall time. Captures tail latency.

Formula: 95th percentile of per-task duration_ms across all tasks in all runs.

Use p95 to understand worst-case latency. A low p95 means the model rarely stalls, which matters for automated pipelines with timeouts.

0ms
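
A sketch using the nearest-rank method (an assumption; the page does not specify how the percentile is interpolated):

    import math

    # 95th percentile of per-task duration_ms, nearest-rank.
    def latency_p95(durations_ms):
        if not durations_ms:
            return 0
        xs = sorted(durations_ms)
        return xs[math.ceil(0.95 * len(xs)) - 1]

    print(latency_p95(list(range(1, 101))))  # 95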

Overview

xAI: Grok 4.3 has completed 0 runs, attempting 0 tasks, with an average score of 0.00.

Settings

Generation parameters used across this model's runs. "varies" indicates the value differed between runs.

Temperature: varies
Thinking budget: varies
Avg tokens / run: no data
Consistency: no data

History

(Score history chart: y-axis 0.0–1.0. No runs to plot yet.)

Cost

No cost data yet.

Shortcomings

AL concepts that xAI: Grok 4.3 struggles with. Click a row for a description, the correct pattern, and observed error codes.

No shortcomings analyzed yet

Shortcomings analysis is on the roadmap. The first analyzer run is scheduled for the P8 release; until then, this section reflects no data.

See methodology

Recent runs

Started | Model | Tasks | Score | Cost | Duration | Status
(no runs yet)

See all 0 runs →

Methodology

Scores are computed per task, averaged across attempts. See the about page for details.