Each coding agent is summarised by its honest pass@1: the share of its trials that the task's own verifier marked as solved (exit 0), with reward hacks and broken tasks audited out. The strip below defines every metric; the per-category breakdown is secondary and collapsed by default.
Run 2026-06-22 · 674 trials / 70 tasks · SyncVals 0.1.0 · commit 327c807. Offline by default: building this board made no model, verifier, or network calls.
| Agent · model · scaffold | pass@1 (95% CI) | n | tasks | outcome mix | cost |
|---|---|---|---|---|---|
| claude-code · claude-opus-4-8 · Claude Code CLI | 45.5% [40.7, 50.5] | 391 | 36 | | |
| claude-code · claude-opus-4-7 · Claude Code CLI | 31.1% [24.8, 38.2] | 180 | 19 | | |
| codex · gpt-5.5 · Codex CLI | 84.9% [72.9, 92.1] | 53 | 12 | | |
| claude-code · claude-sonnet-4-6 · Claude Code CLI | 33.3% [19.2, 51.2] | 30 | 3 | | |
| claude-code · claude-haiku-4-5 · Claude Code CLI | 95.0% [76.4, 99.1] | 20 | 2 | | $0.142 |
A bird's-eye view of where the bench has signal. Live = ≥20 scored trials (rank-grade); preview = thin data (wide CIs); pending = scoped, not yet run. Each category's full ranked sub-board follows below, in the same order as this table.
| Category | Status | tasks | trials | solved-rate | leading agent |
|---|---|---|---|---|---|
| Software Engineering | live | 8/8 | 85 | 57.6% 49/85 | claude-code 50.0% |
| Mechanical Engineering | live | 8/8 | 80 | 35.0% 28/80 | claude-code 35.0% |
| Cloud Operations | live | 10/10 | 100 | 32.0% 32/100 | claude-code 32.0% |
| Electrical Engineering | live | 7/7 | 28 | 71.4% 20/28 | codex 71.4% |
| STEM | live | 5/5 | 81 | 72.8% 59/81 | claude-code 72.8% |
| Game | live | 10/10 | 105 | 43.8% 46/105 | claude-code 35.0% |
| Cyber Security | live | 9/9 | 80 | 30.0% 24/80 | claude-code 30.0% |
| Debugging | preview | 3/3 | 15 | 100.0% 15/15 | codex 100.0% |
| ML Engineering | preview | 1/1 | 10 | 40.0% 4/10 | claude-code 40.0% |
| Scientific ML | live | 3/3 | 30 | 33.3% 10/30 | claude-code 33.3% |
| Data Science | live | 2/2 | 20 | 25.0% 5/20 | claude-code 25.0% |
| Data Science: Robustness | live | 2/2 | 20 | 40.0% 8/20 | claude-code 40.0% |
| Product Data Science | preview | 1/1 | 10 | 40.0% 4/10 | claude-code 40.0% |
| Pharmacometrics | preview | 1/1 | 10 | 40.0% 4/10 | claude-code 40.0% |
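The status labels in the table above can be sketched as a simple rule on the number of scored trials. This is a minimal illustration, not the board's actual implementation: the function name `board_status` is hypothetical, and treating any non-zero count below 20 as "preview" and a zero count as "pending" is an assumption drawn from the definitions given above.

```python
def board_status(scored_trials: int, live_threshold: int = 20) -> str:
    """Classify a category by how much scored data it has.

    Assumed rule (hypothetical helper, not the board's own code):
      live    -> at least `live_threshold` scored trials (rank-grade)
      preview -> some scored trials, but fewer than the threshold
      pending -> scoped, but no scored trials yet
    """
    if scored_trials < 0:
        raise ValueError("scored_trials must be non-negative")
    if scored_trials >= live_threshold:
        return "live"
    if scored_trials > 0:
        return "preview"
    return "pending"


# Trial counts taken from the category table above.
for name, trials in [("Electrical Engineering", 28), ("Debugging", 15)]:
    print(f"{name}: {board_status(trials)}")
```

Under this rule, Electrical Engineering (28 trials) is live and Debugging (15 trials) is preview, which agrees with the table.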
Software Engineering. Algorithms, data structures, bug fixes, and API/systems implementation, graded against hidden test suites the agent never sees.
| Agent · model · scaffold | pass@1 (95% CI) | n | tasks | outcome mix | cost |
|---|---|---|---|---|---|
| claude-code · claude-opus-4-8 · Claude Code CLI | 50.0% [38.6, 61.4] | 70 | 7 | | |
| claude-code · claude-haiku-4-5 · Claude Code CLI | 90.0% [59.6, 98.2] | 10 | 1 | | $0.214 |
| codex · gpt-5.5 · Codex CLI | 100.0% [56.6, 100.0] | 5 | 1 | | |
Mechanical Engineering. Structural and numerical solvers (FD/FE, contact dynamics) in C++, checked against multi-binary hidden references.
| Agent · model · scaffold | pass@1 (95% CI) | n | tasks | outcome mix | cost |
|---|---|---|---|---|---|
| claude-code · claude-opus-4-8 · Claude Code CLI | 35.0% [25.5, 45.9] | 80 | 8 | | |
Cloud Operations. Provision, diagnose, and guardrail cloud infrastructure (AWS), verified against the deployed state.
| Agent · model · scaffold | pass@1 (95% CI) | n | tasks | outcome mix | cost |
|---|---|---|---|---|---|
| claude-code · claude-opus-4-7 · Claude Code CLI | 32.0% [23.7, 41.7] | 100 | 10 | | |
Electrical Engineering. RTL / digital-logic design checked under hardened simulation and formal-equivalence harnesses.
| Agent · model · scaffold | pass@1 (95% CI) | n | tasks | outcome mix | cost |
|---|---|---|---|---|---|
| codex · gpt-5.5 · Codex CLI | 71.4% [52.9, 84.7] | 28 | 7 | | |
STEM. Quantitative reasoning across math, physics, and the natural sciences with deterministic, checkable answers.
| Agent · model · scaffold | pass@1 (95% CI) | n | tasks | outcome mix | cost |
|---|---|---|---|---|---|
| claude-code · claude-opus-4-8 · Claude Code CLI | 72.8% [62.3, 81.3] | 81 | 5 | | |
Game. Interactive simulation and game-logic tasks graded on exact state transitions and rule fidelity.
| Agent · model · scaffold | pass@1 (95% CI) | n | tasks | outcome mix | cost |
|---|---|---|---|---|---|
| claude-code · claude-opus-4-8 · Claude Code CLI | 35.0% [24.2, 47.6] | 60 | 6 | | |
| claude-code · claude-sonnet-4-6 · Claude Code CLI | 33.3% [19.2, 51.2] | 30 | 3 | | |
| claude-code · claude-haiku-4-5 · Claude Code CLI | 100.0% [72.2, 100.0] | 10 | 1 | | $0.069 |
| codex · gpt-5.5 · Codex CLI | 100.0% [56.6, 100.0] | 5 | 1 | | |
Cyber Security. Vulnerability discovery, exploitation, and remediation verified against a concrete security objective.
| Agent · model · scaffold | pass@1 (95% CI) | n | tasks | outcome mix | cost |
|---|---|---|---|---|---|
| claude-code · claude-opus-4-7 · Claude Code CLI | 30.0% [21.1, 40.8] | 80 | 9 | | |
Debugging. Real-world bug-fix tasks distilled from open-source issues (esbuild, klauspost/compress, rust-lang/semver), graded against each project's own tests.
Preview: thin data; rates carry wide confidence intervals (shown). Read as directional, not yet rank-grade.
| Agent · model · scaffold | pass@1 (95% CI) | n | tasks | outcome mix | cost |
|---|---|---|---|---|---|
| codex · gpt-5.5 · Codex CLI | 100.0% [79.6, 100.0] | 15 | 3 | | |
ML Engineering. Train and tune models to a target metric on real engineering datasets, graded on held-out performance against a solved-reward threshold.
Preview: thin data; rates carry wide confidence intervals (shown). Read as directional, not yet rank-grade.
| Agent · model · scaffold | pass@1 (95% CI) | n | tasks | outcome mix | cost |
|---|---|---|---|---|---|
| claude-code · claude-opus-4-8 · Claude Code CLI | 40.0% [16.8, 68.7] | 10 | 1 | | |
Scientific ML. Physics-informed and surrogate modeling: PDE forecasting and CFD/FEA prediction, graded on quantitative predictive accuracy.
| Agent · model · scaffold | pass@1 (95% CI) | n | tasks | outcome mix | cost |
|---|---|---|---|---|---|
| claude-code · claude-opus-4-8 · Claude Code CLI | 33.3% [19.2, 51.2] | 30 | 3 | | |
Data Science. Modeling and statistical inference on real datasets (federated learning, event studies), graded against held-out ground truth.
| Agent · model · scaffold | pass@1 (95% CI) | n | tasks | outcome mix | cost |
|---|---|---|---|---|---|
| claude-code · claude-opus-4-8 · Claude Code CLI | 25.0% [11.2, 46.9] | 20 | 2 | | |
Data Science: Robustness. Bias correction and outlier-robust analysis in R: recover trustworthy estimates from messy data, checked against reference results.
| Agent · model · scaffold | pass@1 (95% CI) | n | tasks | outcome mix | cost |
|---|---|---|---|---|---|
| claude-code · claude-opus-4-8 · Claude Code CLI | 40.0% [21.9, 61.3] | 20 | 2 | | |
Product Data Science. Applied product analytics: causal impact and decision analysis on real product datasets (R).
Preview: thin data; rates carry wide confidence intervals (shown). Read as directional, not yet rank-grade.
| Agent · model · scaffold | pass@1 (95% CI) | n | tasks | outcome mix | cost |
|---|---|---|---|---|---|
| claude-code · claude-opus-4-8 · Claude Code CLI | 40.0% [16.8, 68.7] | 10 | 1 | | |
Pharmacometrics. Population PK/PD modeling with nonlinear mixed-effects: fit drug-exposure models and recover the correct parameters.
Preview: thin data; rates carry wide confidence intervals (shown). Read as directional, not yet rank-grade.
| Agent · model · scaffold | pass@1 (95% CI) | n | tasks | outcome mix | cost |
|---|---|---|---|---|---|
| claude-code · claude-opus-4-8 · Claude Code CLI | 40.0% [16.8, 68.7] | 10 | 1 | | |
pass@1 = resolved trials / scored trials. pass@k = the unbiased estimator of Chen et al. over a task's k attempts (k shown). CI = 95% Wilson score interval. Every figure is computed offline from the local control plane; see Methodology.
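The metric definitions above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the board's own code: the function names `wilson_interval` and `pass_at_k` are hypothetical, the critical value z = 1.96 is assumed for the 95% level, and the pass@k formula is the standard unbiased form 1 − C(n−c, k)/C(n, k) for n attempts with c resolved. How the board aggregates pass@k across tasks is not stated here, so the sketch covers a single task only.

```python
from math import comb, sqrt


def wilson_interval(resolved: int, scored: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 assumed for 95%)."""
    if scored <= 0:
        raise ValueError("need at least one scored trial")
    p = resolved / scored                      # pass@1 = resolved / scored trials
    denom = 1 + z * z / scored
    center = (p + z * z / (2 * scored)) / denom
    half = z * sqrt(p * (1 - p) / scored + z * z / (4 * scored * scored)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at p = 0 or p = 1.
    return max(0.0, center - half), min(1.0, center + half)


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one task: n attempts, c resolved, budget k <= n."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0                             # every size-k draw contains a success
    return 1.0 - comb(n - c, k) / comb(n, k)


# A row with 5 of 5 trials resolved (100.0%), as in the codex sub-board rows.
lo, hi = wilson_interval(5, 5)
print(f"{lo * 100:.1f}, {hi * 100:.1f}")
```

For a row with 5 of 5 trials resolved, the sketch gives a lower bound of about 56.6% and an upper bound of 100.0%, in line with the intervals shown for the 5-trial codex rows. With k = 1, `pass_at_k` reduces to c / n, which is the per-task pass@1.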