SyncVals: verifier → artifact → classifier → verdict

Coding-Agent Results

Each coding agent is summarised on honest pass@1: the share of its trials that the task's own verifier marked as solved (exit code 0), with reward-hacks and broken tasks audited out. The strip below defines every metric; the per-category breakdown is secondary and collapsed by default.

Coverage overview, not a head-to-head ranking. Each agent ran a different subset of tasks (68 of 70 tasks were attempted by a single agent), so pass@1 is not directly comparable across agents. Read each row as "how this agent did on the tasks it ran", not "agent A beat agent B".
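To make the comparability caveat concrete, the sketch below recomputes the codex / gpt-5.5 line from the per-category counts reported in this board's per-category detail (5/5, 20/28, 5/5, 15/15). The names `codex_counts` and `pooled_pass_at_1` are illustrative assumptions, not part of SyncVals.

```python
# Per-category (solved, scored) counts for codex / gpt-5.5, as reported on this board.
codex_counts = {
    "Software Engineering": (5, 5),
    "Electrical Engineering": (20, 28),
    "Game": (5, 5),
    "Debugging": (15, 15),
}


def pooled_pass_at_1(counts):
    """Trial-weighted pass@1: total solved divided by total scored trials."""
    solved = sum(s for s, _ in counts.values())
    scored = sum(n for _, n in counts.values())
    return solved / scored


pooled = pooled_pass_at_1(codex_counts)
per_category = {name: s / n for name, (s, n) in codex_counts.items()}

print(f"pooled: {pooled:.1%}")
for name, rate in per_category.items():
    print(f"{name}: {rate:.1%}")
```

The pooled figure sits between the lowest and highest category rates, so it moves with the mix of tasks an agent happened to run even if its per-category performance stays the same.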

Run 2026-06-22 · 674 trials / 70 task(s) · SyncVals 0.1.0 · commit 327c807. Offline by default: building this board made no model, verifier, or network calls.

5 agent/model line(s) over 674 scored trials, each on a different task subset. Most-evaluated: claude-code / claude-opus-4-8 (n=391, 45.5% on its own tasks, 95% CI [40.7, 50.5]); pass@1 is not comparable across agents that ran different task subsets. Category coverage: 10 live, 4 preview.
How to read this: the ranked unit and each metric, defined inline
Ranked unit: agent × model. One line per coding agent paired with the model that drove it.
Metric: pass@1. Solved ÷ scored trials. Solved = the task's own verifier exits 0.
n: scored trials. The denominator. Fewer than 20 → flagged preview, CI stays wide.
Honesty: 95% CI + outcomes. Wilson 95% band (the whisker) on every rate; the bar splits each agent's runs into the 5 outcomes: green honest, red reward-hack / bad-task, amber harness.
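As a minimal sketch of the Wilson 95% band named above, the function below computes the interval from a solved count and a scored-trial count. The function name and signature are assumptions for illustration; the formula is the standard Wilson score interval, not SyncVals' own code.

```python
import math
from statistics import NormalDist


def wilson_interval(solved, scored, confidence=0.95):
    """Return (low, high) of the Wilson score interval for solved/scored."""
    if scored <= 0:
        raise ValueError("scored must be positive")
    # Two-sided normal quantile, about 1.96 for a 95% interval.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = solved / scored
    denom = 1 + z * z / scored
    center = (p + z * z / (2 * scored)) / denom
    half = z * math.sqrt(p * (1 - p) / scored + z * z / (4 * scored ** 2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# A 95.0% rate at n=20 corresponds to 19 solved out of 20 scored trials.
low, high = wilson_interval(19, 20)
print(f"pass@1 = {19 / 20:.1%}, 95% CI [{low * 100:.1f}, {high * 100:.1f}]")
```

For 19 of 20 this reproduces the [76.4, 99.1] band shown for the n=20 preview line. Unlike the simple normal approximation, the Wilson interval stays inside [0, 1] and remains informative at 100% with small n.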
Results by agent: all 5 scored agent×model lines, each on the tasks it ran (not a cross-agent ranking)
Outcome bar: GOOD SUCCESS, BAD SUCCESS, GOOD FAILURE, BAD FAILURE, HARNESS ERROR. Green = honest, red = reward-hack / bad task, amber = harness/budget.
| Agent · model · scaffold | pass@1 (95% CI) | n | tasks | outcome mix | cost |
| --- | --- | --- | --- | --- | --- |
| claude-code · claude-opus-4-8 · Claude Code CLI | 45.5% [40.7, 50.5] | 391 | 36 | (bar) | - |
| claude-code · claude-opus-4-7 · Claude Code CLI | 31.1% [24.8, 38.2] | 180 | 19 | (bar) | - |
| codex · gpt-5.5 · Codex CLI | 84.9% [72.9, 92.1] | 53 | 12 | (bar) | - |
| claude-code · claude-sonnet-4-6 · Claude Code CLI | 33.3% [19.2, 51.2] | 30 | 3 | (bar) | - |
| claude-code · claude-haiku-4-5 · Claude Code CLI (n=20 · preview) | 95.0% [76.4, 99.1] | 20 | 2 | (bar) | $0.142 |

Coverage by category: 10 of 14 categories have rank-grade data; the rest are preview / scoped

A bird's-eye view of where the bench has signal. Live = ≥20 scored trials (rank-grade); preview = thin (wide CIs); pending = scoped, not yet run. Click a category to open its full ranked sub-board below.
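The status rule above can be sketched as a small classifier. This is an illustration only: the function name is hypothetical, and treating "pending" as zero scored trials is an assumption read from "scoped, not yet run".

```python
def category_status(scored_trials, live_threshold=20):
    """Classify a category as 'live', 'preview', or 'pending' by trial count."""
    if scored_trials < 0:
        raise ValueError("scored_trials must be non-negative")
    if scored_trials == 0:
        return "pending"   # assumption: scoped but not yet run
    if scored_trials >= live_threshold:
        return "live"      # rank-grade: at least 20 scored trials
    return "preview"       # thin data, wide confidence intervals


# Trial counts taken from the coverage table on this board.
for name, trials in {"Software Engineering": 85, "Debugging": 15, "Data Science": 20}.items():
    print(f"{name}: {category_status(trials)}")
```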

| Category | Status | tasks | trials | solved-rate | leading agent |
| --- | --- | --- | --- | --- | --- |
| Software Engineering | live | 8/8 | 85 | 57.6% (49/85) | claude-code 50.0% |
| Mechanical Engineering | live | 8/8 | 80 | 35.0% (28/80) | claude-code 35.0% |
| Cloud Operations | live | 10/10 | 100 | 32.0% (32/100) | claude-code 32.0% |
| Electrical Engineering | live | 7/7 | 28 | 71.4% (20/28) | codex 71.4% |
| STEM | live | 5/5 | 81 | 72.8% (59/81) | claude-code 72.8% |
| Game | live | 10/10 | 105 | 43.8% (46/105) | claude-code 35.0% |
| Cyber Security | live | 9/9 | 80 | 30.0% (24/80) | claude-code 30.0% |
| Debugging | preview | 3/3 | 15 | 100.0% (15/15) | codex 100.0% |
| ML Engineering | preview | 1/1 | 10 | 40.0% (4/10) | claude-code 40.0% |
| Scientific ML | live | 3/3 | 30 | 33.3% (10/30) | claude-code 33.3% |
| Data Science | live | 2/2 | 20 | 25.0% (5/20) | claude-code 25.0% |
| Data Science: Robustness | live | 2/2 | 20 | 40.0% (8/20) | claude-code 40.0% |
| Product Data Science | preview | 1/1 | 10 | 40.0% (4/10) | claude-code 40.0% |
| Pharmacometrics | preview | 1/1 | 10 | 40.0% (4/10) | claude-code 40.0% |

Per-category detail: expand a category for its own ranked sub-board
Software Engineering · live · 8 task(s) · 49/85 resolved · 85 trials

Algorithms, data structures, bug-fixes, and API/systems implementation, graded against hidden test suites the agent never sees.

| Agent · model · scaffold | pass@1 (95% CI) | n | tasks | outcome mix | cost |
| --- | --- | --- | --- | --- | --- |
| claude-code · claude-opus-4-8 · Claude Code CLI | 50.0% [38.6, 61.4] | 70 | 7 | (bar) | - |
| claude-code · claude-haiku-4-5 · Claude Code CLI (n=10 · preview) | 90.0% [59.6, 98.2] | 10 | 1 | (bar) | $0.214 |
| codex · gpt-5.5 · Codex CLI (n=5 · preview) | 100.0% [56.6, 100.0] | 5 | 1 | (bar) | - |

Mechanical Engineering · live · 8 task(s) · 28/80 resolved · 80 trials

Structural and numerical solvers (FD/FE, contact dynamics) in C++, checked against multi-binary hidden references.

| Agent · model · scaffold | pass@1 (95% CI) | n | tasks | outcome mix | cost |
| --- | --- | --- | --- | --- | --- |
| claude-code · claude-opus-4-8 · Claude Code CLI | 35.0% [25.5, 45.9] | 80 | 8 | (bar) | - |

Cloud Operations · live · 10 task(s) · 32/100 resolved · 100 trials

Provision, diagnose, and guardrail cloud infrastructure (AWS), verified against the deployed state.

| Agent · model · scaffold | pass@1 (95% CI) | n | tasks | outcome mix | cost |
| --- | --- | --- | --- | --- | --- |
| claude-code · claude-opus-4-7 · Claude Code CLI | 32.0% [23.7, 41.7] | 100 | 10 | (bar) | - |

Electrical Engineering · live · 7 task(s) · 20/28 resolved · 28 trials

RTL / digital-logic design checked under hardened simulation and formal-equivalence harnesses.

| Agent · model · scaffold | pass@1 (95% CI) | n | tasks | outcome mix | cost |
| --- | --- | --- | --- | --- | --- |
| codex · gpt-5.5 · Codex CLI | 71.4% [52.9, 84.7] | 28 | 7 | (bar) | - |

STEM · live · 5 task(s) · 59/81 resolved · 81 trials

Quantitative reasoning across math, physics, and the natural sciences with deterministic, checkable answers.

| Agent · model · scaffold | pass@1 (95% CI) | n | tasks | outcome mix | cost |
| --- | --- | --- | --- | --- | --- |
| claude-code · claude-opus-4-8 · Claude Code CLI | 72.8% [62.3, 81.3] | 81 | 5 | (bar) | - |

Game · live · 10 task(s) · 46/105 resolved · 105 trials

Interactive simulation and game-logic tasks graded on exact state transitions and rule fidelity.

| Agent · model · scaffold | pass@1 (95% CI) | n | tasks | outcome mix | cost |
| --- | --- | --- | --- | --- | --- |
| claude-code · claude-opus-4-8 · Claude Code CLI | 35.0% [24.2, 47.6] | 60 | 6 | (bar) | - |
| claude-code · claude-sonnet-4-6 · Claude Code CLI | 33.3% [19.2, 51.2] | 30 | 3 | (bar) | - |
| claude-code · claude-haiku-4-5 · Claude Code CLI (n=10 · preview) | 100.0% [72.2, 100.0] | 10 | 1 | (bar) | $0.069 |
| codex · gpt-5.5 · Codex CLI (n=5 · preview) | 100.0% [56.6, 100.0] | 5 | 1 | (bar) | - |

Cyber Security · live · 9 task(s) · 24/80 resolved · 80 trials

Vulnerability discovery, exploitation, and remediation verified against a concrete security objective.

| Agent · model · scaffold | pass@1 (95% CI) | n | tasks | outcome mix | cost |
| --- | --- | --- | --- | --- | --- |
| claude-code · claude-opus-4-7 · Claude Code CLI | 30.0% [21.1, 40.8] | 80 | 9 | (bar) | - |

Debugging · preview · 3 task(s) · 15/15 resolved · 15 trials

Real-world bug-fix tasks distilled from open-source issues (esbuild, klauspost/compress, rust-lang/semver), graded against each project's own tests.

Preview: thin data; rates carry wide confidence intervals (shown). Read as directional, not yet rank-grade.

| Agent · model · scaffold | pass@1 (95% CI) | n | tasks | outcome mix | cost |
| --- | --- | --- | --- | --- | --- |
| codex · gpt-5.5 · Codex CLI (n=15 · preview) | 100.0% [79.6, 100.0] | 15 | 3 | (bar) | - |

ML Engineering · preview · 1 task(s) · 4/10 resolved · 10 trials

Train and tune models to a target metric on real engineering datasets, graded on held-out performance against a solved-reward threshold.

Preview: thin data; rates carry wide confidence intervals (shown). Read as directional, not yet rank-grade.

| Agent · model · scaffold | pass@1 (95% CI) | n | tasks | outcome mix | cost |
| --- | --- | --- | --- | --- | --- |
| claude-code · claude-opus-4-8 · Claude Code CLI (n=10 · preview) | 40.0% [16.8, 68.7] | 10 | 1 | (bar) | - |

Scientific ML · live · 3 task(s) · 10/30 resolved · 30 trials

Physics-informed and surrogate modeling: PDE forecasting and CFD/FEA prediction, graded on quantitative predictive accuracy.

| Agent · model · scaffold | pass@1 (95% CI) | n | tasks | outcome mix | cost |
| --- | --- | --- | --- | --- | --- |
| claude-code · claude-opus-4-8 · Claude Code CLI | 33.3% [19.2, 51.2] | 30 | 3 | (bar) | - |

Data Science · live · 2 task(s) · 5/20 resolved · 20 trials

Modeling and statistical inference on real datasets (federated learning, event studies), graded against held-out ground truth.

| Agent · model · scaffold | pass@1 (95% CI) | n | tasks | outcome mix | cost |
| --- | --- | --- | --- | --- | --- |
| claude-code · claude-opus-4-8 · Claude Code CLI (n=20 · preview) | 25.0% [11.2, 46.9] | 20 | 2 | (bar) | - |

Data Science: Robustness · live · 2 task(s) · 8/20 resolved · 20 trials

Bias-correction and outlier-robust analysis in R: recover trustworthy estimates from messy data, checked against reference results.

| Agent · model · scaffold | pass@1 (95% CI) | n | tasks | outcome mix | cost |
| --- | --- | --- | --- | --- | --- |
| claude-code · claude-opus-4-8 · Claude Code CLI (n=20 · preview) | 40.0% [21.9, 61.3] | 20 | 2 | (bar) | - |

Product Data Science · preview · 1 task(s) · 4/10 resolved · 10 trials

Applied product analytics: causal impact and decision analysis on real product datasets (R).

Preview: thin data; rates carry wide confidence intervals (shown). Read as directional, not yet rank-grade.

| Agent · model · scaffold | pass@1 (95% CI) | n | tasks | outcome mix | cost |
| --- | --- | --- | --- | --- | --- |
| claude-code · claude-opus-4-8 · Claude Code CLI (n=10 · preview) | 40.0% [16.8, 68.7] | 10 | 1 | (bar) | - |

Pharmacometrics · preview · 1 task(s) · 4/10 resolved · 10 trials

Population PK/PD modeling with nonlinear mixed-effects: fit drug-exposure models and recover the correct parameters.

Preview: thin data; rates carry wide confidence intervals (shown). Read as directional, not yet rank-grade.

| Agent · model · scaffold | pass@1 (95% CI) | n | tasks | outcome mix | cost |
| --- | --- | --- | --- | --- | --- |
| claude-code · claude-opus-4-8 · Claude Code CLI (n=10 · preview) | 40.0% [16.8, 68.7] | 10 | 1 | (bar) | - |


pass@1 = resolved / scored trials. pass@k = unbiased Chen-et-al. estimator over a task's k attempts (k shown). CI = Wilson score 95%. Every figure is computed offline from the local control plane; see Methodology.
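For reference, the Chen et al. estimator named in the footer is usually written as 1 - C(n-c, k) / C(n, k), where n is the number of attempts on a task and c the number that were solved. The sketch below implements that standard form; how SyncVals chooses n and k per task is not stated on this page, so the argument names and the example values are assumptions.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n attempts, c of which were solved."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures: every size-k subset contains a solved attempt.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical task with 10 attempts, 4 of them solved.
print(pass_at_k(10, 4, 1))  # k=1 reduces to the plain solved fraction c/n
print(pass_at_k(10, 4, 5))
```

With k = 1 the estimator collapses to c / n, which is why the per-trial pass@1 figures on this board are simple ratios.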