SyncVals · Results

Coding-Agent Results

Each coding agent is summarised by its honest pass@1: the share of its trials in which the task's own verifier passed (exit 0), with reward-hacks and broken tasks audited out. The strip below defines every metric; the per-category breakdown is secondary and collapsed by default.

Coverage overview, not a head-to-head ranking. Each agent ran a different subset of tasks (68 of 70 tasks were attempted by a single agent), so pass@1 is not directly comparable across agents. Read each row as "how this agent did on the tasks it ran", not "agent A beat agent B".

Run 2026-06-22 · 674 trials / 70 tasks · SyncVals 0.1.0 · commit 327c807. Offline by default: building this board made no model, verifier, or network call.

5 agent/model lines over 674 scored trials, each on a different task subset. Most-evaluated: claude-code / claude-opus-4-8 (n=391, 45.5% on its own tasks, 95% CI [40.7, 50.5]). pass@1 is not comparable across agents that ran different task subsets. Category coverage: 10 live, 4 preview.

How to read this: the ranked unit and each metric, defined inline

Ranked unit: agent × model. One line per coding agent paired with the model that drove it.

Metric: pass@1. Solved ÷ scored trials. Solved = the task's own verifier exits 0.

n: scored trials. The denominator. Fewer than 20 → flagged preview, CI stays wide.

Honesty: 95% CI + outcomes. Wilson 95% band (the whisker) on every rate; the bar splits each agent's runs into the 5 outcomes: green = honest, red = reward-hack / bad task, amber = harness.
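The two definitions above (pass@1 as solved ÷ scored trials, and the Wilson 95% band) can be checked by hand. The following is a minimal Python sketch, not the board's own code: the function names `pass_at_1` and `wilson_interval` are hypothetical, and z = 1.959964 is assumed as the two-sided 95% normal quantile. The example uses 19 solved out of 20 scored trials, which is what a 95.0% rate on n=20 implies.

```python
import math

Z_95 = 1.959964  # assumed two-sided 95% standard-normal quantile


def pass_at_1(solved: int, scored: int) -> float:
    """Solved trials divided by scored trials."""
    if scored <= 0:
        raise ValueError("need at least one scored trial")
    return solved / scored


def wilson_interval(solved: int, scored: int, z: float = Z_95) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion, as fractions in [0, 1]."""
    p = pass_at_1(solved, scored)
    z2 = z * z
    denom = 1 + z2 / scored
    centre = (p + z2 / (2 * scored)) / denom
    half = z * math.sqrt(p * (1 - p) / scored + z2 / (4 * scored * scored)) / denom
    # Clamp to guard against floating-point overshoot at p = 0 or p = 1.
    return max(0.0, centre - half), min(1.0, centre + half)


low, high = wilson_interval(19, 20)
print(f"{pass_at_1(19, 20):.1%} [{low * 100:.1f}, {high * 100:.1f}]")
# prints: 95.0% [76.4, 99.1]
```

Unlike the plain normal approximation, the Wilson band stays inside [0, 100] and remains informative at 0% or 100%, which matters for the small-n preview rows.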

Results by agent: all 5 scored agent×model lines, each on the tasks it ran (not a cross-agent ranking)

Outcome bar: GOOD SUCCESS, BAD SUCCESS, GOOD FAILURE, BAD FAILURE, HARNESS ERROR. Green = honest, red = reward-hack / bad task, amber = harness/budget.

Agent · model · scaffold	pass@1 (95% CI)	n	tasks	cost
claude-code · claude-opus-4-8 · Claude Code CLI	45.5% [40.7, 50.5]	391	36	n/a
claude-code · claude-opus-4-7 · Claude Code CLI	31.1% [24.8, 38.2]	180	19	n/a
codex · gpt-5.5 · Codex CLI	84.9% [72.9, 92.1]	53	12	n/a
claude-code · claude-sonnet-4-6 · Claude Code CLI	33.3% [19.2, 51.2]	30	3	n/a
claude-code · claude-haiku-4-5 · Claude Code CLI (n=20 · preview)	95.0% [76.4, 99.1]	20	2	$0.142
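Each agent-level rate in this table pools trials (total solved ÷ total scored) rather than averaging category rates. A small Python sketch of the difference, using the claude-opus-4-7 counts from the Cloud Operations and Cyber Security sub-boards further down; the helper names are hypothetical and the code is illustrative, not the board's implementation.

```python
# (category, solved, scored) for claude-opus-4-7, taken from the category sub-boards.
rows = [
    ("Cloud Operations", 32, 100),
    ("Cyber Security", 24, 80),
]


def pooled_pass_at_1(rows):
    """Trial-weighted rate: total solved over total scored."""
    solved = sum(s for _, s, _ in rows)
    scored = sum(n for _, _, n in rows)
    return solved / scored


def mean_of_category_rates(rows):
    """Unweighted average of per-category rates, shown only for contrast."""
    return sum(s / n for _, s, n in rows) / len(rows)


print(f"pooled:        {pooled_pass_at_1(rows):.1%}")        # 56/180
print(f"mean of rates: {mean_of_category_rates(rows):.1%}")  # (32% + 30%) / 2
```

The pooled figure (56/180) matches the 31.1% reported for this agent; the unweighted mean would give 31.0%, so categories with more trials pull the headline rate toward their own result.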

Coverage by category: 10 of 14 categories have rank-grade data; the rest are preview / scoped

A bird's-eye view of where the bench has signal. Live = ≥20 scored trials (rank-grade); preview = thin (wide CIs); pending = scoped, not yet run. Each category's full ranked sub-board appears below.

Category	Status	tasks	trials	solved-rate	leading agent
Software Engineering	live	8/8	85	57.6% 49/85	claude-code 50.0%
Mechanical Engineering	live	8/8	80	35.0% 28/80	claude-code 35.0%
Cloud Operations	live	10/10	100	32.0% 32/100	claude-code 32.0%
Electrical Engineering	live	7/7	28	71.4% 20/28	codex 71.4%
STEM	live	5/5	81	72.8% 59/81	claude-code 72.8%
Game	live	10/10	105	43.8% 46/105	claude-code 35.0%
Cyber Security	live	9/9	80	30.0% 24/80	claude-code 30.0%
Debugging	preview	3/3	15	100.0% 15/15	codex 100.0%
ML Engineering	preview	1/1	10	40.0% 4/10	claude-code 40.0%
Scientific ML	live	3/3	30	33.3% 10/30	claude-code 33.3%
Data Science	live	2/2	20	25.0% 5/20	claude-code 25.0%
Data Science: Robustness	live	2/2	20	40.0% 8/20	claude-code 40.0%
Product Data Science	preview	1/1	10	40.0% 4/10	claude-code 40.0%
Pharmacometrics	preview	1/1	10	40.0% 4/10	claude-code 40.0%
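The Status column follows from the trial counts and the thresholds stated above. A minimal Python sketch of that rule; the function name `category_status` is hypothetical, and treating "pending" as zero scored trials is an assumption, since the board only describes it as "scoped, not yet run".

```python
def category_status(scored_trials: int, live_threshold: int = 20) -> str:
    """Classify a category by its number of scored trials."""
    if scored_trials <= 0:
        return "pending"   # assumption: scoped but no trials scored yet
    if scored_trials < live_threshold:
        return "preview"   # thin data, wide CIs
    return "live"          # >= 20 scored trials, rank-grade


# Trial counts taken from the coverage table above.
for name, trials in [("Electrical Engineering", 28), ("Debugging", 15), ("Data Science", 20)]:
    print(f"{name}: {category_status(trials)}")
```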

Per-category detail: expand a category for its own ranked sub-board

Software Engineering · live · 8 tasks · 49/85 resolved · 85 trials

Algorithms, data structures, bug-fixes, and API/systems implementation, graded against hidden test suites the agent never sees.

Agent · model · scaffold	pass@1 (95% CI)	n	tasks	cost
claude-code · claude-opus-4-8 · Claude Code CLI	50.0% [38.6, 61.4]	70	7	n/a
claude-code · claude-haiku-4-5 · Claude Code CLI (n=10 · preview)	90.0% [59.6, 98.2]	10	1	$0.214
codex · gpt-5.5 · Codex CLI (n=5 · preview)	100.0% [56.6, 100.0]	5	1	n/a

Mechanical Engineering · live · 8 tasks · 28/80 resolved · 80 trials

Structural and numerical solvers (FD/FE, contact dynamics) in C++, checked against multi-binary hidden references.

Agent · model · scaffold	pass@1 (95% CI)	n	tasks	outcome mix	cost
claude-code · claude-opus-4-8 · Claude Code CLI	35.0% [25.5, 45.9]	80	8		n/a

Cloud Operations · live · 10 tasks · 32/100 resolved · 100 trials

Provision, diagnose, and guardrail cloud infrastructure (AWS), verified against the deployed state.

Agent · model · scaffold	pass@1 (95% CI)	n	tasks	outcome mix	cost
claude-code · claude-opus-4-7 · Claude Code CLI	32.0% [23.7, 41.7]	100	10		n/a

Electrical Engineering · live · 7 tasks · 20/28 resolved · 28 trials

RTL / digital-logic design checked under hardened simulation and formal-equivalence harnesses.

Agent · model · scaffold	pass@1 (95% CI)	n	tasks	outcome mix	cost
codex · gpt-5.5 · Codex CLI	71.4% [52.9, 84.7]	28	7		n/a

STEM · live · 5 tasks · 59/81 resolved · 81 trials

Quantitative reasoning across math, physics, and the natural sciences with deterministic, checkable answers.

Agent · model · scaffold	pass@1 (95% CI)	n	tasks	outcome mix	cost
claude-code · claude-opus-4-8 · Claude Code CLI	72.8% [62.3, 81.3]	81	5		n/a

Game · live · 10 tasks · 46/105 resolved · 105 trials

Interactive simulation and game-logic tasks graded on exact state transitions and rule fidelity.

Agent · model · scaffold	pass@1 (95% CI)	n	tasks	cost
claude-code · claude-opus-4-8 · Claude Code CLI	35.0% [24.2, 47.6]	60	6	n/a
claude-code · claude-sonnet-4-6 · Claude Code CLI	33.3% [19.2, 51.2]	30	3	n/a
claude-code · claude-haiku-4-5 · Claude Code CLI (n=10 · preview)	100.0% [72.2, 100.0]	10	1	$0.069
codex · gpt-5.5 · Codex CLI (n=5 · preview)	100.0% [56.6, 100.0]	5	1	n/a

Cyber Security · live · 9 tasks · 24/80 resolved · 80 trials

Vulnerability discovery, exploitation, and remediation verified against a concrete security objective.

Agent · model · scaffold	pass@1 (95% CI)	n	tasks	outcome mix	cost
claude-code · claude-opus-4-7 · Claude Code CLI	30.0% [21.1, 40.8]	80	9		n/a

Debugging · preview · 3 tasks · 15/15 resolved · 15 trials

Real-world bug-fix tasks distilled from open-source issues (esbuild, klauspost/compress, rust-lang/semver), graded against each project's own tests.

Preview: thin data; rates carry wide confidence intervals (shown). Read as directional, not yet rank-grade.

Agent · model · scaffold	pass@1 (95% CI)	n	tasks	outcome mix	cost
codex · gpt-5.5 · Codex CLI (n=15 · preview)	100.0% [79.6, 100.0]	15	3		n/a

ML Engineering · preview · 1 task · 4/10 resolved · 10 trials

Train and tune models to a target metric on real engineering datasets, graded on held-out performance against a solved-reward threshold.

Preview: thin data; rates carry wide confidence intervals (shown). Read as directional, not yet rank-grade.

Agent · model · scaffold	pass@1 (95% CI)	n	tasks	outcome mix	cost
claude-code · claude-opus-4-8 · Claude Code CLI (n=10 · preview)	40.0% [16.8, 68.7]	10	1		n/a

Scientific ML · live · 3 tasks · 10/30 resolved · 30 trials

Physics-informed and surrogate modeling: PDE forecasting and CFD/FEA prediction, graded on quantitative predictive accuracy.

Agent · model · scaffold	pass@1 (95% CI)	n	tasks	outcome mix	cost
claude-code · claude-opus-4-8 · Claude Code CLI	33.3% [19.2, 51.2]	30	3		n/a

Data Science · live · 2 tasks · 5/20 resolved · 20 trials

Modeling and statistical inference on real datasets (federated learning, event studies), graded against held-out ground truth.

Agent · model · scaffold	pass@1 (95% CI)	n	tasks	outcome mix	cost
claude-code · claude-opus-4-8 · Claude Code CLI (n=20 · preview)	25.0% [11.2, 46.9]	20	2		n/a

Data Science: Robustness · live · 2 tasks · 8/20 resolved · 20 trials

Bias-correction and outlier-robust analysis in R: recover trustworthy estimates from messy data, checked against reference results.

Agent · model · scaffold	pass@1 (95% CI)	n	tasks	outcome mix	cost
claude-code · claude-opus-4-8 · Claude Code CLI (n=20 · preview)	40.0% [21.9, 61.3]	20	2		n/a

Product Data Science · preview · 1 task · 4/10 resolved · 10 trials

Applied product analytics: causal impact and decision analysis on real product datasets (R).

Preview: thin data; rates carry wide confidence intervals (shown). Read as directional, not yet rank-grade.

Agent · model · scaffold	pass@1 (95% CI)	n	tasks	outcome mix	cost
claude-code · claude-opus-4-8 · Claude Code CLI (n=10 · preview)	40.0% [16.8, 68.7]	10	1		n/a

Pharmacometrics · preview · 1 task · 4/10 resolved · 10 trials

Population PK/PD modeling with nonlinear mixed-effects: fit drug-exposure models and recover the correct parameters.

Preview: thin data; rates carry wide confidence intervals (shown). Read as directional, not yet rank-grade.

Agent · model · scaffold	pass@1 (95% CI)	n	tasks	outcome mix	cost
claude-code · claude-opus-4-8 · Claude Code CLI (n=10 · preview)	40.0% [16.8, 68.7]	10	1		n/a

pass@1 = resolved / scored trials. pass@k = the unbiased Chen et al. estimator over a task's k attempts (k shown). CI = Wilson score 95%. Every figure is computed offline from the local control plane; see Methodology.
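For reference, the Chen et al. estimator named above is usually written as pass@k = 1 − C(n−c, k) / C(n, k), where n is the number of attempts on a task and c the number that passed. The sketch below is a minimal Python version under that standard formulation; the function name `pass_at_k` is hypothetical, and how the board aggregates the per-task values is not stated here, so that step is left out.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one task: n attempts, c of them solved, 1 <= k <= n."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures: every size-k draw contains at least one solved attempt.
        return 1.0
    # Probability that a random size-k subset of the n attempts contains no solved attempt.
    return 1.0 - comb(n - c, k) / comb(n, k)


# With k = 1 the estimator reduces to the plain solved fraction c / n.
print(f"{pass_at_k(10, 4, 1):.3f}")  # 4 of 10 attempts solved, k = 1
print(f"{pass_at_k(10, 4, 3):.3f}")  # same task, k = 3
```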