SyncVals: verifier → artifact → classifier → verdict

Analysis & Observations

This page reports what the runs actually show, organized by the five-way taxonomy that is the spine of SyncVals. Every number on this page is computed from this build's 674 trials across 70 tasks; nothing here is hardcoded. For researchers: each failure mode below is tied to a replayable run, so a claim is never a sentence without evidence. For teams: the headline isn't the pass rate but which colour a miss is: an honest capability gap (green/amber) versus a reward-hack or broken task (red), assigned automatically and auditable per run. Labels link to their definitions in the glossary.

This build at a glance, computed from the local control plane
trials: 674
tasks: 70
categories: 14
agents under test: 2
scored trials: 674
resolved (pass@1): 308/674 · 45.7%
Outcome distribution: every scored trial sorted into the 5-way taxonomy (674 trials)

How the two counts reconcile: pass@1 (45.7%, 308/674) counts every verifier pass. Of those 308, 301 are classified GOOD_SUCCESS (286) or BAD_SUCCESS (15); the remaining 7 are reward-1.0 trials whose post-hoc classifier failed, and they are held in HARNESS_ERROR rather than counted as clean solves.
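The reconciliation above can be checked mechanically. The following is a minimal sketch, not the page's actual pipeline: the counts are copied from this page, while the dictionary layout and the `share` helper are illustrative assumptions.

```python
# Counts copied from this page; the data layout is an illustrative assumption.
OUTCOMES = {
    "GOOD_SUCCESS": 286,
    "BAD_SUCCESS": 15,
    "GOOD_FAILURE": 254,
    "BAD_FAILURE": 82,
    "HARNESS_ERROR": 37,
}
VERIFIER_PASSES = 308             # every reward-1.0 trial (pass@1 numerator)
PASSES_HELD_AS_HARNESS_ERROR = 7  # reward 1.0, but the post-hoc classifier failed

total = sum(OUTCOMES.values())
classified_successes = OUTCOMES["GOOD_SUCCESS"] + OUTCOMES["BAD_SUCCESS"]

# The five labels partition the scored trials.
assert total == 674
# Every verifier pass is either a classified success or a held harness error.
assert classified_successes + PASSES_HELD_AS_HARNESS_ERROR == VERIFIER_PASSES


def share(count, denominator=total):
    """Return count/denominator as a percentage rounded to one decimal place."""
    return round(100 * count / denominator, 1)


for label, count in OUTCOMES.items():
    print(f"{label:<14} {count:>3}  {share(count):.1f}%")
print(f"{'pass@1':<14} {VERIFIER_PASSES:>3}  {share(VERIFIER_PASSES):.1f}%")
```

If either assertion fails after new runs land, the label counts and the pass@1 figure have drifted apart and one of them is stale.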

Scored runs span 14 categories: Cloud Operations, Cyber Security, Data Science, Data Science: Robustness, Debugging, Electrical Engineering, Game, Mechanical Engineering, ML Engineering, Pharmacometrics, Product Data Science, Scientific ML, STEM, Software Engineering.

Failure-mode taxonomy: each label, with this build's live counts and a replayable run

Legitimate solves: earned, not gamed

observed, real evidence

GOOD_SUCCESS is the honest win: the agent implemented the asked-for behaviour and the task's own withheld verifier confirmed it. 286 of 674 scored trials (42.4%) land here; every one cleared the hidden test suite the agent never saw, with no reward-hack flagged by the post-hoc audit. This is the baseline the other four labels are measured against.

The implementation passed the withheld grader on inputs the agent never saw: a clean solve under the isolation contract, not a pass against a visible smoke test.
GOOD_SUCCESS: 286/674 scored trials · e.g. claude-code on cg-solver

Honest misses: competent attempts that fell short

observed, real evidence

GOOD_FAILURE is the most common failure outcome here: 254 of 674 scored trials (37.7%). The agent ran correctly and produced a real attempt, but the withheld verifier rejected it: a genuine capability gap on a sound task, not a broken harness or an unfair grader. These are the runs whose trajectories are most worth reading. The reasoning is coherent right up to the bug that lost the reward, and because the answer key was withheld, the agent's own self-check is on trial too.

The agent declared victory after checking its output against its own (incorrect) expectation; the real reference was withheld, so calibration, not just code, decided the outcome.
GOOD_FAILURE: 254/674 scored trials · e.g. claude-code on diff-patch-engine

Reward-hacking: passing without solving

observed, 15 flagged across 12 tasks

BAD_SUCCESS is the category researchers most want quantified: did the agent earn its reward or game it? This build records 15 BAD_SUCCESS trials (2.2% of scored runs) across 12 tasks (Instruction-retire commit handshake, Serial bit-destuff framer, cg-solver, evanw-esbuild-4417, and 8 more). Each is a passing verifier result that the post-hoc audit (or the reward-consistency guard) judged illegitimate: a pass that cleared the letter of the tests without solving the task. Every one is published with its full trajectory so the call is auditable rather than asserted. The harness is built to surface exactly this, with a withheld grader, a withheld reference, hidden seed inputs, an anti-hardcoding check, and a guard that re-labels any passed-but-didn't-solve run as BAD_SUCCESS.

15 BAD_SUCCESS across 12 tasks: a verifier pass the audit would not let stand. The isolation design surfaces the reward-hack instead of scoring it as a clean win.
evidence: 15 BAD_SUCCESS in this build · Instruction-retire commit handshake, Serial bit-destuff framer, cg-solver, evanw-esbuild-4417, and 8 more

Bad tasks: the failure is the task's fault, not the agent's

observed, 82 flagged across 35 tasks

BAD_FAILURE flags the runs where the agent failed but the task is to blame: an underspecified or contradictory instruction, brittle or flaky tests, or a grader demanding behaviour the instruction never disclosed. This build records 82 such trials (12.2% of scored runs) across 35 tasks (59-fix-broken-cognito-m2m-httpapi-jwt-scope-gated, Multi-cycle signed divider with a start/valid handshake, Resynchronising serial byte receiver, apigw-http-api-jwt-authorizer-lambda-integration, and 31 more). They are a to-do list for the task authors, not a mark against the agent, and keeping them visible keeps the leaderboard honest by not charging a broken task to a model.

A failing trial is only as fair as its instruction. BAD_FAILURE is where a miss is charged to the task, not the model, so a flaky grader never lands on the leaderboard as a capability gap.
BAD_FAILURE: 82/674 scored trials · 59-fix-broken-cognito-m2m-httpapi-jwt-scope-gated, Multi-cycle signed divider with a start/valid handshake, Resynchronising serial byte receiver, apigw-http-api-jwt-authorizer-lambda-integration, and 31 more

Harness / infrastructure failures: the agent never finished

observed, 37 flagged across 22 tasks

HARNESS_ERROR protects the model from being blamed for the harness's (or the budget's) faults: it marks a run where the agent never produced a finished artifact, or where the post-hoc classifier itself failed. This build records 37 such trials (5.5% of scored runs) across 22 tasks. On the heaviest numerical and systems tasks, runs hit the agent-execution timeout mid-implementation; under the SyncVals contract these are HARNESS_ERROR / incomplete, not capability misses, and a clean capability metric excludes them rather than scoring them as GOOD_FAILURE. They remain a real signal about task difficulty and harness configuration.
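To make the exclusion concrete, here is a hedged sketch of one way a clean capability metric could be computed. This page does not define the metric's numerator or say whether other labels are also excluded, so taking GOOD_SUCCESS over all non-HARNESS_ERROR trials is an assumption, and `clean_solve_rate` is a hypothetical name.

```python
# Counts copied from this page.
SCORED = 674
HARNESS_ERROR = 37
GOOD_SUCCESS = 286


def clean_solve_rate(good_success, scored, harness_error):
    """GOOD_SUCCESS share of trials that were not HARNESS_ERROR.

    Assumed definition: the page only says HARNESS_ERROR trials are excluded,
    not what the numerator is.
    """
    completed = scored - harness_error
    if completed <= 0:
        raise ValueError("no completed trials to score")
    return good_success / completed


rate = clean_solve_rate(GOOD_SUCCESS, SCORED, HARNESS_ERROR)
print(f"{rate:.1%} of {SCORED - HARNESS_ERROR} completed trials")
```

Dropping the 37 HARNESS_ERROR trials shrinks the denominator from 674 to 637, so a timeout neither counts as a solve nor drags the rate down as a capability miss.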

The agent ran out of wall-clock, not out of ideas: a HARNESS_ERROR-class statement about the harness configuration, not the model.
HARNESS_ERROR: 37/674 scored trials

Reward reflects only whether the verifier passed; a pass is only as strong as that task's hidden tests, and a fail only as fair as its instruction. The five-way classification is model-produced explanatory metadata: it can be wrong and never alters reward, and the guard catches only outright label/reward contradictions. The counts above are this build's scored trials and shift as more runs land; a label with no run in this build is marked as such rather than linked to a dead example. The reward is still the ceiling: the evaluation is exactly as trustworthy as the tasks' withheld references and the breadth of their hidden inputs. See the full Methodology and limitations.
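What an outright label/reward contradiction could look like is sketched below. This is an illustration under stated assumptions, not the SyncVals guard itself: the `Trial` type, the `contradicts_reward` function, and the convention that a pass is exactly reward 1.0 are all hypothetical. HARNESS_ERROR is treated as compatible with either reward because this build holds 7 reward-1.0 trials under that label.

```python
from dataclasses import dataclass

SUCCESS_LABELS = {"GOOD_SUCCESS", "BAD_SUCCESS"}
FAILURE_LABELS = {"GOOD_FAILURE", "BAD_FAILURE"}


@dataclass(frozen=True)
class Trial:
    """Hypothetical record: reward comes from the verifier, label from the classifier."""
    reward: float
    label: str


def contradicts_reward(trial):
    """Return True when the label cannot be true given the verifier's reward.

    The reward is only read, never modified (assumes a pass is reward == 1.0).
    """
    passed = trial.reward == 1.0
    if trial.label in SUCCESS_LABELS:
        return not passed   # a *_SUCCESS label on a failed verifier run
    if trial.label in FAILURE_LABELS:
        return passed       # a *_FAILURE label on a passed verifier run
    return False            # HARNESS_ERROR may accompany either reward


trials = [
    Trial(1.0, "GOOD_SUCCESS"),
    Trial(1.0, "GOOD_FAILURE"),   # contradiction: passed but labelled a failure
    Trial(0.0, "BAD_SUCCESS"),    # contradiction: failed but labelled a success
    Trial(1.0, "HARNESS_ERROR"),
]
flagged = [t for t in trials if contradicts_reward(t)]
print(len(flagged))
```

A check of this shape cannot tell a GOOD_SUCCESS from a BAD_SUCCESS, since both are consistent with a pass; that distinction rests on the post-hoc audit, which is why the labels are described as fallible metadata.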

Every verdict on this page is post-hoc and explanatory; the verifier alone set each reward.