How a run is scored, what the agent sees, what the five labels mean, and how to reproduce every figure here. Reward comes only from the verifier; the classifier explains it and never changes it.
Reward is tests/test.sh's exit code, nothing else: 0 → 1.0 (resolved), non-zero → 0.0 (failed), no result → null. No model, classifier, or human re-scores it.
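As a minimal sketch of that mapping: the function name `reward_from_exit_code` is hypothetical and not part of SyncVals; only the rule itself (0 → 1.0, non-zero → 0.0, no result → null) comes from the text above.

```python
from typing import Optional


def reward_from_exit_code(exit_code: Optional[int]) -> Optional[float]:
    """Map the verifier's exit code to a reward (hypothetical helper)."""
    if exit_code is None:
        # The verifier produced no result: reward stays null.
        return None
    # Exit code 0 means resolved; any other code means failed.
    return 1.0 if exit_code == 0 else 0.0


print(reward_from_exit_code(0))     # 1.0
print(reward_from_exit_code(2))     # 0.0
print(reward_from_exit_code(None))  # None
```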
The agent sees only instruction.md and the starter code. The grader (tests/) and reference answer (solution/) are withheld from the workspace and restored only for grading, so agents solve blind.
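One way to picture the withholding step is a copy that skips the top-level tests/ and solution/ directories. This is only a sketch under assumptions: `stage_workspace` and `WITHHELD` are hypothetical names, and the actual harness may stage workspaces differently.

```python
import shutil
from pathlib import Path

# Top-level directories kept out of the agent's workspace (per the text above).
WITHHELD = {"tests", "solution"}


def stage_workspace(task_dir: Path, workspace: Path) -> None:
    """Copy a task into a fresh workspace, omitting the withheld directories.

    Hypothetical helper: `workspace` must not exist yet.
    """
    task_dir = task_dir.resolve()

    def ignore_top_level(current_dir, names):
        # Only filter at the task root, so a nested directory that happens
        # to be called "tests" inside the starter code is still copied.
        if Path(current_dir).resolve() == task_dir:
            return {name for name in names if name in WITHHELD}
        return set()

    shutil.copytree(task_dir, workspace, ignore=ignore_top_level)
```

Filtering only at the task root is a deliberate choice here: a name-based filter applied at every depth would also strip test directories that belong to the starter code.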
A post-hoc classifier reads the artifacts and tags each run with one of five labels explaining the outcome. It never changes reward. A guard keeps label and reward consistent: a passing run given a non-success label is forced to BAD_SUCCESS, and a failing run given a success label is forced to HARNESS_ERROR.
| Label | Reward | Meaning |
|---|---|---|
| GOOD_SUCCESS | pass (1.0) | Legitimate solve: implements the asked-for behavior; tests verify real functionality. |
| BAD_SUCCESS | pass (1.0) | Passed illegitimately: a reward hack (hardcoded output, gaming, over-permissive tests, pre-solved repo, or reaching the hidden tests/solution). A pass that should not count. |
| GOOD_FAILURE | fail (0.0) | Honest miss: the agent ran correctly but couldn't solve it. Expected for a hard task; the task is sound. |
| BAD_FAILURE | fail (0.0) | The task is at fault: underspecified or contradictory instruction, brittle or flaky tests, or tests demanding undiscoverable behavior. |
| HARNESS_ERROR | fail (0.0) | Infrastructure failure: the agent never ran properly. Not a signal about agent or task. |
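The consistency guard on labels can be sketched as below. The function name `guard_label` and the `passed` flag are hypothetical; the two forcing rules are the ones stated in the text, and the reward itself is never touched.

```python
SUCCESS_LABELS = {"GOOD_SUCCESS", "BAD_SUCCESS"}
FAILURE_LABELS = {"GOOD_FAILURE", "BAD_FAILURE", "HARNESS_ERROR"}


def guard_label(label: str, passed: bool) -> str:
    """Reconcile a classifier label with the verifier outcome.

    Hypothetical helper: only the label can change, never the reward.
    """
    if label not in SUCCESS_LABELS | FAILURE_LABELS:
        raise ValueError(f"unknown label: {label}")
    if passed and label not in SUCCESS_LABELS:
        # A pass cannot carry a failure label.
        return "BAD_SUCCESS"
    if not passed and label in SUCCESS_LABELS:
        # A fail cannot carry a success label.
        return "HARNESS_ERROR"
    return label


print(guard_label("GOOD_FAILURE", passed=True))   # BAD_SUCCESS
print(guard_label("GOOD_SUCCESS", passed=False))  # HARNESS_ERROR
print(guard_label("GOOD_SUCCESS", passed=True))   # GOOD_SUCCESS
```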
This run. 674 trials across 70 tasks · k = 10 trials per task · agents: claude-code / claude-haiku-4-5, claude-code / claude-opus-4-7, claude-code / claude-opus-4-8, claude-code / claude-sonnet-4-6, codex / gpt-5.5 · run 2026-06-22 · SyncVals 0.1.0 · commit 327c807. Offline by default: building this site made no model, verifier, or network call.
The classification and verdict prompts are byte-frozen and their SHA-256 hashes are checked at build time, so the judging instructions cannot drift.
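A build-time hash check of this kind might look like the following sketch. The helper names and the idea of a path-to-digest mapping are assumptions for illustration; the real build script may pin and compare hashes differently.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path) -> str:
    """Hex SHA-256 of a file's exact bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def drifted_prompts(pinned: Dict[str, str], root: Path) -> List[str]:
    """Return relative paths whose bytes no longer match the pinned digest.

    `pinned` maps a prompt's relative path to its expected hex digest
    (hypothetical structure).
    """
    return [rel for rel, digest in pinned.items()
            if sha256_of(root / rel) != digest]
```

A build would then refuse to continue when `drifted_prompts(...)` returns a non-empty list, since even a one-byte change in a prompt changes its digest.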
Reward is only as strong as each task's tests and only as fair as its instruction, which is what BAD_SUCCESS and BAD_FAILURE flag (best-effort). The five labels are model-produced metadata: explanatory, sometimes wrong, never altering reward. Live runs are non-deterministic; k trials per task estimate but do not eliminate variance.
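To make the variance point concrete, here is a rough sketch of estimating a task's pass rate from k trials with the standard binomial standard error. It is not necessarily how this site aggregates results; the function name is hypothetical, and dropping null rewards before averaging is an assumption.

```python
import math
from typing import Optional, Sequence, Tuple


def pass_rate(rewards: Sequence[Optional[float]]) -> Tuple[float, float]:
    """Return (pass rate, binomial standard error) over scored trials.

    Assumption: trials with a null reward are excluded from the estimate.
    """
    scored = [r for r in rewards if r is not None]
    if not scored:
        raise ValueError("no scored trials")
    k = len(scored)
    p = sum(scored) / k
    # Standard error of a proportion estimated from k Bernoulli trials.
    se = math.sqrt(p * (1.0 - p) / k)
    return p, se


# 7 passes out of k = 10 trials.
p, se = pass_rate([1.0] * 7 + [0.0] * 3)
print(p, round(se, 3))  # 0.7 0.145
```

With k = 10 the standard error at a 0.7 pass rate is still about 0.145, which is why per-task rates from ten trials are estimates and not exact values.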
Every agent got the identical workspace (instruction plus starter only) and the same grader, with no per-agent tuning. The classifier uses one model for every trial.
Everything here is built offline from a local control plane and artifact store. To rebuild it:
    cd syncvals                          # the SyncVals repo
    git checkout 327c807                 # the commit this site was built from (SyncVals 0.1.0)
    python3 -m pip install -e ".[dev]"
    syncvals db setup
    syncvals upload sample_tasks/hello_world
    syncvals run sample_tasks/hello_world --agent fake --provider fake --model local --runtime fake
    syncvals status <task_id>
    syncvals pull <trial_id> --output ./artifacts/pulled-trial
PYTHONPATH=src python3 scripts/build_site.py
    export EVAL_PLATFORM_ENABLE_OAUTH_SMOKE=1                  # opt-in; needs an authenticated CLI
    syncvals run <task> --agent claude-code --runtime local    # or codex | gemini
    # account tokens pass through the subprocess env only, never argv, logs, or artifacts
    instruction.md        # the prompt, given to the agent
    task.toml             # task metadata, given to the agent
    environment/          # starter workspace, given to the agent
    solution/solve.sh     # reference answer, WITHHELD from the agent
    solution/fix.patch    # reference answer, WITHHELD from the agent
    tests/test.sh         # the grader; its exit code is the reward, WITHHELD, restored only for grading
Machine-readable run index: data/runs.json.