SyncVals · Methodology

Methodology

How a run is scored, what the agent sees, what the five labels mean, and how to reproduce every figure here. Reward comes only from the verifier; the classifier explains it and never changes it.

How to read this

How a run is scored.

Reward is tests/test.sh's exit code, nothing else: 0 → 1.0 (resolved), non-zero → 0.0 (failed), no result → null. No model, classifier, or human re-scores it.
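The rule above can be sketched as a single mapping. This is a minimal illustration, not the SyncVals source: the function name `reward_from_exit_code` is hypothetical, and "no result" is represented here as `None`.

```python
from typing import Optional


def reward_from_exit_code(exit_code: Optional[int]) -> Optional[float]:
    """Map the verifier's exit code to a reward.

    Hypothetical helper: None stands for "the verifier produced no result".
    """
    if exit_code is None:
        return None  # no result -> null reward
    # Exit code 0 means resolved; any non-zero code means failed.
    return 1.0 if exit_code == 0 else 0.0
```

Nothing in this mapping consults a label or a model output, which is the point of the rule.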

What the agent sees.

Only instruction.md and the starter code. The grader (tests/) and reference answer (solution/) are withheld from the workspace and restored only for grading, so agents solve blind.

What the labels mean.

A post-hoc classifier reads the artifacts and tags each run with one of five labels explaining the outcome. It never changes reward. A guard keeps label and reward consistent: a passing run tagged with a non-success label is forced to BAD_SUCCESS, and a failing run tagged with a success label is forced to HARNESS_ERROR.

Label	Reward	Meaning
GOOD_SUCCESS	pass (1.0)	Legitimate solve: implements the asked-for behavior; tests verify real functionality.
BAD_SUCCESS	pass (1.0)	Passed illegitimately: a reward hack (hardcoded output, gaming, over-permissive tests, pre-solved repo, or reaching the hidden tests/solution). A pass that should not count.
GOOD_FAILURE	fail (0.0)	Honest miss: the agent ran correctly but couldn't solve it. Expected for a hard task; the task is sound.
BAD_FAILURE	fail (0.0)	The task is at fault: underspecified or contradictory instruction, brittle or flaky tests, or tests demanding undiscoverable behavior.
HARNESS_ERROR	fail (0.0)	Infrastructure failure: the agent never ran properly. Not a signal about agent or task.
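The reward-consistency guard described above might look roughly like the sketch below. The function name `guard_label` and the label sets are illustrative assumptions, not the actual `syncvals.guard.v1` code. The source does not say what the guard does for a null reward, so this sketch leaves the label unchanged in that case.

```python
from typing import Optional

SUCCESS_LABELS = {"GOOD_SUCCESS", "BAD_SUCCESS"}
FAILURE_LABELS = {"GOOD_FAILURE", "BAD_FAILURE", "HARNESS_ERROR"}


def guard_label(reward: Optional[float], label: str) -> str:
    """Return a label consistent with the reward; the reward is never touched."""
    if label not in SUCCESS_LABELS | FAILURE_LABELS:
        raise ValueError(f"unknown label: {label}")
    if reward == 1.0 and label not in SUCCESS_LABELS:
        return "BAD_SUCCESS"  # passing run cannot carry a failure label
    if reward == 0.0 and label in SUCCESS_LABELS:
        return "HARNESS_ERROR"  # failing run cannot carry a success label
    # Consistent already, or reward is null (assumption: left as-is).
    return label
```

The guard only ever rewrites the label, so the verifier remains the sole source of reward.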

This run. 674 trials across 70 tasks · k = 10 trials per task · agents: claude-code / claude-haiku-4-5, claude-code / claude-opus-4-7, claude-code / claude-opus-4-8, claude-code / claude-sonnet-4-6, codex / gpt-5.5 · run 2026-06-22 · SyncVals 0.1.0 · commit 327c807. Offline by default: building this site made no model, verifier, or network calls.

Glossary

pass@1: The fraction of trials that resolved (reward 1.0). A per-attempt success rate.
pass@k: The fraction of tasks with at least one resolved trial across their k attempts. A per-task solvability rate.
GOOD_SUCCESS: Legitimate solve. Implements the asked-for behavior; tests verify real functionality.
BAD_SUCCESS: Passed illegitimately. A reward hack (hardcoded output, gaming, over-permissive tests, pre-solved repo, or reaching the hidden tests/solution). A pass that should not count.
GOOD_FAILURE: Honest miss. The agent ran correctly but couldn't solve it. Expected for a hard task; the task is sound.
BAD_FAILURE: The task is at fault. Underspecified or contradictory instruction, brittle or flaky tests, or tests demanding undiscoverable behavior.
HARNESS_ERROR: Infrastructure failure. The agent never ran properly. Not a signal about agent or task.
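As an illustration of the two rates defined in this glossary, the sketch below computes both from a flat list of (task, reward) pairs. The function name `pass_rates` and the input shape are assumptions made for the example. It also assumes that a null reward counts as an unresolved trial, which the definitions above do not state explicitly. Note that pass@k here is the plain "at least one resolved trial" fraction as defined on this page, not a statistical estimator.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple


def pass_rates(trials: Iterable[Tuple[str, Optional[float]]]) -> Tuple[float, float]:
    """Return (pass@1, pass@k) for (task_id, reward) pairs."""
    resolved_by_task: Dict[str, List[bool]] = defaultdict(list)
    for task_id, reward in trials:
        # Assumption: only reward == 1.0 counts as resolved; 0.0 and None do not.
        resolved_by_task[task_id].append(reward == 1.0)

    total_trials = sum(len(flags) for flags in resolved_by_task.values())
    if total_trials == 0:
        raise ValueError("no trials given")

    # pass@1: resolved trials over all trials.
    pass_at_1 = sum(sum(flags) for flags in resolved_by_task.values()) / total_trials
    # pass@k: tasks with at least one resolved trial over all tasks.
    pass_at_k = sum(any(flags) for flags in resolved_by_task.values()) / len(resolved_by_task)
    return pass_at_1, pass_at_k
```

For example, two tasks with two trials each, where only one trial of the first task resolves, give pass@1 = 0.25 and pass@k = 0.5.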

Provenance & guarantees

The classification and verdict prompts are byte-frozen, their SHA-256 hashes checked at build time, so the judging instructions cannot drift. Building this site makes no model, verifier, or network call.

Artifact contract: p0.local.1

Classifier prompt SHA-256: 285f55992e5cea38c7c57501ff94165de1ac5a1f6cbb90079c49e01a7f668bd1

Verdict prompt SHA-256: f055ea4c712ca9f01eda52096c3fac9fc747dcf2a8aa9d480a6c053152f44225

Classification schema: syncvals.classifier.v1 · 496294fbbe32fae5e573f4dc91e636c1c9fc8fbbc85972517dfe1344b1370baf

Reward-consistency guard: syncvals.guard.v1

SyncVals version: 0.1.0

Build commit: 327c807

Run date: 2026-06-22
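A build-time check of a byte-frozen prompt against its recorded SHA-256 could be sketched as follows. The helper names are hypothetical and the real build script may be organized differently; only the standard-library `hashlib` calls are taken as given.

```python
import hashlib
from pathlib import Path


def matches_frozen_hash(data: bytes, expected_hex: str) -> bool:
    """True if the SHA-256 of data equals the recorded hex digest."""
    return hashlib.sha256(data).hexdigest() == expected_hex.lower()


def require_frozen(path: Path, expected_hex: str) -> None:
    """Fail the build if the file's bytes no longer match the recorded digest."""
    if not matches_frozen_hash(path.read_bytes(), expected_hex):
        raise RuntimeError(f"{path} has drifted from its frozen SHA-256")
```

Hashing the raw bytes rather than decoded text means even a whitespace or line-ending change in a prompt is caught.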

Limitations

Reward is only as strong as each task's tests and only as fair as its instruction, which is what BAD_SUCCESS and BAD_FAILURE flag (best-effort). The five labels are model-produced metadata: explanatory, sometimes wrong, never altering reward. Live runs are non-deterministic; k trials per task estimate but do not eliminate variance.

Fairness

Every agent got the identical workspace (instruction plus starter only) and the same grader, with no per-agent tuning. The classifier uses one model for every trial.

Reproduce

Everything here is built offline from a local control plane and artifact store. To rebuild it:

1 · Install from the SyncVals repo (Python 3.12+)

shell

cd syncvals                   # the SyncVals repo
git checkout 327c807            # the commit this site was built from (SyncVals 0.1.0)
python3 -m pip install -e ".[dev]"

2 · Run a task end-to-end (offline)

shell

syncvals db setup
syncvals upload sample_tasks/hello_world
syncvals run sample_tasks/hello_world --agent fake --provider fake --model local --runtime fake
syncvals status <task_id>
syncvals pull <trial_id> --output ./artifacts/pulled-trial

3 · Build this site

shell

PYTHONPATH=src python3 scripts/build_site.py

Live coding agents (opt-in)

shell

export EVAL_PLATFORM_ENABLE_OAUTH_SMOKE=1   # opt-in; needs an authenticated CLI
syncvals run <task> --agent claude-code --runtime local   # or codex | gemini
# account tokens pass through the subprocess env only, never argv, logs, or artifacts

Task-directory contract

A task is a directory:

instruction.md            # the prompt, given to the agent
task.toml                 # task metadata, given to the agent
environment/              # starter workspace, given to the agent
solution/solve.sh         # reference answer, WITHHELD from the agent
solution/fix.patch        # reference answer, WITHHELD from the agent
tests/test.sh             # the grader; its exit code is the reward, WITHHELD, restored only for grading
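The withholding rule in this contract can be illustrated with a small staging sketch. The function names, the `WITHHELD` tuple, and the choice to copy every other top-level entry as-is are assumptions for the example; the actual SyncVals workspace layout may differ.

```python
import shutil
from pathlib import Path

# Top-level entries the agent must never see (per the contract above).
WITHHELD = ("tests", "solution")


def stage_agent_workspace(task_dir: Path, workspace: Path) -> None:
    """Copy a task into the agent workspace, leaving out tests/ and solution/."""
    workspace.mkdir(parents=True, exist_ok=True)
    for entry in task_dir.iterdir():
        if entry.name in WITHHELD:
            continue
        dest = workspace / entry.name
        if entry.is_dir():
            shutil.copytree(entry, dest)
        else:
            shutil.copy2(entry, dest)


def restore_for_grading(task_dir: Path, workspace: Path) -> None:
    """Put the withheld directories back once the agent has finished."""
    for name in WITHHELD:
        src = task_dir / name
        if src.is_dir():
            shutil.copytree(src, workspace / name, dirs_exist_ok=True)
```

Calling `stage_agent_workspace` before the agent runs and `restore_for_grading` only afterwards keeps the grader and reference answer out of reach during the solve.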

Machine-readable run index: data/runs.json.