SyncVals: verifier → artifact → classifier → verdict

diff-patch-engine

Software Engineering · 0/10 resolved
pass@1: 0.0%
pass@k: 0.0% @10
resolved: 0/10
dominant: GOOD_FAILURE
During each run the agent saw only the starter workspace; tests/ and solution/ were withheld and restored only for grading. The solution/ answer key below is sealed to keep this task usable as a benchmark.
Runs (10): every recorded attempt at this task
Agent        Model            Reward    Tools  Classification
claude-code  claude-opus-4-8  ✗ failed  34     GOOD_FAILURE
claude-code  claude-opus-4-8  ✗ failed  47     GOOD_FAILURE
claude-code  claude-opus-4-8  ✗ failed  25     GOOD_FAILURE
claude-code  claude-opus-4-8  ✗ failed  19     GOOD_FAILURE
claude-code  claude-opus-4-8  ✗ failed  20     GOOD_FAILURE
claude-code  claude-opus-4-8  ✗ failed  26     GOOD_FAILURE
claude-code  claude-opus-4-8  ✗ failed  49     GOOD_FAILURE
claude-code  claude-opus-4-8  ✗ failed  21     GOOD_FAILURE
claude-code  claude-opus-4-8  ✗ failed  31     GOOD_FAILURE
claude-code  claude-opus-4-8  ✗ failed  45     GOOD_FAILURE
Task files: browse the workspace, grader, and sealed answer key

The task as the agent saw it and the verifier graded it. Files under solution/ are sealed.