SyncVals: verifier → artifact → classifier → verdict
SyncVals · Task

cg-solver

STEM · 28/40 resolved
pass@1: 70.0%
pass@k: 100.0% @40
resolved: 28/40
dominant: GOOD_SUCCESS
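The page does not state how these figures are derived. As a rough sketch, assuming the commonly used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k) with n recorded runs and c resolved runs, the reported values can be reproduced from the 28/40 count. The function name `pass_at_k` and the choice of estimator are assumptions for illustration, not something the dashboard confirms.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k as 1 - C(n - c, k) / C(n, k).

    n: total recorded runs, c: resolved runs, k: sample size.
    Assumed estimator; the dashboard may compute this differently.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # math.comb returns 0 when k exceeds n - c, i.e. when every
    # sample of size k must contain at least one resolved run.
    return 1.0 - comb(n - c, k) / comb(n, k)


n_runs, n_resolved = 40, 28  # counts taken from the summary above

print(f"pass@1  = {pass_at_k(n_runs, n_resolved, 1):.1%}")
print(f"pass@40 = {pass_at_k(n_runs, n_resolved, 40):.1%}")
```

Under this assumption, pass@1 reduces to the plain success rate 28/40, and pass@40 reaches 100% because only 12 runs failed, so any sample of 40 runs includes a resolved one.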
During each run the agent saw only the starter workspace; tests/ and solution/ were withheld and restored only for grading. The solution/ answer key below is sealed to keep this task usable as a benchmark.
Runs (40): every recorded attempt at this task
| Agent | Model | Reward | Tools | Classification |
| --- | --- | --- | --- | --- |
| claude-code | claude-opus-4-8 | ✗ failed | 11 | GOOD_FAILURE |
| claude-code | claude-opus-4-8 | ✗ failed | 13 | BAD_FAILURE |
| claude-code | claude-opus-4-8 | ✗ failed | 14 | GOOD_FAILURE |
| claude-code | claude-opus-4-8 | ✓ resolved | 13 | GOOD_SUCCESS |
| claude-code | claude-opus-4-8 | ✓ resolved | 17 | GOOD_SUCCESS |
| claude-code | claude-opus-4-8 | ✗ failed | 11 | GOOD_FAILURE |
| claude-code | claude-opus-4-8 | ✓ resolved | 16 | GOOD_SUCCESS |
| claude-code | claude-opus-4-8 | ✗ failed | 14 | BAD_FAILURE |
| claude-code | claude-opus-4-8 | ✓ resolved | 14 | GOOD_SUCCESS |
| claude-code | claude-opus-4-8 | ✓ resolved | 14 | GOOD_SUCCESS |
| claude-code | claude-opus-4-8 | ✓ resolved | 15 | GOOD_SUCCESS |
| claude-code | claude-opus-4-8 | ✗ failed | 16 | BAD_FAILURE |
| claude-code | claude-opus-4-8 | ✗ failed | 12 | GOOD_FAILURE |
| claude-code | claude-opus-4-8 | ✗ failed | 14 | GOOD_FAILURE |
| claude-code | claude-opus-4-8 | ✗ failed | 12 | BAD_FAILURE |
| claude-code | claude-opus-4-8 | ✓ resolved | 15 | GOOD_SUCCESS |
| claude-code | claude-opus-4-8 | ✓ resolved | 12 | GOOD_SUCCESS |
| claude-code | claude-opus-4-8 | ✓ resolved | 16 | GOOD_SUCCESS |
| claude-code | claude-opus-4-8 | ✓ resolved | 10 | GOOD_SUCCESS |
| claude-code | claude-opus-4-8 | ✗ failed | 12 | GOOD_FAILURE |
| claude-code | claude-opus-4-8 | ✓ resolved | 15 | GOOD_SUCCESS |
| claude-code | claude-opus-4-8 | ✓ resolved | 16 | GOOD_SUCCESS |
| claude-code | claude-opus-4-8 | ✗ failed | 12 | BAD_FAILURE |
| claude-code | claude-opus-4-8 | ✓ resolved | 10 | BAD_SUCCESS |
| claude-code | claude-opus-4-8 | ✓ resolved | 15 | GOOD_SUCCESS |
| claude-code | claude-opus-4-8 | ✓ resolved | 16 | GOOD_SUCCESS |
| claude-code | claude-opus-4-8 | ✗ failed | 12 | GOOD_FAILURE |
| claude-code | claude-opus-4-8 | ✓ resolved | 12 | GOOD_SUCCESS |
| claude-code | claude-opus-4-8 | ✓ resolved | 11 | GOOD_SUCCESS |
| claude-code | claude-opus-4-8 | ✓ resolved | 12 | GOOD_SUCCESS |
| claude-code | claude-opus-4-8 | ✓ resolved | 15 | GOOD_SUCCESS |
| claude-code | claude-opus-4-8 | ✓ resolved | 14 | GOOD_SUCCESS |
| claude-code | claude-opus-4-8 | ✓ resolved | 16 | GOOD_SUCCESS |
| claude-code | claude-opus-4-8 | ✓ resolved | 17 | GOOD_SUCCESS |
| claude-code | claude-opus-4-8 | ✓ resolved | 16 | GOOD_SUCCESS |
| claude-code | claude-opus-4-8 | ✓ resolved | 16 | GOOD_SUCCESS |
| claude-code | claude-opus-4-8 | ✓ resolved | 14 | GOOD_SUCCESS |
| claude-code | claude-opus-4-8 | ✓ resolved | 10 | GOOD_SUCCESS |
| claude-code | claude-opus-4-8 | ✓ resolved | 16 | GOOD_SUCCESS |
| claude-code | claude-opus-4-8 | ✓ resolved | 12 | GOOD_SUCCESS |
Task files: browse the workspace, grader & sealed answer key

The task as the agent saw it and the verifier graded it. Files under solution/ are sealed.