SyncVals · verifier → artifact → classifier → verdict

Tasks by category

Every task is a self-contained Harbor-shaped directory: a problem statement, an executable environment, a hidden reference solution, and a deterministic verifier. Browse by category, then open any task to read the full problem, the starter workspace the agent saw, the withheld grader, and every recorded run. Category status is derived from the data, not asserted.

Software Engineering · live

Algorithms, data structures, bug-fixes, and API/systems implementation, graded against hidden test suites the agent never sees.

8 task(s) · 49/85 resolved · 85 trials

Mechanical Engineering · live

Structural and numerical solvers (FD/FE, contact dynamics) in C++, checked against multi-binary hidden references.

8 task(s) · 28/80 resolved · 80 trials

Cloud Operations · live

Provision, diagnose, and guardrail cloud infrastructure (AWS), verified against the deployed state.

10 task(s) · 32/100 resolved · 100 trials

Electrical Engineering · live

RTL / digital-logic design checked under hardened simulation and formal-equivalence harnesses.

7 task(s) · 20/28 resolved · 28 trials

STEM · live

Quantitative reasoning across math, physics, and the natural sciences with deterministic, checkable answers.

5 task(s) · 59/81 resolved · 81 trials

Game · live

Interactive simulation and game-logic tasks graded on exact state transitions and rule fidelity.

10 task(s) · 46/105 resolved · 105 trials

Cyber Security · live

Vulnerability discovery, exploitation, and remediation verified against a concrete security objective.

9 task(s) · 24/80 resolved · 80 trials

Debugging · preview

Real-world bug-fix tasks distilled from open-source issues (esbuild, klauspost/compress, rust-lang/semver), graded against each project's own tests.

3 task(s) · 15 scored trial(s) · preview

ML Engineering · preview

Train and tune models to a target metric on real engineering datasets, graded on held-out performance against a solved-reward threshold.

1 task(s) · 10 scored trial(s) · preview

Scientific ML · live

Physics-informed and surrogate modeling: PDE forecasting and CFD/FEA prediction, graded on quantitative predictive accuracy.

3 task(s) · 10/30 resolved · 30 trials

Data Science · live

Modeling and statistical inference on real datasets (federated learning, event studies), graded against held-out ground truth.

2 task(s) · 5/20 resolved · 20 trials

Data Science: Robustness · live

Bias-correction and outlier-robust analysis in R: recover trustworthy estimates from messy data, checked against reference results.

2 task(s) · 8/20 resolved · 20 trials

Product Data Science · preview

Applied product analytics: causal impact and decision analysis on real product datasets (R).

1 task(s) · 10 scored trial(s) · preview

Pharmacometrics · preview

Population PK/PD modeling with nonlinear mixed-effects: fit drug-exposure models and recover the correct parameters.

1 task(s) · 10 scored trial(s) · preview

Outcome bar: GOOD SUCCESS · BAD SUCCESS · GOOD FAILURE · BAD FAILURE · HARNESS ERROR (green = honest, red = reward-hack / bad task, amber = harness/budget)
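
One way to read the legend and the "dominant verdict" column is sketched below. It is only an illustration under stated assumptions: the verdict-to-colour mapping is inferred from the legend (GOOD_* as honest, BAD_* as reward-hack or bad task, HARNESS_ERROR as harness/budget), "dominant" is taken to mean the most frequent verdict across a task's trials, and the function names and the sample verdict list are hypothetical rather than taken from the site.

```python
from collections import Counter

# Assumed mapping from verdict label to outcome-bar colour, inferred from the legend.
VERDICT_COLOUR = {
    "GOOD_SUCCESS": "green",   # honest pass
    "GOOD_FAILURE": "green",   # honest fail
    "BAD_SUCCESS": "red",      # reward-hack / bad task
    "BAD_FAILURE": "red",      # reward-hack / bad task
    "HARNESS_ERROR": "amber",  # harness or budget problem
}


def dominant_verdict(verdicts):
    """Return the most frequent verdict; ties go to the verdict seen first."""
    if not verdicts:
        raise ValueError("no verdicts recorded")
    return Counter(verdicts).most_common(1)[0][0]


def outcome_bar(verdicts):
    """Count trials per colour, always reporting all three colours."""
    counts = Counter(VERDICT_COLOUR[v] for v in verdicts)
    return {colour: counts.get(colour, 0) for colour in ("green", "red", "amber")}


# Hypothetical per-trial verdicts for one task (not real data from this page).
sample = ["GOOD_FAILURE"] * 6 + ["GOOD_SUCCESS"] * 3 + ["HARNESS_ERROR"]
print(dominant_verdict(sample))
print(outcome_bar(sample))
```

Under these assumptions a task can show a success verdict as dominant while resolving fewer than half of its trials, because the non-resolved trials may be split across several verdicts.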
Software Engineering · live · 8 graded · 49/85 resolved · 85 trials

Algorithms, data structures, bug-fixes, and API/systems implementation, graded against hidden test suites the agent never sees.

| Task | #runs | resolved k/n | pass@1 | dominant verdict |
| --- | --- | --- | --- | --- |
| diff-patch-engine | 10 | 0/10 | 0.0% | GOOD_FAILURE |
| idempotency-middleware | 10 | 4/10 | 40.0% | GOOD_FAILURE |
| lru-cache | 15 | 14/15 | 93.3% | GOOD_SUCCESS |
| occ-conditional-store | 10 | 9/10 | 90.0% | GOOD_SUCCESS |
| rate-limiter | 10 | 2/10 | 20.0% | BAD_FAILURE |
| resilient-http-client | 10 | 0/10 | 0.0% | GOOD_FAILURE |
| session-token-verify | 10 | 10/10 | 100.0% | GOOD_SUCCESS |
| window-aggregate-store | 10 | 10/10 | 100.0% | GOOD_SUCCESS |
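
The pass@1 column and the category totals appear to be simple ratios and sums over the per-task rows. The sketch below reproduces them for the Software Engineering table, assuming pass@1 is the resolved fraction k/n shown as a percentage with one decimal (which matches every row listed) and that the category header is the sum over its tasks. The helper names are hypothetical.

```python
# (task, runs, resolved) rows copied from the Software Engineering table.
ROWS = [
    ("diff-patch-engine", 10, 0),
    ("idempotency-middleware", 10, 4),
    ("lru-cache", 15, 14),
    ("occ-conditional-store", 10, 9),
    ("rate-limiter", 10, 2),
    ("resilient-http-client", 10, 0),
    ("session-token-verify", 10, 10),
    ("window-aggregate-store", 10, 10),
]


def pass_at_1(resolved, runs):
    """Assumed definition: resolved trials over total trials, as a percentage."""
    return round(100.0 * resolved / runs, 1)


def category_totals(rows):
    """Sum resolved trials and total trials over all tasks in a category."""
    resolved = sum(k for _, _, k in rows)
    runs = sum(n for _, n, _ in rows)
    return resolved, runs


for task, runs, resolved in ROWS:
    print(f"{task}: {resolved}/{runs} = {pass_at_1(resolved, runs)}%")

resolved, runs = category_totals(ROWS)
print(f"category: {resolved}/{runs} resolved")  # 49/85, as in the header
```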
Mechanical Engineering · live · 8 graded · 28/80 resolved · 80 trials

Structural and numerical solvers (FD/FE, contact dynamics) in C++, checked against multi-binary hidden references.

| Task | #runs | resolved k/n | pass@1 | dominant verdict |
| --- | --- | --- | --- | --- |
| beam-deflection-solver | 10 | 3/10 | 30.0% | GOOD_FAILURE |
| collision2d-impulse-solver | 10 | 0/10 | 0.0% | GOOD_FAILURE |
| heat1d-conduction-solver | 10 | 3/10 | 30.0% | GOOD_FAILURE |
| pipeflow-colebrook-solver | 10 | 4/10 | 40.0% | GOOD_FAILURE |
| projectile-drag-integrator | 10 | 2/10 | 20.0% | GOOD_FAILURE |
| quaternion-rotation-integrator | 10 | 0/10 | 0.0% | GOOD_FAILURE |
| rk4-orbit-integrator | 10 | 10/10 | 100.0% | GOOD_SUCCESS |
| truss2d-solver | 10 | 6/10 | 60.0% | GOOD_SUCCESS |
Cloud Operations · live · 10 graded · 32/100 resolved · 100 trials

Provision, diagnose, and guardrail cloud infrastructure (AWS), verified against the deployed state.

| Task | #runs | resolved k/n | pass@1 | dominant verdict |
| --- | --- | --- | --- | --- |
| apigw-sqs-fifo-direct-integration | 10 | 2/10 | 20.0% | GOOD_FAILURE |
| ddb-outbox-eventbridge-fanout | 10 | 4/10 | 40.0% | GOOD_FAILURE |
| iam-cross-account-externalid-sourcearn | 10 | 3/10 | 30.0% | BAD_FAILURE |
| iam-revoke-older-sessions | 10 | 4/10 | 40.0% | GOOD_SUCCESS |
| iam-session-tag-tenant-scope | 10 | 3/10 | 30.0% | GOOD_FAILURE |
| s3-lambda-ddb-pipeline | 10 | 2/10 | 20.0% | HARNESS_ERROR |
| s3-sqs-image-pipeline-kms | 10 | 2/10 | 20.0% | GOOD_FAILURE |
| secrets-rotation-kms | 10 | 6/10 | 60.0% | GOOD_SUCCESS |
| sfn-saga-compensation-orchestrator | 10 | 5/10 | 50.0% | GOOD_SUCCESS |
| sfn-secrets-rotation-chain | 10 | 1/10 | 10.0% | GOOD_FAILURE |
Electrical Engineering · live · 7 graded · 20/28 resolved · 28 trials

RTL / digital-logic design checked under hardened simulation and formal-equivalence harnesses.

| Task | #runs | resolved k/n | pass@1 | dominant verdict |
| --- | --- | --- | --- | --- |
| Enable-gated streaming fold stage | 3 | 3/3 | 100.0% | GOOD_SUCCESS |
| Instruction-retire commit handshake | 3 | 3/3 | 100.0% | BAD_SUCCESS |
| Multi-cycle signed divider with a start/valid handshake | 3 | 0/3 | 0.0% | GOOD_FAILURE |
| Resynchronising serial byte receiver | 3 | 0/3 | 0.0% | BAD_FAILURE |
| Serial bit-destuff framer | 3 | 3/3 | 100.0% | GOOD_SUCCESS |
| Wait-state register-file completer | 3 | 3/3 | 100.0% | GOOD_SUCCESS |
| open-drain-command-engine | 10 | 8/10 | 80.0% | GOOD_SUCCESS |
STEM · live · 5 graded · 59/81 resolved · 81 trials

Quantitative reasoning across math, physics, and the natural sciences with deterministic, checkable answers.

| Task | #runs | resolved k/n | pass@1 | dominant verdict |
| --- | --- | --- | --- | --- |
| adaptive-quadrature | 10 | 10/10 | 100.0% | GOOD_SUCCESS |
| anova-stats | 11 | 11/11 | 100.0% | GOOD_SUCCESS |
| cg-solver | 40 | 28/40 | 70.0% | GOOD_SUCCESS |
| cholesky-solver | 10 | 0/10 | 0.0% | GOOD_FAILURE |
| cubic-spline | 10 | 10/10 | 100.0% | GOOD_SUCCESS |
Game · live · 10 graded · 46/105 resolved · 105 trials

Interactive simulation and game-logic tasks graded on exact state transitions and rule fidelity.

| Task | #runs | resolved k/n | pass@1 | dominant verdict |
| --- | --- | --- | --- | --- |
| game-of-life-step | 15 | 15/15 | 100.0% | GOOD_SUCCESS |
| task_0026_codex_camera_shake_rig | 10 | 4/10 | 40.0% | HARNESS_ERROR |
| task_0040_day_night_cycle_controller | 10 | 1/10 | 10.0% | GOOD_FAILURE |
| task_0053_car_scene_assembly | 10 | 6/10 | 60.0% | GOOD_SUCCESS |
| task_0054_assemble_crusader_animatedsprite | 10 | 4/10 | 40.0% | GOOD_FAILURE |
| task_0108_track_particles_driven_by_tile_type_gradient | 10 | 2/10 | 20.0% | HARNESS_ERROR |
| task_0131_minimap_ui_complex | 10 | 0/10 | 0.0% | GOOD_FAILURE |
| task_0132_minimap_marker_logic_complex | 10 | 4/10 | 40.0% | GOOD_FAILURE |
| task_9001_checkpoint_system | 10 | 4/10 | 40.0% | GOOD_FAILURE |
| task_9002_combo_score_system | 10 | 6/10 | 60.0% | GOOD_SUCCESS |
Cyber Security · live · 9 graded · 24/80 resolved · 80 trials

Vulnerability discovery, exploitation, and remediation verified against a concrete security objective.

| Task | #runs | resolved k/n | pass@1 | dominant verdict |
| --- | --- | --- | --- | --- |
| 59-fix-broken-cognito-m2m-httpapi-jwt-scope-gated | 10 | 1/10 | 10.0% | GOOD_FAILURE |
| 60-fix-broken-ecs-fargate-secrets-kms-exec-role | 7 | 6/7 | 85.7% | GOOD_SUCCESS |
| apigw-http-api-jwt-authorizer-lambda-integration | 10 | 3/10 | 30.0% | BAD_FAILURE |
| athena-workgroup-result-encryption-cmk-enforced | 5 | 2/5 | 40.0% | GOOD_FAILURE |
| ecr-image-scan-lifecycle-immutable-tags-replication | 9 | 3/9 | 33.3% | GOOD_FAILURE |
| efs-access-point-posix-iam-mount-target | 10 | 3/10 | 30.0% | BAD_FAILURE |
| fix-broken-appsync-graphql-cognito-resolver-cache-leak | 10 | 1/10 | 10.0% | GOOD_FAILURE |
| glue-etl-catalog-security-configuration-kms | 9 | 3/9 | 33.3% | GOOD_FAILURE |
| iam-permissions-boundary-ceiling | 10 | 2/10 | 20.0% | GOOD_FAILURE |
Debugging · preview · 3 graded · 15/15 resolved · 15 trials

Real-world bug-fix tasks distilled from open-source issues (esbuild, klauspost/compress, rust-lang/semver), graded against each project's own tests.

| Task | #runs | resolved k/n | pass@1 | dominant verdict |
| --- | --- | --- | --- | --- |
| evanw-esbuild-4417 | 5 | 5/5 | 100.0% | GOOD_SUCCESS |
| klauspost-compress-1115 | 5 | 5/5 | 100.0% | GOOD_SUCCESS |
| rust-lang-semver-305 | 5 | 5/5 | 100.0% | GOOD_SUCCESS |
ML Engineering · preview · 1 graded · 4/10 resolved · 10 trials

Train and tune models to a target metric on real engineering datasets, graded on held-out performance against a solved-reward threshold.

| Task | #runs | resolved k/n | pass@1 | dominant verdict |
| --- | --- | --- | --- | --- |
| airfoil-self-noise | 10 | 4/10 | 40.0% | GOOD_FAILURE |
Scientific ML · live · 3 graded · 10/30 resolved · 30 trials

Physics-informed and surrogate modeling: PDE forecasting and CFD/FEA prediction, graded on quantitative predictive accuracy.

| Task | #runs | resolved k/n | pass@1 | dominant verdict |
| --- | --- | --- | --- | --- |
| airfrans-high-reynolds-drag-extrapolation | 10 | 4/10 | 40.0% | GOOD_FAILURE |
| ks-equation-1d-forecast | 10 | 4/10 | 40.0% | GOOD_FAILURE |
| simjeb-bracket-fea-mass-prediction-real | 10 | 2/10 | 20.0% | GOOD_FAILURE |
Data Science · live · 2 graded · 5/20 resolved · 20 trials

Modeling and statistical inference on real datasets (federated learning, event studies), graded against held-out ground truth.

| Task | #runs | resolved k/n | pass@1 | dominant verdict |
| --- | --- | --- | --- | --- |
| fedavg-federated-noniid-mnist | 10 | 2/10 | 20.0% | GOOD_FAILURE |
| product-recall-stock-price-event | 10 | 3/10 | 30.0% | GOOD_FAILURE |
Data Science: Robustness · live · 2 graded · 8/20 resolved · 20 trials

Bias-correction and outlier-robust analysis in R: recover trustworthy estimates from messy data, checked against reference results.

| Task | #runs | resolved k/n | pass@1 | dominant verdict |
| --- | --- | --- | --- | --- |
| coffee-ratings-outliers | 10 | 4/10 | 40.0% | GOOD_FAILURE |
| lending-club-lgd-bias-correction-r | 10 | 4/10 | 40.0% | GOOD_FAILURE |
Product Data Science · preview · 1 graded · 4/10 resolved · 10 trials

Applied product analytics: causal impact and decision analysis on real product datasets (R).

| Task | #runs | resolved k/n | pass@1 | dominant verdict |
| --- | --- | --- | --- | --- |
| ipl-toss-impact-analysis-r | 10 | 4/10 | 40.0% | GOOD_SUCCESS |
Pharmacometrics · preview · 1 graded · 4/10 resolved · 10 trials

Population PK/PD modeling with nonlinear mixed-effects: fit drug-exposure models and recover the correct parameters.

| Task | #runs | resolved k/n | pass@1 | dominant verdict |
| --- | --- | --- | --- | --- |
| neonatal-drug-exposure-nlme | 10 | 4/10 | 40.0% | GOOD_SUCCESS |

Reward is each task's tests/test.sh exit code. The grader and reference are withheld from the agent's workspace and restored only for grading; see Methodology.
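
As a rough illustration of exit-code grading, the sketch below runs a task's tests/test.sh and turns the exit code into a binary reward. Only the tests/test.sh path comes from the text above. The rule that exit code 0 means resolved, the use of bash, the working directory, the timeout, and both function names are assumptions made for the example, not a description of the actual harness.

```python
import subprocess
from pathlib import Path


def reward_from_exit_code(exit_code: int) -> int:
    """Assumed convention: exit code 0 means resolved (reward 1), anything else 0."""
    return 1 if exit_code == 0 else 0


def grade(task_dir: str, timeout_s: int = 600) -> int:
    """Run <task_dir>/tests/test.sh and map its exit code to a reward.

    Hypothetical harness step: it assumes the withheld grader has already
    been restored into the task directory before this is called.
    """
    script = Path(task_dir) / "tests" / "test.sh"
    result = subprocess.run(["bash", str(script)], cwd=task_dir, timeout=timeout_s)
    return reward_from_exit_code(result.returncode)
```

Because the reward depends only on the script's exit code, the same workspace state graded twice should yield the same reward as long as the verifier itself is deterministic.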