SyncVals · Tasks

Tasks by category

Every task is a self-contained Harbor-shaped directory: a problem statement, an executable environment, a hidden reference solution, and a deterministic verifier. Browse by category, then open any task to read the full problem, the starter workspace the agent saw, the withheld grader, and every recorded run. Category status is derived from the data, not asserted.

Software Engineering · live

Algorithms, data structures, bug-fixes, and API/systems implementation, graded against hidden test suites the agent never sees.

8 tasks · 49/85 resolved · 85 trials

Mechanical Engineering · live

Structural and numerical solvers (FD/FE, contact dynamics) in C++, checked against multi-binary hidden references.

8 tasks · 28/80 resolved · 80 trials

Cloud Operations · live

Provision, diagnose, and guardrail cloud infrastructure (AWS), verified against the deployed state.

10 tasks · 32/100 resolved · 100 trials

Electrical Engineering · live

RTL / digital-logic design checked under hardened simulation and formal-equivalence harnesses.

7 tasks · 20/28 resolved · 28 trials

STEM · live

Quantitative reasoning across math, physics, and the natural sciences with deterministic, checkable answers.

5 tasks · 59/81 resolved · 81 trials

Game · live

Interactive simulation and game-logic tasks graded on exact state transitions and rule fidelity.

10 tasks · 46/105 resolved · 105 trials

Cyber Security · live

Vulnerability discovery, exploitation, and remediation verified against a concrete security objective.

9 tasks · 24/80 resolved · 80 trials

Debugging · preview

Real-world bug-fix tasks distilled from open-source issues (esbuild, klauspost/compress, rust-lang/semver), graded against each project's own tests.

3 tasks · 15 scored trials · preview

ML Engineering · preview

Train and tune models to a target metric on real engineering datasets, graded on held-out performance against a solved-reward threshold.

1 task · 10 scored trials · preview

Scientific ML · live

Physics-informed and surrogate modeling: PDE forecasting and CFD/FEA prediction, graded on quantitative predictive accuracy.

3 tasks · 10/30 resolved · 30 trials

Data Science · live

Modeling and statistical inference on real datasets (federated learning, event studies), graded against held-out ground truth.

2 tasks · 5/20 resolved · 20 trials

Data Science: Robustness · live

Bias-correction and outlier-robust analysis in R: recover trustworthy estimates from messy data, checked against reference results.

2 tasks · 8/20 resolved · 20 trials

Product Data Science · preview

Applied product analytics: causal impact and decision analysis on real product datasets (R).

1 task · 10 scored trials · preview

Pharmacometrics · preview

Population PK/PD modeling with nonlinear mixed-effects: fit drug-exposure models and recover the correct parameters.

1 task · 10 scored trials · preview

Outcome bar: GOOD SUCCESS · BAD SUCCESS · GOOD FAILURE · BAD FAILURE · HARNESS ERROR (green = honest, red = reward-hack / bad task, amber = harness/budget)
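The legend and the "dominant verdict" column in the tables can be read together with a small sketch. Two things here are assumptions, not stated on the page: the verdict-to-colour mapping is inferred from the legend wording, and "dominant verdict" is taken to mean the most frequent per-trial verdict. The trial list in the usage note is hypothetical.

```python
from collections import Counter

# Assumed mapping from verdict label to legend colour, inferred from
# "green = honest, red = reward-hack / bad task, amber = harness/budget".
VERDICT_COLOUR = {
    "GOOD_SUCCESS": "green",
    "GOOD_FAILURE": "green",
    "BAD_SUCCESS": "red",
    "BAD_FAILURE": "red",
    "HARNESS_ERROR": "amber",
}


def dominant_verdict(verdicts):
    """Return the most frequent verdict; ties go to the verdict seen first.

    Assumption: the page's "dominant verdict" is a plain plurality vote.
    """
    if not verdicts:
        raise ValueError("no trials recorded")
    return Counter(verdicts).most_common(1)[0][0]


# Hypothetical 10-trial task: 4 honest passes, 3 honest fails, 3 bad fails.
example = ["GOOD_SUCCESS"] * 4 + ["GOOD_FAILURE"] * 3 + ["BAD_FAILURE"] * 3
print(dominant_verdict(example), VERDICT_COLOUR[dominant_verdict(example)])
```

Under this reading, a task resolved in only 4 of 10 trials can still show GOOD_SUCCESS as its dominant verdict when the failures are split across several verdicts.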

Software Engineering · live · 8 graded · 49/85 resolved · 85 trials

Algorithms, data structures, bug-fixes, and API/systems implementation, graded against hidden test suites the agent never sees.

Task	#runs	resolved k/n	pass@1	dominant verdict
diff-patch-engine	10	0/10	0.0%	GOOD_FAILURE
idempotency-middleware	10	4/10	40.0%	GOOD_FAILURE
lru-cache	15	14/15	93.3%	GOOD_SUCCESS
occ-conditional-store	10	9/10	90.0%	GOOD_SUCCESS
rate-limiter	10	2/10	20.0%	BAD_FAILURE
resilient-http-client	10	0/10	0.0%	GOOD_FAILURE
session-token-verify	10	10/10	100.0%	GOOD_SUCCESS
window-aggregate-store	10	10/10	100.0%	GOOD_SUCCESS
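The pass@1 column is consistent with resolved k divided by runs n, and the category header is consistent with summing the rows. A minimal sketch of that roll-up, assuming this is how the figures are derived (the page does not give the formula); the function names are illustrative only:

```python
# (task, resolved k, runs n) copied from the Software Engineering table.
ROWS = [
    ("diff-patch-engine", 0, 10),
    ("idempotency-middleware", 4, 10),
    ("lru-cache", 14, 15),
    ("occ-conditional-store", 9, 10),
    ("rate-limiter", 2, 10),
    ("resilient-http-client", 0, 10),
    ("session-token-verify", 10, 10),
    ("window-aggregate-store", 10, 10),
]


def pass_at_1(k, n):
    """Share of trials resolved, as a percentage rounded to one decimal."""
    return round(100.0 * k / n, 1)


def roll_up(rows):
    """Return (task count, total resolved, total trials) for a category."""
    resolved = sum(k for _, k, _ in rows)
    trials = sum(n for _, _, n in rows)
    return len(rows), resolved, trials


for task, k, n in ROWS:
    print(f"{task}\t{n}\t{k}/{n}\t{pass_at_1(k, n):.1f}%")

print(roll_up(ROWS))  # (8, 49, 85), matching the category header
```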

Mechanical Engineering · live · 8 graded · 28/80 resolved · 80 trials

Structural and numerical solvers (FD/FE, contact dynamics) in C++, checked against multi-binary hidden references.

Task	#runs	resolved k/n	pass@1	dominant verdict
beam-deflection-solver	10	3/10	30.0%	GOOD_FAILURE
collision2d-impulse-solver	10	0/10	0.0%	GOOD_FAILURE
heat1d-conduction-solver	10	3/10	30.0%	GOOD_FAILURE
pipeflow-colebrook-solver	10	4/10	40.0%	GOOD_FAILURE
projectile-drag-integrator	10	2/10	20.0%	GOOD_FAILURE
quaternion-rotation-integrator	10	0/10	0.0%	GOOD_FAILURE
rk4-orbit-integrator	10	10/10	100.0%	GOOD_SUCCESS
truss2d-solver	10	6/10	60.0%	GOOD_SUCCESS

Cloud Operations · live · 10 graded · 32/100 resolved · 100 trials

Provision, diagnose, and guardrail cloud infrastructure (AWS), verified against the deployed state.

Task	#runs	resolved k/n	pass@1	dominant verdict
apigw-sqs-fifo-direct-integration	10	2/10	20.0%	GOOD_FAILURE
ddb-outbox-eventbridge-fanout	10	4/10	40.0%	GOOD_FAILURE
iam-cross-account-externalid-sourcearn	10	3/10	30.0%	BAD_FAILURE
iam-revoke-older-sessions	10	4/10	40.0%	GOOD_SUCCESS
iam-session-tag-tenant-scope	10	3/10	30.0%	GOOD_FAILURE
s3-lambda-ddb-pipeline	10	2/10	20.0%	HARNESS_ERROR
s3-sqs-image-pipeline-kms	10	2/10	20.0%	GOOD_FAILURE
secrets-rotation-kms	10	6/10	60.0%	GOOD_SUCCESS
sfn-saga-compensation-orchestrator	10	5/10	50.0%	GOOD_SUCCESS
sfn-secrets-rotation-chain	10	1/10	10.0%	GOOD_FAILURE

Electrical Engineering · live · 7 graded · 20/28 resolved · 28 trials

RTL / digital-logic design checked under hardened simulation and formal-equivalence harnesses.

Task	#runs	resolved k/n	pass@1	dominant verdict
Enable-gated streaming fold stage	3	3/3	100.0%	GOOD_SUCCESS
Instruction-retire commit handshake	3	3/3	100.0%	BAD_SUCCESS
Multi-cycle signed divider with a start/valid handshake	3	0/3	0.0%	GOOD_FAILURE
Resynchronising serial byte receiver	3	0/3	0.0%	BAD_FAILURE
Serial bit-destuff framer	3	3/3	100.0%	GOOD_SUCCESS
Wait-state register-file completer	3	3/3	100.0%	GOOD_SUCCESS
open-drain-command-engine	10	8/10	80.0%	GOOD_SUCCESS

STEM · live · 5 graded · 59/81 resolved · 81 trials

Quantitative reasoning across math, physics, and the natural sciences with deterministic, checkable answers.

Task	#runs	resolved k/n	pass@1	dominant verdict
adaptive-quadrature	10	10/10	100.0%	GOOD_SUCCESS
anova-stats	11	11/11	100.0%	GOOD_SUCCESS
cg-solver	40	28/40	70.0%	GOOD_SUCCESS
cholesky-solver	10	0/10	0.0%	GOOD_FAILURE
cubic-spline	10	10/10	100.0%	GOOD_SUCCESS

Game · live · 10 graded · 46/105 resolved · 105 trials

Interactive simulation and game-logic tasks graded on exact state transitions and rule fidelity.

Task	#runs	resolved k/n	pass@1	dominant verdict
game-of-life-step	15	15/15	100.0%	GOOD_SUCCESS
task_0026_codex_camera_shake_rig	10	4/10	40.0%	HARNESS_ERROR
task_0040_day_night_cycle_controller	10	1/10	10.0%	GOOD_FAILURE
task_0053_car_scene_assembly	10	6/10	60.0%	GOOD_SUCCESS
task_0054_assemble_crusader_animatedsprite	10	4/10	40.0%	GOOD_FAILURE
task_0108_track_particles_driven_by_tile_type_gradient	10	2/10	20.0%	HARNESS_ERROR
task_0131_minimap_ui_complex	10	0/10	0.0%	GOOD_FAILURE
task_0132_minimap_marker_logic_complex	10	4/10	40.0%	GOOD_FAILURE
task_9001_checkpoint_system	10	4/10	40.0%	GOOD_FAILURE
task_9002_combo_score_system	10	6/10	60.0%	GOOD_SUCCESS

Cyber Security · live · 9 graded · 24/80 resolved · 80 trials

Vulnerability discovery, exploitation, and remediation verified against a concrete security objective.

Task	#runs	resolved k/n	pass@1	dominant verdict
59-fix-broken-cognito-m2m-httpapi-jwt-scope-gated	10	1/10	10.0%	GOOD_FAILURE
60-fix-broken-ecs-fargate-secrets-kms-exec-role	7	6/7	85.7%	GOOD_SUCCESS
apigw-http-api-jwt-authorizer-lambda-integration	10	3/10	30.0%	BAD_FAILURE
athena-workgroup-result-encryption-cmk-enforced	5	2/5	40.0%	GOOD_FAILURE
ecr-image-scan-lifecycle-immutable-tags-replication	9	3/9	33.3%	GOOD_FAILURE
efs-access-point-posix-iam-mount-target	10	3/10	30.0%	BAD_FAILURE
fix-broken-appsync-graphql-cognito-resolver-cache-leak	10	1/10	10.0%	GOOD_FAILURE
glue-etl-catalog-security-configuration-kms	9	3/9	33.3%	GOOD_FAILURE
iam-permissions-boundary-ceiling	10	2/10	20.0%	GOOD_FAILURE

Debugging · preview · 3 graded · 15/15 resolved · 15 trials

Real-world bug-fix tasks distilled from open-source issues (esbuild, klauspost/compress, rust-lang/semver), graded against each project's own tests.

Task	#runs	resolved k/n	pass@1	dominant verdict
evanw-esbuild-4417	5	5/5	100.0%	GOOD_SUCCESS
klauspost-compress-1115	5	5/5	100.0%	GOOD_SUCCESS
rust-lang-semver-305	5	5/5	100.0%	GOOD_SUCCESS

ML Engineering · preview · 1 graded · 4/10 resolved · 10 trials

Train and tune models to a target metric on real engineering datasets, graded on held-out performance against a solved-reward threshold.

Task	#runs	resolved k/n	pass@1	dominant verdict
airfoil-self-noise	10	4/10	40.0%	GOOD_FAILURE

Scientific ML · live · 3 graded · 10/30 resolved · 30 trials

Physics-informed and surrogate modeling: PDE forecasting and CFD/FEA prediction, graded on quantitative predictive accuracy.

Task	#runs	resolved k/n	pass@1	dominant verdict
airfrans-high-reynolds-drag-extrapolation	10	4/10	40.0%	GOOD_FAILURE
ks-equation-1d-forecast	10	4/10	40.0%	GOOD_FAILURE
simjeb-bracket-fea-mass-prediction-real	10	2/10	20.0%	GOOD_FAILURE

Data Science · live · 2 graded · 5/20 resolved · 20 trials

Modeling and statistical inference on real datasets (federated learning, event studies), graded against held-out ground truth.

Task	#runs	resolved k/n	pass@1	dominant verdict
fedavg-federated-noniid-mnist	10	2/10	20.0%	GOOD_FAILURE
product-recall-stock-price-event	10	3/10	30.0%	GOOD_FAILURE

Data Science: Robustness · live · 2 graded · 8/20 resolved · 20 trials

Bias-correction and outlier-robust analysis in R: recover trustworthy estimates from messy data, checked against reference results.

Task	#runs	resolved k/n	pass@1	dominant verdict
coffee-ratings-outliers	10	4/10	40.0%	GOOD_FAILURE
lending-club-lgd-bias-correction-r	10	4/10	40.0%	GOOD_FAILURE

Product Data Science · preview · 1 graded · 4/10 resolved · 10 trials

Applied product analytics: causal impact and decision analysis on real product datasets (R).

Task	#runs	resolved k/n	pass@1	dominant verdict
ipl-toss-impact-analysis-r	10	4/10	40.0%	GOOD_SUCCESS

Pharmacometrics · preview · 1 graded · 4/10 resolved · 10 trials

Population PK/PD modeling with nonlinear mixed-effects: fit drug-exposure models and recover the correct parameters.

Task	#runs	resolved k/n	pass@1	dominant verdict
neonatal-drug-exposure-nlme	10	4/10	40.0%	GOOD_SUCCESS

Reward is each task's tests/test.sh exit code. The grader and reference solution are withheld from the agent's workspace and restored only for grading; see Methodology.
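How an exit code might become a reward can be sketched as follows. The mapping (exit code 0 scores 1.0, anything else 0.0) and the function names are assumptions for illustration; the page says only that reward comes from the tests/test.sh exit code, and the real harness may differ.

```python
import subprocess


def reward_from_exit_code(returncode):
    """Assumed mapping: exit code 0 scores 1.0, any other code scores 0.0."""
    return 1.0 if returncode == 0 else 0.0


def grade(task_dir):
    """Run tests/test.sh from inside task_dir and convert its exit code.

    Hypothetical helper: it assumes bash is available and that the grader
    has already been restored into task_dir/tests before this call.
    """
    result = subprocess.run(["bash", "tests/test.sh"], cwd=task_dir)
    return reward_from_exit_code(result.returncode)
```

A call such as `grade("path/to/task")` would then return 1.0 only when the verifier script exits cleanly; the path is a placeholder.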