Every task is a self-contained, Harbor-shaped directory: a problem statement, an executable environment, a hidden reference solution, and a deterministic verifier. Browse by category, then open any task to read the full problem, the starter workspace the agent saw, the withheld grader, and every recorded run. Category status is derived from the data, not asserted.
Algorithms, data structures, bug-fixes, and API/systems implementation, graded against hidden test suites the agent never sees.
Structural and numerical solvers (FD/FE, contact dynamics) in C++, checked against multi-binary hidden references.
Provision, diagnose, and guardrail cloud infrastructure (AWS), verified against the deployed state.
RTL / digital-logic design checked under hardened simulation and formal-equivalence harnesses.
Quantitative reasoning across math, physics, and the natural sciences with deterministic, checkable answers.
Interactive simulation and game-logic tasks graded on exact state transitions and rule fidelity.
Vulnerability discovery, exploitation, and remediation verified against a concrete security objective.
Real-world bug-fix tasks distilled from open-source issues (esbuild, klauspost/compress, rust-lang/semver), graded against each project's own tests.
Train and tune models to a target metric on real engineering datasets, graded on held-out performance against a solved-reward threshold.
Physics-informed and surrogate modeling: PDE forecasting and CFD/FEA prediction, graded on quantitative predictive accuracy.
Modeling and statistical inference on real datasets (federated learning, event studies), graded against held-out ground truth.
Bias-correction and outlier-robust analysis in R: recover trustworthy estimates from messy data, checked against reference results.
Applied product analytics: causal impact and decision analysis on real product datasets (R).
Population PK/PD modeling with nonlinear mixed-effects: fit drug-exposure models and recover the correct parameters.
Algorithms, data structures, bug-fixes, and API/systems implementation, graded against hidden test suites the agent never sees.
| Task | #runs | resolved k/n | pass@1 | dominant verdict | |
|---|---|---|---|---|---|
| diff-patch-engine | 10 | 0/10 | 0.0% | GOOD_FAILURE | view → |
| idempotency-middleware | 10 | 4/10 | 40.0% | GOOD_FAILURE | view → |
| lru-cache | 15 | 14/15 | 93.3% | GOOD_SUCCESS | view → |
| occ-conditional-store | 10 | 9/10 | 90.0% | GOOD_SUCCESS | view → |
| rate-limiter | 10 | 2/10 | 20.0% | BAD_FAILURE | view → |
| resilient-http-client | 10 | 0/10 | 0.0% | GOOD_FAILURE | view → |
| session-token-verify | 10 | 10/10 | 100.0% | GOOD_SUCCESS | view → |
| window-aggregate-store | 10 | 10/10 | 100.0% | GOOD_SUCCESS | view → |
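The pass@1 column in these tables matches the resolved fraction k/n expressed as a percentage and rounded to one decimal place. A minimal sketch of that arithmetic follows; the function name `pass_at_1` is hypothetical, and the rows are copied from the table above.

```python
def pass_at_1(resolved: int, runs: int) -> float:
    """Share of recorded runs that resolved the task, as a percentage."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    if not 0 <= resolved <= runs:
        raise ValueError("resolved must be between 0 and runs")
    return round(100.0 * resolved / runs, 1)


# (task, resolved k, runs n), copied from the table above.
rows = [
    ("lru-cache", 14, 15),
    ("idempotency-middleware", 4, 10),
    ("rate-limiter", 2, 10),
]

for task, k, n in rows:
    print(f"{task}: {k}/{n} -> {pass_at_1(k, n):.1f}%")
```

With a single recorded attempt per run, this is simply the empirical success rate, so a task with few runs (3 or 5) carries a much coarser estimate than one with 40.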
Structural and numerical solvers (FD/FE, contact dynamics) in C++, checked against multi-binary hidden references.
| Task | #runs | resolved k/n | pass@1 | dominant verdict | |
|---|---|---|---|---|---|
| beam-deflection-solver | 10 | 3/10 | 30.0% | GOOD_FAILURE | view → |
| collision2d-impulse-solver | 10 | 0/10 | 0.0% | GOOD_FAILURE | view → |
| heat1d-conduction-solver | 10 | 3/10 | 30.0% | GOOD_FAILURE | view → |
| pipeflow-colebrook-solver | 10 | 4/10 | 40.0% | GOOD_FAILURE | view → |
| projectile-drag-integrator | 10 | 2/10 | 20.0% | GOOD_FAILURE | view → |
| quaternion-rotation-integrator | 10 | 0/10 | 0.0% | GOOD_FAILURE | view → |
| rk4-orbit-integrator | 10 | 10/10 | 100.0% | GOOD_SUCCESS | view → |
| truss2d-solver | 10 | 6/10 | 60.0% | GOOD_SUCCESS | view → |
Provision, diagnose, and guardrail cloud infrastructure (AWS), verified against the deployed state.
| Task | #runs | resolved k/n | pass@1 | dominant verdict | |
|---|---|---|---|---|---|
| apigw-sqs-fifo-direct-integration | 10 | 2/10 | 20.0% | GOOD_FAILURE | view → |
| ddb-outbox-eventbridge-fanout | 10 | 4/10 | 40.0% | GOOD_FAILURE | view → |
| iam-cross-account-externalid-sourcearn | 10 | 3/10 | 30.0% | BAD_FAILURE | view → |
| iam-revoke-older-sessions | 10 | 4/10 | 40.0% | GOOD_SUCCESS | view → |
| iam-session-tag-tenant-scope | 10 | 3/10 | 30.0% | GOOD_FAILURE | view → |
| s3-lambda-ddb-pipeline | 10 | 2/10 | 20.0% | HARNESS_ERROR | view → |
| s3-sqs-image-pipeline-kms | 10 | 2/10 | 20.0% | GOOD_FAILURE | view → |
| secrets-rotation-kms | 10 | 6/10 | 60.0% | GOOD_SUCCESS | view → |
| sfn-saga-compensation-orchestrator | 10 | 5/10 | 50.0% | GOOD_SUCCESS | view → |
| sfn-secrets-rotation-chain | 10 | 1/10 | 10.0% | GOOD_FAILURE | view → |
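This page does not define "dominant verdict". One plausible reading is the most frequent per-run verdict, which would explain how a task can show GOOD_SUCCESS as dominant while resolving fewer than half of its runs (for example iam-revoke-older-sessions at 4/10). The sketch below assumes that reading. The function name, the tie-breaking rule, and the per-run verdict list are all hypothetical and are not taken from the recorded data.

```python
from collections import Counter
from typing import Sequence


def dominant_verdict(verdicts: Sequence[str]) -> str:
    """Return the most frequent verdict among a task's runs.

    Assumption: "dominant" means plurality, not majority.
    Ties go to the verdict that appears first in the input.
    """
    if not verdicts:
        raise ValueError("at least one run verdict is required")
    # most_common orders equal counts by first appearance.
    return Counter(verdicts).most_common(1)[0][0]


# Hypothetical per-run verdicts for a task with 4 of 10 runs resolved.
runs = ["GOOD_SUCCESS"] * 4 + ["GOOD_FAILURE"] * 3 + ["BAD_FAILURE"] * 3
print(dominant_verdict(runs))
```

Under this reading the dominant verdict summarizes the mix of run outcomes and should be read alongside pass@1, not as a substitute for it.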
RTL / digital-logic design checked under hardened simulation and formal-equivalence harnesses.
| Task | #runs | resolved k/n | pass@1 | dominant verdict | |
|---|---|---|---|---|---|
| Enable-gated streaming fold stage | 3 | 3/3 | 100.0% | GOOD_SUCCESS | view → |
| Instruction-retire commit handshake | 3 | 3/3 | 100.0% | BAD_SUCCESS | view → |
| Multi-cycle signed divider with a start/valid handshake | 3 | 0/3 | 0.0% | GOOD_FAILURE | view → |
| Resynchronising serial byte receiver | 3 | 0/3 | 0.0% | BAD_FAILURE | view → |
| Serial bit-destuff framer | 3 | 3/3 | 100.0% | GOOD_SUCCESS | view → |
| Wait-state register-file completer | 3 | 3/3 | 100.0% | GOOD_SUCCESS | view → |
| open-drain-command-engine | 10 | 8/10 | 80.0% | GOOD_SUCCESS | view → |
Quantitative reasoning across math, physics, and the natural sciences with deterministic, checkable answers.
| Task | #runs | resolved k/n | pass@1 | dominant verdict | |
|---|---|---|---|---|---|
| adaptive-quadrature | 10 | 10/10 | 100.0% | GOOD_SUCCESS | view → |
| anova-stats | 11 | 11/11 | 100.0% | GOOD_SUCCESS | view → |
| cg-solver | 40 | 28/40 | 70.0% | GOOD_SUCCESS | view → |
| cholesky-solver | 10 | 0/10 | 0.0% | GOOD_FAILURE | view → |
| cubic-spline | 10 | 10/10 | 100.0% | GOOD_SUCCESS | view → |
Interactive simulation and game-logic tasks graded on exact state transitions and rule fidelity.
| Task | #runs | resolved k/n | pass@1 | dominant verdict | |
|---|---|---|---|---|---|
| game-of-life-step | 15 | 15/15 | 100.0% | GOOD_SUCCESS | view → |
| task_0026_codex_camera_shake_rig | 10 | 4/10 | 40.0% | HARNESS_ERROR | view → |
| task_0040_day_night_cycle_controller | 10 | 1/10 | 10.0% | GOOD_FAILURE | view → |
| task_0053_car_scene_assembly | 10 | 6/10 | 60.0% | GOOD_SUCCESS | view → |
| task_0054_assemble_crusader_animatedsprite | 10 | 4/10 | 40.0% | GOOD_FAILURE | view → |
| task_0108_track_particles_driven_by_tile_type_gradient | 10 | 2/10 | 20.0% | HARNESS_ERROR | view → |
| task_0131_minimap_ui_complex | 10 | 0/10 | 0.0% | GOOD_FAILURE | view → |
| task_0132_minimap_marker_logic_complex | 10 | 4/10 | 40.0% | GOOD_FAILURE | view → |
| task_9001_checkpoint_system | 10 | 4/10 | 40.0% | GOOD_FAILURE | view → |
| task_9002_combo_score_system | 10 | 6/10 | 60.0% | GOOD_SUCCESS | view → |
Vulnerability discovery, exploitation, and remediation verified against a concrete security objective.
| Task | #runs | resolved k/n | pass@1 | dominant verdict | |
|---|---|---|---|---|---|
| 59-fix-broken-cognito-m2m-httpapi-jwt-scope-gated | 10 | 1/10 | 10.0% | GOOD_FAILURE | view → |
| 60-fix-broken-ecs-fargate-secrets-kms-exec-role | 7 | 6/7 | 85.7% | GOOD_SUCCESS | view → |
| apigw-http-api-jwt-authorizer-lambda-integration | 10 | 3/10 | 30.0% | BAD_FAILURE | view → |
| athena-workgroup-result-encryption-cmk-enforced | 5 | 2/5 | 40.0% | GOOD_FAILURE | view → |
| ecr-image-scan-lifecycle-immutable-tags-replication | 9 | 3/9 | 33.3% | GOOD_FAILURE | view → |
| efs-access-point-posix-iam-mount-target | 10 | 3/10 | 30.0% | BAD_FAILURE | view → |
| fix-broken-appsync-graphql-cognito-resolver-cache-leak | 10 | 1/10 | 10.0% | GOOD_FAILURE | view → |
| glue-etl-catalog-security-configuration-kms | 9 | 3/9 | 33.3% | GOOD_FAILURE | view → |
| iam-permissions-boundary-ceiling | 10 | 2/10 | 20.0% | GOOD_FAILURE | view → |
Real-world bug-fix tasks distilled from open-source issues (esbuild, klauspost/compress, rust-lang/semver), graded against each project's own tests.
| Task | #runs | resolved k/n | pass@1 | dominant verdict | |
|---|---|---|---|---|---|
| evanw-esbuild-4417 | 5 | 5/5 | 100.0% | GOOD_SUCCESS | view → |
| klauspost-compress-1115 | 5 | 5/5 | 100.0% | GOOD_SUCCESS | view → |
| rust-lang-semver-305 | 5 | 5/5 | 100.0% | GOOD_SUCCESS | view → |
Train and tune models to a target metric on real engineering datasets, graded on held-out performance against a solved-reward threshold.
| Task | #runs | resolved k/n | pass@1 | dominant verdict | |
|---|---|---|---|---|---|
| airfoil-self-noise | 10 | 4/10 | 40.0% | GOOD_FAILURE | view → |
Physics-informed and surrogate modeling: PDE forecasting and CFD/FEA prediction, graded on quantitative predictive accuracy.
| Task | #runs | resolved k/n | pass@1 | dominant verdict | |
|---|---|---|---|---|---|
| airfrans-high-reynolds-drag-extrapolation | 10 | 4/10 | 40.0% | GOOD_FAILURE | view → |
| ks-equation-1d-forecast | 10 | 4/10 | 40.0% | GOOD_FAILURE | view → |
| simjeb-bracket-fea-mass-prediction-real | 10 | 2/10 | 20.0% | GOOD_FAILURE | view → |
Modeling and statistical inference on real datasets (federated learning, event studies), graded against held-out ground truth.
| Task | #runs | resolved k/n | pass@1 | dominant verdict | |
|---|---|---|---|---|---|
| fedavg-federated-noniid-mnist | 10 | 2/10 | 20.0% | GOOD_FAILURE | view → |
| product-recall-stock-price-event | 10 | 3/10 | 30.0% | GOOD_FAILURE | view → |
Bias-correction and outlier-robust analysis in R: recover trustworthy estimates from messy data, checked against reference results.
| Task | #runs | resolved k/n | pass@1 | dominant verdict | |
|---|---|---|---|---|---|
| coffee-ratings-outliers | 10 | 4/10 | 40.0% | GOOD_FAILURE | view → |
| lending-club-lgd-bias-correction-r | 10 | 4/10 | 40.0% | GOOD_FAILURE | view → |
Applied product analytics: causal impact and decision analysis on real product datasets (R).
| Task | #runs | resolved k/n | pass@1 | dominant verdict | |
|---|---|---|---|---|---|
| ipl-toss-impact-analysis-r | 10 | 4/10 | 40.0% | GOOD_SUCCESS | view → |
Population PK/PD modeling with nonlinear mixed-effects: fit drug-exposure models and recover the correct parameters.
| Task | #runs | resolved k/n | pass@1 | dominant verdict | |
|---|---|---|---|---|---|
| neonatal-drug-exposure-nlme | 10 | 4/10 | 40.0% | GOOD_SUCCESS | view → |
Reward is each task's tests/test.sh exit code. The grader and reference are withheld from the agent's workspace and restored only for grading; see Methodology.
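As a rough illustration of exit-code grading, the sketch below runs a task's tests/test.sh and reads back its exit code. It is not the actual harness: invoking the script with bash from the task root, the timeout, and treating exit code 0 as "resolved" (the usual shell convention) are assumptions, and `make_stub_task` is a hypothetical helper that only builds a throwaway directory for the demonstration.

```python
import subprocess
import tempfile
from pathlib import Path


def run_verifier(task_dir: Path, timeout_s: float = 600.0) -> int:
    """Run the task's tests/test.sh and return its exit code."""
    # Assumption: the script is invoked with bash from the task root.
    completed = subprocess.run(
        ["bash", "tests/test.sh"], cwd=task_dir, timeout=timeout_s
    )
    return completed.returncode


def is_resolved(exit_code: int) -> bool:
    # Assumption: exit code 0 means the verifier passed.
    return exit_code == 0


def make_stub_task(exit_code: int) -> Path:
    """Hypothetical helper: a throwaway task directory with a stub test.sh."""
    root = Path(tempfile.mkdtemp())
    (root / "tests").mkdir()
    (root / "tests" / "test.sh").write_text(f"#!/bin/bash\nexit {exit_code}\n")
    return root


for stub_code in (0, 1):
    observed = run_verifier(make_stub_task(stub_code))
    print(f"exit={observed} resolved={is_resolved(observed)}")
```

Keeping the verdict in the process exit code means the grader can be written in any language, as long as tests/test.sh wraps it.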