VERIFIER v1.0

human corrections for world-model rollouts · drone search & rescue sim · determinism: replayable
Select a rollout.
model flight correction branch survivor beacon no-fly zone
TAKEOVER LIVE — fly the drone: WASD / arrows. Reaching the survivor or an adverse event ends the branch. Esc aborts.
tick 0 / 0

Corrections on this rollout

None yet.
Judge which behavior you'd rather see deployed. Choosing "Both bad" routes the segment to the takeover queue.
queue:

A

B

margin:

Takeover queue (from both-bad)

Empty.

Train & Eval — close the loop

Corrections become a new checkpoint; the new checkpoint gets evaluated on held-out seeds; the eval attributes improvement by failure category. This panel is the half of the loop nobody builds.
No corrections yet — fly takeovers in Review, or synthesize expert corrections to simulate volume.

Before / after — held-out worlds, by failure category

Correction shelf life

Which corrections does the new checkpoint still visit? Stale corrections are the DAgger decay made visible — the reason this platform is a flow, not a dataset.
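One way to make "still visits" concrete is sketched below. It treats a correction as stale when the new checkpoint's trajectory on the same world never comes within a fixed radius of the point where the human took over. The record shape, the 2D positions, and the `radius` threshold are all assumptions for illustration, not the platform's actual definition.

```python
import math

def still_visited(takeover_point, trajectory, radius=1.0):
    """True if any trajectory point lies within `radius` of the takeover point."""
    return any(math.dist(takeover_point, p) <= radius for p in trajectory)

def stale_fraction(corrections, trajectories_by_world, radius=1.0):
    """Fraction of corrections the new checkpoint no longer visits.

    corrections: list of (world_seed, (x, y)) takeover points (hypothetical shape).
    trajectories_by_world: dict world_seed -> list of (x, y) points flown by
    the new checkpoint on that world.
    """
    if not corrections:
        return 0.0
    stale = 0
    for world_seed, point in corrections:
        trajectory = trajectories_by_world.get(world_seed, [])
        if not still_visited(point, trajectory, radius):
            stale += 1
    return stale / len(corrections)

# Illustrative data only: three corrections, two held-out worlds.
corrections = [(1, (0.0, 0.0)), (1, (5.0, 5.0)), (2, (2.0, 2.0))]
new_trajectories = {1: [(0.0, 0.5), (1.0, 1.0)], 2: [(9.0, 9.0)]}
fraction = stale_fraction(corrections, new_trajectories)
```

A rising stale fraction across checkpoints would indicate that older corrections describe states the policy has stopped reaching, which is the decay the panel is meant to show.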

Training runs

Training runs from the Python / Prime Intellect pipeline appear here once the pipeline has produced artifacts.

Runs Explorer — the platform at scale

Scale — 1M clips/sec and the storage split

Operator corps — quality, calibration, economics

Five simulated operators judge the same gold-standard pairs. Calibration decides each operator's export weight — and the value table asks the guild-vs-crowd question: is one veteran worth more than the crowd?
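The mapping from gold accuracy to export weight is not specified here, so the following is only a sketch under stated assumptions: weight is accuracy rescaled above chance (0.5 for pairwise judgments) and clipped at zero, effective signal is judgments times weight, and the last column divides that by cost units. The `Operator` fields and the formula are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Operator:
    name: str
    gold_correct: int   # gold pairs judged correctly
    gold_total: int     # gold pairs shown
    judgments: int      # non-gold judgments submitted
    cost_units: float   # total cost of this operator's judgments

def gold_accuracy(op: Operator) -> float:
    return op.gold_correct / op.gold_total if op.gold_total else 0.0

def export_weight(accuracy: float, chance: float = 0.5) -> float:
    # Assumed rule: 0 at or below chance, 1 at perfect accuracy.
    return max(0.0, (accuracy - chance) / (1.0 - chance))

def value_row(op: Operator) -> dict:
    weight = export_weight(gold_accuracy(op))
    signal = op.judgments * weight
    return {
        "operator": op.name,
        "judgments": op.judgments,
        "effective signal": signal,
        "cost units": op.cost_units,
        "signal / cost": signal / op.cost_units if op.cost_units else 0.0,
    }

# Illustrative numbers only: one expensive veteran, one cheap crowd worker.
veteran = value_row(Operator("veteran", 19, 20, 100, 50.0))
crowd = value_row(Operator("crowd-1", 12, 20, 100, 10.0))
```

Under this rule the guild-vs-crowd comparison reduces to comparing the `signal / cost` values of individual rows, or of summed rows for a crowd.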

Calibration (gold tasks)

operator | archetype | gold accuracy | export weight

Guild vs crowd

Value table

operator | judgments | effective signal | cost units | signal / cost

Organic play mining

Free-play runs from Review are auto-paired against model rollouts on the same world. Directed correction is performance under observation; organic play is the Medal thesis — capture people at their best because nobody's watching.
No free-play rollouts yet — press F in Review and just fly.
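The auto-pairing step can be sketched as a join on the world identifier. This is a minimal illustration that assumes each rollout record carries an `id` and a `world_seed` field; both names are hypothetical.

```python
from collections import defaultdict

def pair_by_world(free_play, model_rollouts):
    """Pair every free-play run with every model rollout on the same world."""
    model_by_world = defaultdict(list)
    for rollout in model_rollouts:
        model_by_world[rollout["world_seed"]].append(rollout)
    pairs = []
    for run in free_play:
        for rollout in model_by_world.get(run["world_seed"], []):
            pairs.append((run["id"], rollout["id"]))
    return pairs

# Illustrative records only.
free_play = [{"id": "fp-1", "world_seed": 7}, {"id": "fp-2", "world_seed": 9}]
model_rollouts = [{"id": "m-1", "world_seed": 7}, {"id": "m-2", "world_seed": 7}]
pairs = pair_by_world(free_play, model_rollouts)
```

Free-play runs on worlds with no model rollout produce no pairs, as with `fp-2` here.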

Learned verifier — online Bradley-Terry over trajectory features

Trains live on your Compare judgments. Human preferences are the only supervision signal — this is the verifiable-domain loop in miniature.

Rollout ranking (by current verifier score)

# | rollout | checkpoint | outcome | score

Failure taxonomy by checkpoint

Auto events from the sim plus your annotations. This table is the seed of the eval loop: did corrections in a category reduce failures in that category on the next checkpoint?
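The question this table poses can be sketched as a per-category change in failure rate between two checkpoints. The category names and the one-label-per-failed-rollout input shape are assumptions for illustration.

```python
from collections import Counter

def failure_rates(failure_labels, n_rollouts):
    """Failure rate per category: labelled failures divided by rollouts evaluated."""
    counts = Counter(failure_labels)
    return {category: count / n_rollouts for category, count in counts.items()}

def category_deltas(before, after):
    """Change in failure rate per category; negative means fewer failures."""
    categories = sorted(set(before) | set(after))
    return {c: after.get(c, 0.0) - before.get(c, 0.0) for c in categories}

# Illustrative data only: 10 held-out rollouts per checkpoint.
before = failure_rates(["no_fly_violation"] * 4 + ["battery_out"] * 2, 10)
after = failure_rates(["no_fly_violation"] * 1 + ["battery_out"] * 2, 10)
deltas = category_deltas(before, after)
```

Reading `deltas` next to the number of corrections collected per category shows whether improvement landed where the corrections were made.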

Export — schema v0.1 records

Records are the source of truth; training formats are views. The bundle strips internal fields and carries a manifest.
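A minimal sketch of such a bundle follows. It assumes internal fields are marked with a leading underscore and that the manifest carries the schema version, a record count, and a content hash. Only the schema version "0.1" comes from this page; the field names and the manifest layout are hypothetical.

```python
import hashlib
import json

INTERNAL_PREFIX = "_"  # assumed convention for internal-only fields

def strip_internal(record: dict) -> dict:
    """Drop top-level keys marked as internal."""
    return {k: v for k, v in record.items() if not k.startswith(INTERNAL_PREFIX)}

def build_bundle(records, schema_version: str = "0.1") -> dict:
    clean = [strip_internal(r) for r in records]
    # Canonical JSON so the hash is stable across key order.
    payload = json.dumps(clean, sort_keys=True, separators=(",", ":")).encode("utf-8")
    manifest = {
        "schema_version": schema_version,
        "record_count": len(clean),
        "sha256": hashlib.sha256(payload).hexdigest(),
    }
    return {"manifest": manifest, "records": clean}

# Illustrative records only.
records = [
    {"rollout": "r-1", "preferred": "A", "_operator_ip": "internal"},
    {"rollout": "r-2", "preferred": "B", "_debug": True},
]
bundle = build_bundle(records)
```

Training formats can then be derived from `bundle["records"]` as views, leaving the stored records untouched.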

  

Annotate correction