Legend: model flight, correction branch, survivor beacon, no-fly zone
tick 0 / 0
Corrections on this rollout
None yet.
Judge which behavior you'd rather see deployed. Choosing "Both bad" routes the segment to the takeover queue.
queue:
A
B
Takeover queue (from both-bad)
Empty.
Train & Eval — close the loop
Corrections become a new checkpoint; the new checkpoint gets evaluated on held-out seeds; the eval attributes improvement by failure category. This panel is the half of the loop nobody builds.
Before / after — held-out worlds, by failure category
Correction shelf life
Which corrections does the new checkpoint still visit? Stale corrections are the DAgger decay made visible — the reason this platform is a flow, not a dataset.
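One way to read "still visit" is a membership check: a correction stays live if the new checkpoint reaches the state where it was recorded, and goes stale otherwise. The sketch below assumes that reading. `Correction`, `state_key` (a discretized state such as a grid cell), and `visited_by_seed` are hypothetical names, not part of the platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Correction:
    correction_id: str
    seed: int          # world the correction was recorded on
    state_key: tuple   # hypothetical discretized state, e.g. a grid cell


def shelf_life(corrections, visited_by_seed):
    """Split corrections into live and stale ids for a new checkpoint.

    visited_by_seed maps a world seed to the set of state keys the new
    checkpoint visited on that world (an assumed representation).
    """
    live, stale = [], []
    for c in corrections:
        visited = visited_by_seed.get(c.seed, set())
        (live if c.state_key in visited else stale).append(c.correction_id)
    return live, stale


corrections = [
    Correction("c1", seed=7, state_key=(3, 4)),
    Correction("c2", seed=7, state_key=(9, 9)),
    Correction("c3", seed=8, state_key=(0, 1)),
]
visited_by_seed = {7: {(3, 4), (3, 5)}, 8: {(2, 2)}}

live, stale = shelf_life(corrections, visited_by_seed)
print("live:", live)
print("stale:", stale)
```

Exact matching on a discretized key is the simplest choice; a distance threshold over continuous states would be a natural alternative.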
Training runs
Training runs from the Python / Prime Intellect pipeline appear here once the pipeline has produced artifacts.
Runs Explorer — the platform at scale
Scale — 1M clips/sec and the storage split
Operator corps — quality, calibration, economics
Five simulated operators judge the same gold-standard pairs. Calibration decides each operator's export weight — and the value table asks the guild-vs-crowd question: is one veteran worth more than the crowd?
Calibration (gold tasks)
| operator | archetype | gold accuracy | export weight |
|---|---|---|---|
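A minimal sketch of how gold accuracy could map to an export weight. The mapping is an assumption: chance-level accuracy on A/B pairs (0.5) gets weight 0 and perfect accuracy gets weight 1. The operator names and pair ids are made up for illustration.

```python
def gold_accuracy(judgments, gold):
    """Fraction of gold pairs the operator labeled correctly."""
    scored = [pair_id for pair_id in judgments if pair_id in gold]
    if not scored:
        return 0.0
    hits = sum(judgments[pair_id] == gold[pair_id] for pair_id in scored)
    return hits / len(scored)


def export_weight(accuracy):
    """Assumed mapping: 0.5 (chance) -> 0.0, 1.0 (perfect) -> 1.0, clipped at 0."""
    return max(0.0, 2.0 * accuracy - 1.0)


gold = {"p1": "A", "p2": "B", "p3": "A", "p4": "B"}
operators = {
    "veteran": {"p1": "A", "p2": "B", "p3": "A", "p4": "B"},
    "coin_flip": {"p1": "A", "p2": "A", "p3": "B", "p4": "B"},
}

weights = {
    name: export_weight(gold_accuracy(judgments, gold))
    for name, judgments in operators.items()
}
print(weights)
```

Clipping at zero means an operator at or below chance contributes nothing to the export rather than a negative signal.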
Guild vs crowd
Value table
| operator | judgments | effective signal | cost units | signal / cost |
|---|---|---|---|---|
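The value-table columns can be sketched as simple arithmetic. This assumes effective signal is judgments multiplied by export weight and cost is linear in judgments; neither definition is confirmed by the panel, and the numbers are illustrative.

```python
def value_row(judgments, export_weight, cost_per_judgment):
    """Compute one value-table row under the assumed definitions above."""
    effective_signal = judgments * export_weight
    cost_units = judgments * cost_per_judgment
    signal_per_cost = effective_signal / cost_units if cost_units else 0.0
    return {
        "judgments": judgments,
        "effective_signal": effective_signal,
        "cost_units": cost_units,
        "signal_per_cost": signal_per_cost,
    }


# Hypothetical operators: an expensive, well-calibrated veteran
# and a cheap, noisier crowd member.
veteran = value_row(judgments=100, export_weight=1.0, cost_per_judgment=4.0)
crowd_member = value_row(judgments=100, export_weight=0.5, cost_per_judgment=1.0)

print("veteran:", veteran["signal_per_cost"])
print("crowd member:", crowd_member["signal_per_cost"])
```

With these made-up inputs the crowd member wins on signal per cost even at half the weight, which is the trade-off the guild-vs-crowd question probes.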
Organic play mining
Free-play runs from Review are auto-paired against model rollouts on the same world. Directed correction is performance under observation; organic play is the Medal thesis — capture people at their best because nobody's watching.
Learned verifier — online Bradley-Terry over trajectory features
Trains live on your Compare judgments. Human preferences are the only supervision signal — this is the verifiable-domain loop in miniature.
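A minimal sketch of an online Bradley-Terry update, assuming a linear score over trajectory features and one stochastic-gradient step per judgment. The feature layout (`[reached_beacon, no_fly_violations]`) and the learning rate are hypothetical. "Both bad" judgments are assumed to be skipped, since they carry no pairwise preference.

```python
import math


class OnlineBradleyTerry:
    """Linear Bradley-Terry model: P(A beats B) = sigmoid(score(A) - score(B))."""

    def __init__(self, n_features, lr=0.1):
        self.w = [0.0] * n_features
        self.lr = lr

    def score(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x))

    def prob_a_wins(self, xa, xb):
        return 1.0 / (1.0 + math.exp(-(self.score(xa) - self.score(xb))))

    def update(self, xa, xb, a_preferred):
        """One gradient step on the log-loss of a single human judgment."""
        p = self.prob_a_wins(xa, xb)
        y = 1.0 if a_preferred else 0.0
        for i, (a, b) in enumerate(zip(xa, xb)):
            self.w[i] += self.lr * (y - p) * (a - b)


# Hypothetical features: [reached_beacon, no_fly_violations]
good = [1.0, 0.0]
bad = [0.0, 2.0]

verifier = OnlineBradleyTerry(n_features=2)
for _ in range(20):
    verifier.update(good, bad, a_preferred=True)

# Rank rollouts by current verifier score, highest first.
ranking = sorted([("bad", bad), ("good", good)],
                 key=lambda item: verifier.score(item[1]), reverse=True)
print([name for name, _ in ranking])
```

Because each judgment triggers a single small step, the ranking can shift as soon as a new preference arrives, which fits a verifier that trains live.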
Rollout ranking (by current verifier score)
| # | rollout | checkpoint | outcome | score |
|---|---|---|---|---|
Failure taxonomy by checkpoint
Auto events from the sim plus your annotations. This table is the seed of the eval loop: did corrections in a category reduce failures in that category on the next checkpoint?
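The question above can be sketched as a per-category comparison of failure rates between two checkpoints. The category names, the event shape, and the `corrected` set are hypothetical; the sketch assumes both checkpoints ran the same number of rollouts on the same held-out seeds.

```python
from collections import Counter


def failure_rates(events, n_rollouts):
    """Failures per rollout, keyed by category."""
    counts = Counter(event["category"] for event in events)
    return {category: n / n_rollouts for category, n in counts.items()}


def category_report(before, after, corrected):
    """Rate change per category, flagged by whether it received corrections."""
    report = {}
    for category in sorted(set(before) | set(after)):
        delta = after.get(category, 0.0) - before.get(category, 0.0)
        report[category] = {
            "delta": delta,
            "had_corrections": category in corrected,
        }
    return report


before_events = [{"category": "no_fly_violation"}] * 4 + [{"category": "missed_beacon"}] * 2
after_events = [{"category": "no_fly_violation"}] * 1 + [{"category": "missed_beacon"}] * 2

before = failure_rates(before_events, n_rollouts=10)
after = failure_rates(after_events, n_rollouts=10)
report = category_report(before, after, corrected={"no_fly_violation"})

for category, row in report.items():
    print(category, round(row["delta"], 3), row["had_corrections"])
```

A negative delta in a corrected category alongside a flat delta in an uncorrected one is the pattern that would support attributing the improvement to the corrections.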
Export — schema v0.1 records
Records are the source of truth; training formats are views. The bundle strips internal fields and carries a manifest.
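A minimal sketch of the export idea, under loud assumptions: internal fields are taken to be those whose key starts with an underscore, and the manifest carries a schema version, a record count, and a SHA-256 of the payload. The record fields and the preference view are hypothetical, not the actual v0.1 schema.

```python
import hashlib
import json

SCHEMA_VERSION = "0.1"


def strip_internal(record):
    """Drop internal fields (assumed: keys starting with an underscore)."""
    return {k: v for k, v in record.items() if not k.startswith("_")}


def build_bundle(records):
    """Return (manifest, payload); payload is one JSON record per line."""
    public = [strip_internal(r) for r in records]
    payload = "\n".join(json.dumps(r, sort_keys=True) for r in public)
    manifest = {
        "schema_version": SCHEMA_VERSION,
        "record_count": len(public),
        "sha256": hashlib.sha256(payload.encode("utf-8")).hexdigest(),
    }
    return manifest, payload


def preference_view(records):
    """A training-format view: (chosen, rejected) pairs derived from records."""
    pairs = []
    for r in records:
        if r.get("verdict") == "A":
            pairs.append((r["a"], r["b"]))
        elif r.get("verdict") == "B":
            pairs.append((r["b"], r["a"]))
        # Any other verdict (e.g. both bad) yields no preference pair.
    return pairs


records = [
    {"a": "r1", "b": "r2", "verdict": "A", "_operator_note": "internal"},
    {"a": "r3", "b": "r4", "verdict": "both_bad", "_operator_note": "internal"},
]

manifest, payload = build_bundle(records)
pairs = preference_view(records)
print(manifest["record_count"], pairs)
```

Keeping the view as a pure function of the records means a new training format can be added without touching what is stored.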