Run lifecycle¶
A methodic.Run resource handle is bound to a single (experiment, variation, run) triple — get one from chronicle.run(experiment_id, variation, run). The flow:
get_variation_config()
│
▼
run.start()
│
├──► run.heartbeat() (loop, every 60s)
│
├──► run.upload_asset(...) (as outputs are produced)
│
▼
run.succeed() or run.fail(reason="crash")
│ │
▼ ▼
chronicle.close() chronicle.close()
Chronicle is constructed once per process (with Chronicle(...) as chronicle: is the easiest pattern); Run handles are cheap to create and discard.
Mutators chain¶
Lifecycle methods on Run return self, so calls can chain:
States and transitions¶
| State | Set by | Terminal? |
|---|---|---|
pending |
Chronicle (run creation) | no |
running |
run.start() |
no |
succeeded |
run.succeed() |
yes |
failed_crash |
run.fail(reason="crash") |
yes |
failed_abandoned |
run.fail(reason="abandoned") |
yes |
lost |
Chronicle (heartbeat timeout) | no — recoverable for 48h |
failed_lost |
Chronicle (after 48h of lost) |
yes |
run.succeed() blocks until pending background uploads finish. chronicle.close() (or exiting the context manager) shuts down the shared upload pool and the HTTP session.
Heartbeats¶
Chronicle has a 15-minute watchdog. Runs that don't heartbeat within that window transition to lost and the worker is torn down (for Chronicle-managed instances). Send heartbeats from a background thread:
import threading
import time
def heartbeat_loop(run, stop_event):
while not stop_event.is_set():
run.heartbeat()
stop_event.wait(60)
stop = threading.Event()
threading.Thread(target=heartbeat_loop, args=(run, stop), daemon=True).start()
# ... training ...
stop.set()
A run that has gone lost can return to running if a heartbeat arrives within 48h. Beyond that, Chronicle marks it failed_lost and the run is terminal.
Failure semantics¶
run.fail(reason="crash")— unrecoverable error inside the runner (exception, OOM, etc.). Chronicle records the reason verbatim.run.fail(reason="abandoned")— user or agent cancelled the run. Distinguished so abandoned runs don't pollute crash metrics.
There is no automatic retry. Chronicle creates a new run (incrementing the run number) if the agent decides to retry; reruns are explicit, not implicit.