Run lifecycle¶

A methodic.Run resource handle is bound to a single (experiment, variation, run) triple — get one from chronicle.run(experiment_id, variation, run). The flow:

get_variation_config()
        │
        ▼
   run.start()
        │
        ├──► run.heartbeat()  (loop, every 60s)
        │
        ├──► run.upload_asset(...) (as outputs are produced)
        │
        ▼
   run.succeed()    or    run.fail(reason="crash")
        │                      │
        ▼                      ▼
   chronicle.close()      chronicle.close()

Chronicle is constructed once per process (with Chronicle(...) as chronicle: is the easiest pattern); Run handles are cheap to create and discard.

Mutators chain¶

Lifecycle methods on Run return self, so calls can chain:

chronicle.run(exp_id, 0, 0).start().heartbeat()

States and transitions¶

State	Set by	Terminal?
`pending`	Chronicle (run creation)	no
`running`	`run.start()`	no
`succeeded`	`run.succeed()`	yes
`failed_crash`	`run.fail(reason="crash")`	yes
`failed_abandoned`	`run.fail(reason="abandoned")`	yes
`lost`	Chronicle (heartbeat timeout)	no — recoverable for 48h
`failed_lost`	Chronicle (after 48h of `lost`)	yes

run.succeed() blocks until pending background uploads finish. chronicle.close() (or exiting the context manager) shuts down the shared upload pool and the HTTP session.

Heartbeats¶

Chronicle has a 15-minute watchdog. Runs that don't heartbeat within that window transition to lost and the worker is torn down (for Chronicle-managed instances). Send heartbeats from a background thread:

import threading
import time

def heartbeat_loop(run, stop_event):
    while not stop_event.is_set():
        run.heartbeat()
        stop_event.wait(60)

stop = threading.Event()
threading.Thread(target=heartbeat_loop, args=(run, stop), daemon=True).start()
# ... training ...
stop.set()

A run that has gone lost can return to running if a heartbeat arrives within 48h. Beyond that, Chronicle marks it failed_lost and the run is terminal.

Failure semantics¶

run.fail(reason="crash") — unrecoverable error inside the runner (exception, OOM, etc.). Chronicle records the reason verbatim.
run.fail(reason="abandoned") — user or agent cancelled the run. Distinguished so abandoned runs don't pollute crash metrics.

There is no automatic retry. Chronicle creates a new run (incrementing the run number) if the agent decides to retry; reruns are explicit, not implicit.