Skip to content

Run lifecycle

A methodic.Run resource handle is bound to a single (experiment, variation, run) triple — get one from chronicle.run(experiment_id, variation, run). The flow:

get_variation_config()
   run.start()
        ├──► run.heartbeat()  (loop, every 60s)
        ├──► run.upload_asset(...) (as outputs are produced)
   run.succeed()    or    run.fail(reason="crash")
        │                      │
        ▼                      ▼
   chronicle.close()      chronicle.close()

Chronicle is constructed once per process (with Chronicle(...) as chronicle: is the easiest pattern); Run handles are cheap to create and discard.

Mutators chain

Lifecycle methods on Run return self, so calls can chain:

chronicle.run(exp_id, 0, 0).start().heartbeat()

States and transitions

State Set by Terminal?
pending Chronicle (run creation) no
running run.start() no
succeeded run.succeed() yes
failed_crash run.fail(reason="crash") yes
failed_abandoned run.fail(reason="abandoned") yes
lost Chronicle (heartbeat timeout) no — recoverable for 48h
failed_lost Chronicle (after 48h of lost) yes

run.succeed() blocks until pending background uploads finish. chronicle.close() (or exiting the context manager) shuts down the shared upload pool and the HTTP session.

Heartbeats

Chronicle has a 15-minute watchdog. Runs that don't heartbeat within that window transition to lost and the worker is torn down (for Chronicle-managed instances). Send heartbeats from a background thread:

import threading
import time

def heartbeat_loop(run, stop_event):
    while not stop_event.is_set():
        run.heartbeat()
        stop_event.wait(60)

stop = threading.Event()
threading.Thread(target=heartbeat_loop, args=(run, stop), daemon=True).start()
# ... training ...
stop.set()

A run that has gone lost can return to running if a heartbeat arrives within 48h. Beyond that, Chronicle marks it failed_lost and the run is terminal.

Failure semantics

  • run.fail(reason="crash") — unrecoverable error inside the runner (exception, OOM, etc.). Chronicle records the reason verbatim.
  • run.fail(reason="abandoned") — user or agent cancelled the run. Distinguished so abandoned runs don't pollute crash metrics.

There is no automatic retry. Chronicle creates a new run (incrementing the run number) if the agent decides to retry; reruns are explicit, not implicit.