
Resumable uploads

Long training runs produce checkpoints gigabytes in size. If the runner crashes mid-upload, you don't want to start over. methodic ships an UploadTracker — a local SQLite database (WAL mode, thread-safe) that records upload state so a restarted process can resume where it left off.

When to use it

Use the tracker for any flow where:

  • A producer (checkpoint manager, renderer) emits files faster than they upload
  • Multiple components form a single logical asset and must finalize together
  • A crash mid-upload should not lose progress

Skip it for one-off small uploads — run.upload_asset() and run.create_asset_presigned() are fine on their own.

Pattern

from methodic import Chronicle, UploadTracker
from pathlib import Path

tracker = UploadTracker(db_path=Path("./uploads.sqlite"))

with Chronicle(server_url="...", api_key="sk_agent_...") as chronicle:
    run = chronicle.run(experiment_id, variation=0, run=0)  # experiment_id from earlier setup

    # Producer registers files; the upload pool drains them in the background.
    run.register_and_upload_async(
        local_dir=Path("./out/checkpoint-1000"),
        asset_type="checkpoint",
        upload_tracker=tracker,
    )

    # ... continue training; uploads happen in the background ...

    run.wait_for_uploads()  # block until queue drains
    run.succeed()           # also waits for pending uploads

Crash recovery

On startup, query the tracker for any unfinished uploads from the previous process:

pending = tracker.get_pending_uploads()
if pending:
    # Re-upload incomplete components and finalize.
    for upload in pending:
        ...  # see menlo-park's _drain_incomplete_uploads for the canonical pattern

Chronicle's asset finalization is idempotent — re-uploading a component that was already PUT to cloud storage is safe; the only cost is bandwidth. run.finalize_asset() is also safe to call twice.
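Because finalization is idempotent, a recovery drain can blindly re-upload everything that was in flight and then finalize. A minimal sketch — the record shape (`asset_id`, `local_path`) and the `reupload`/`finalize` callables are illustrative assumptions, not methodic's actual types:

```python
from collections import defaultdict

def drain_incomplete_uploads(pending, reupload, finalize):
    """Group pending component records by asset and resume each asset.

    `pending` is a list of dicts like {"asset_id": ..., "local_path": ...}
    (an assumed shape for illustration). `reupload` and `finalize` are
    caller-supplied callables wrapping the real upload/finalize calls.
    """
    by_asset = defaultdict(list)
    for record in pending:
        by_asset[record["asset_id"]].append(record)

    for asset_id, components in by_asset.items():
        for component in components:
            # Safe even if this component already landed in cloud storage
            # before the crash: re-uploading only costs bandwidth.
            reupload(component["local_path"])
        # Safe to call even if finalize already succeeded: idempotent.
        finalize(asset_id)
```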

Concurrency model

  • Multiple producers can register files concurrently — WAL mode handles it.
  • One upload pool drains the queue. Configure parallelism via Chronicle(max_upload_workers=N). The pool is shared across all Run handles created from the same Chronicle instance.
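The "WAL mode handles it" claim is plain SQLite behavior and can be demonstrated without methodic at all. A toy sketch (illustrative single-column schema, not the tracker's real one) in which several producer threads write through their own connections to one WAL-mode file:

```python
import sqlite3
import threading

def concurrent_register_demo(db_path, producers=4, rows_each=25):
    # One-time setup: enable WAL and create a toy table.
    init = sqlite3.connect(db_path)
    init.execute("PRAGMA journal_mode=WAL")
    init.execute("CREATE TABLE IF NOT EXISTS components (path TEXT)")
    init.commit()
    init.close()

    def produce(i):
        conn = sqlite3.connect(db_path)
        # Writers still serialize in WAL; busy_timeout makes a blocked
        # writer wait instead of raising "database is locked".
        conn.execute("PRAGMA busy_timeout=5000")
        for j in range(rows_each):
            conn.execute("INSERT INTO components VALUES (?)",
                         (f"producer{i}/file{j}",))
        conn.commit()
        conn.close()

    threads = [threading.Thread(target=produce, args=(i,))
               for i in range(producers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    conn = sqlite3.connect(db_path)
    (count,) = conn.execute("SELECT COUNT(*) FROM components").fetchone()
    conn.close()
    return count  # producers * rows_each if every write landed
```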

Atomicity

Asset-level atomicity is enforced by Chronicle, not the tracker. An asset stays in pending state — invisible to consumers — from the moment create_asset_presigned() returns until finalize_asset() succeeds. Components stream into cloud storage one at a time, but the asset isn't usable until finalize lands. That's the boundary.
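The pending-until-finalize boundary can be modeled as a two-state machine. A toy sketch of the server-side behavior (not Chronicle's implementation — just the visibility rule):

```python
class ToyAssetStore:
    """Toy model: assets created pending, hidden until finalized."""

    def __init__(self):
        self._assets = {}  # asset_id -> "pending" | "finalized"

    def create_asset(self, asset_id):
        # Mirrors create_asset_presigned(): components may now stream in,
        # but the asset is not yet visible to consumers.
        self._assets[asset_id] = "pending"

    def finalize_asset(self, asset_id):
        # Idempotent: calling twice leaves the same state.
        self._assets[asset_id] = "finalized"

    def list_visible(self):
        # Consumers only ever see finalized assets.
        return [a for a, s in self._assets.items() if s == "finalized"]
```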

The tracker contributes two narrower guarantees:

  • Batch-atomic registration: register_components() writes all rows in one SQLite transaction — either every component for the batch lands, or none does.
  • Persistent state for crash recovery: a restarted process can read what was in flight before the crash and resume.
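The batch-atomic guarantee is ordinary SQLite transaction behavior. A sketch under an assumed toy schema (not methodic's real one): either every row for the batch commits, or an error mid-batch rolls them all back.

```python
import sqlite3

def open_tracker(path=":memory:"):
    # Toy schema for illustration; path is the primary key so a
    # duplicate insert fails partway through a batch.
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS components "
        "(asset_id TEXT, path TEXT PRIMARY KEY, completed INTEGER DEFAULT 0)"
    )
    return conn

def register_components(conn, asset_id, paths):
    # `with conn:` opens a transaction that commits on success and
    # rolls back on any exception — all rows land, or none do.
    with conn:
        conn.executemany(
            "INSERT INTO components (asset_id, path) VALUES (?, ?)",
            [(asset_id, p) for p in paths],
        )
```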

The tracker does not gate visibility — there is no upload thread polling the tracker for new work. Uploads are dispatched by the same code path that registers, immediately after register_components() returns.

The caller-side rule that follows from this: register all components for a given asset_id in a single register_components() call. The completion check all_components_completed(asset_id) doesn't distinguish "all components currently in the table" from "all components that will ever be registered" — so a split registration races. Concretely:

  1. Caller registers [a, b] → upload thread starts uploading them.
  2. a, b complete → all_components_completed(asset_id) returns true → finalize_asset() fires.
  3. Caller then registers [c, d] for the same asset → too late; the asset is already finalized with only a, b.

run.register_and_upload_async() honors this by registering the full component list in one call.
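The three-step race above can be reproduced with a toy tracker (not methodic's code) whose completion check, like the real one, can only see rows currently registered:

```python
class ToyTracker:
    def __init__(self):
        self.components = {}   # path -> completed?
        self.finalized = False

    def register(self, paths):
        for p in paths:
            self.components[p] = False

    def mark_completed(self, path):
        self.components[path] = True
        # Stand-in for all_components_completed(): true when every
        # *currently registered* component is done — it cannot know
        # about components that will be registered later.
        if all(self.components.values()):
            self.finalized = True   # finalize_asset() fires here

tracker = ToyTracker()
tracker.register(["a", "b"])   # step 1: split registration, first half only
tracker.mark_completed("a")
tracker.mark_completed("b")    # step 2: asset finalizes with only a, b
tracker.register(["c", "d"])   # step 3: too late — already finalized
```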

See the UploadTracker API reference for full method docs.