# Resumable uploads
Long training runs produce checkpoints worth gigabytes. If the runner crashes mid-upload, you don't want to start over. methodic ships an UploadTracker — a local SQLite database (WAL mode, thread-safe) that records upload state so a restarted process can resume.
## When to use it
Use the tracker for any flow where:
- A producer (checkpoint manager, renderer) emits files faster than they upload
- Multiple components form a single logical asset and must finalize together
- A crash mid-upload should not lose progress
Skip it for one-off small uploads — `run.upload_asset()` and `run.create_asset_presigned()` are fine on their own.
## Pattern
```python
from pathlib import Path

from methodic import Chronicle, UploadTracker

tracker = UploadTracker(db_path=Path("./uploads.sqlite"))

with Chronicle(server_url="...", api_key="sk_agent_...") as chronicle:
    run = chronicle.run(experiment_id, variation=0, run=0)

    # Producer registers files; the upload pool drains them in the background.
    run.register_and_upload_async(
        local_dir=Path("./out/checkpoint-1000"),
        asset_type="checkpoint",
        upload_tracker=tracker,
    )

    # ... continue training; uploads happen in the background ...

    run.wait_for_uploads()  # block until the queue drains
    run.succeed()           # also waits for pending uploads
```
## Crash recovery
On startup, query the tracker for any unfinished uploads from the previous process:
```python
pending = tracker.get_pending_uploads()
if pending:
    # Re-upload incomplete components and finalize.
    for upload in pending:
        ...  # see menlo-park's _drain_incomplete_uploads for the canonical pattern
```
Chronicle's asset finalization is idempotent — re-uploading a component that was already PUT to cloud storage is safe; the only cost is bandwidth. `run.finalize_asset()` is also safe to call twice.
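The tracker's schema is internal, but the shape of the recovery query is easy to picture. Here is a minimal stand-in sketch using plain `sqlite3` — the table name, columns, and status values are illustrative assumptions, not methodic's actual schema. The point is the selection rule: anything not marked completed, whether it never started or died mid-PUT, is eligible for re-upload.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE components (asset_id TEXT, path TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO components VALUES (?, ?, ?)",
    [
        ("ckpt-1000", "model.bin", "completed"),
        ("ckpt-1000", "optimizer.bin", "uploading"),  # in flight at crash time
        ("ckpt-2000", "model.bin", "pending"),        # never started
    ],
)

def get_pending_uploads():
    # Anything not completed is re-uploaded. Re-PUTting a row that actually
    # finished would also be safe (finalization is idempotent), just wasted
    # bandwidth.
    return conn.execute(
        "SELECT asset_id, path FROM components WHERE status != 'completed'"
    ).fetchall()

print(get_pending_uploads())
# → [('ckpt-1000', 'optimizer.bin'), ('ckpt-2000', 'model.bin')]
```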
## Concurrency model
- Multiple producers can register files concurrently — WAL mode handles it.
- One upload pool drains the queue. Configure parallelism via `Chronicle(max_upload_workers=N)`. The pool is shared across all `Run` handles created from the same `Chronicle` instance.
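The WAL behavior the tracker leans on can be seen with plain `sqlite3` — a sketch, not methodic code. In WAL mode, readers never block on a writer; writers still serialize, so concurrent registrations simply queue up rather than fail. Note WAL requires a file-backed database, not `:memory:`.

```python
import os
import sqlite3
import tempfile
import threading

path = os.path.join(tempfile.mkdtemp(), "uploads.sqlite")
conn = sqlite3.connect(path)
mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
print(mode)  # → 'wal'

conn.execute("CREATE TABLE components (path TEXT, status TEXT)")
conn.commit()

def producer(name, n):
    # Each producer thread opens its own connection; SQLite serializes the
    # writes internally, and the default busy timeout absorbs contention.
    c = sqlite3.connect(path, timeout=5.0)
    for i in range(n):
        with c:  # one transaction per insert
            c.execute("INSERT INTO components VALUES (?, 'pending')", (f"{name}-{i}",))
    c.close()

threads = [threading.Thread(target=producer, args=(f"p{t}", 10)) for t in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

total = conn.execute("SELECT COUNT(*) FROM components").fetchone()[0]
print(total)  # → 40: all concurrent registrations landed
```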
## Atomicity
Asset-level atomicity is enforced by Chronicle, not the tracker. An asset stays in pending state — invisible to consumers — from the moment create_asset_presigned() returns until finalize_asset() succeeds. Components stream into cloud storage one at a time, but the asset isn't usable until finalize lands. That's the boundary.
The tracker contributes two narrower guarantees:
- **Batch-atomic registration:** `register_components()` writes all rows in one SQLite transaction — either every component in the batch lands, or none does.
- **Persistent state for crash recovery:** a restarted process can read what was in flight before the crash and resume.
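The batch-atomicity is ordinary SQLite transaction behavior, sketched here with a stand-in schema (the table and constraint are illustrative assumptions): if any row in the batch fails, the whole batch rolls back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE components ("
    "  asset_id TEXT, path TEXT, status TEXT,"
    "  UNIQUE (asset_id, path)"
    ")"
)

def register_components(asset_id, paths):
    # All rows land in one transaction: `with conn` commits on success and
    # rolls back on any exception, so a failed batch leaves no partial rows.
    with conn:
        conn.executemany(
            "INSERT INTO components VALUES (?, ?, 'pending')",
            [(asset_id, p) for p in paths],
        )

register_components("ckpt-1000", ["model.bin", "optimizer.bin"])

# A batch containing a duplicate fails as a unit: neither row is inserted,
# including 'extra.bin', which on its own would have been valid.
try:
    register_components("ckpt-1000", ["extra.bin", "model.bin"])
except sqlite3.IntegrityError:
    pass

rows = conn.execute("SELECT COUNT(*) FROM components").fetchone()[0]
print(rows)  # → 2: the failed batch left nothing behind
```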
The tracker does not gate visibility — there is no upload thread polling the tracker for new work. Uploads are dispatched by the same code path that registers, immediately after register_components() returns.
The caller-side rule that follows from this: register all components for a given `asset_id` in a single `register_components()` call. The completion check `all_components_completed(asset_id)` doesn't distinguish "all components currently in the table" from "all components that will ever be registered" — so a split registration races. Concretely:
- Caller registers `[a, b]` → upload thread starts uploading them.
- `a`, `b` complete → `all_components_completed(asset_id)` returns true → `finalize_asset()` fires.
- Caller then registers `[c, d]` for the same asset → too late; the asset is already finalized with only `a, b`.
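The race comes straight from how such a completion check has to be written. A stand-in simulation (illustrative schema, not methodic's): the check can only count rows already in the table, so it reports "done" the moment the first batch finishes, regardless of what the caller intends to register later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE components (asset_id TEXT, path TEXT, status TEXT)")

def register(asset_id, paths):
    with conn:
        conn.executemany(
            "INSERT INTO components VALUES (?, ?, 'pending')",
            [(asset_id, p) for p in paths],
        )

def mark_completed(asset_id, path):
    with conn:
        conn.execute(
            "UPDATE components SET status = 'completed' "
            "WHERE asset_id = ? AND path = ?",
            (asset_id, path),
        )

def all_components_completed(asset_id):
    # Counts only rows currently in the table; it cannot know about
    # components the caller has not registered yet.
    (remaining,) = conn.execute(
        "SELECT COUNT(*) FROM components "
        "WHERE asset_id = ? AND status != 'completed'",
        (asset_id,),
    ).fetchone()
    return remaining == 0

register("asset-1", ["a", "b"])
mark_completed("asset-1", "a")
mark_completed("asset-1", "b")
premature = all_components_completed("asset-1")
print(premature)  # → True: finalize would fire here

register("asset-1", ["c", "d"])  # too late; the asset is already finalized
```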
run.register_and_upload_async() honors this by registering the full component list in one call.
See the UploadTracker API reference for full method docs.