Architecture¶
Recommended path
For most user workflows, the recommended path is consist.run(...),
consist.trace(...), or consist.scenario(...). Low-level lifecycle snippets
in this page (for example tracker.start_run(...) + tracker.log_artifact(...))
are advanced and primarily explain internal behavior.
How Caching Works¶
Consist identifies runs using a three-part signature:
| Component | Source | Notes |
|---|---|---|
| Code hash | Configurable code identity | Default repo_git uses Git commit/dirty state; callable modes hash the function module or source |
| Config hash | Canonical JSON of config dict | Normalized for key order and numeric types; Pydantic models serialize deterministically |
| Input hash | SHA256 of input content | For Consist artifacts, uses the producing run's signature (Merkle linking); for raw files, hashes bytes or metadata per hashing_strategy |
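To make the table concrete, here is a minimal sketch of how three component hashes could be folded into one signature. It uses only the standard library and is not Consist's implementation: the function names, the `|`-joined payload layout, and the rule that whole-number floats normalize to ints are all assumptions made for illustration.

```python
import hashlib
import json


def _normalize(value):
    """Recursively normalize config values so equivalent configs hash alike."""
    if isinstance(value, dict):
        return {str(k): _normalize(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [_normalize(v) for v in value]
    # Assumption: whole-number floats are treated like ints (1.0 and 1 match).
    if isinstance(value, float) and value.is_integer():
        return int(value)
    return value


def config_hash(config: dict) -> str:
    """Hash canonical JSON: sorted keys and compact separators."""
    canonical = json.dumps(_normalize(config), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def run_signature(code_hash: str, config: dict, inputs: dict[str, str]) -> str:
    """Combine code, config, and input hashes (hypothetical payload layout)."""
    input_part = ",".join(f"{key}={digest}" for key, digest in sorted(inputs.items()))
    payload = "|".join([code_hash, config_hash(config), input_part])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Key order and 1 vs 1.0 do not matter; a changed value does.
sig_a = run_signature("abc123", {"rate": 1, "mode": "fast"}, {"land_use": "h1"})
sig_b = run_signature("abc123", {"mode": "fast", "rate": 1.0}, {"land_use": "h1"})
sig_c = run_signature("abc123", {"mode": "fast", "rate": 2}, {"land_use": "h1"})
```

Here `sig_a` and `sig_b` are equal while `sig_c` differs, which mirrors the cache-hit behavior described in the next section.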
What Changes Break Cache Hits?¶
| What Changed | Cache Hit? | Why |
|---|---|---|
| Input file content | ❌ No | File hash changes → signature changes |
| Config value | ❌ No | Config hash changes → signature changes |
| Function code | ❌ No | Code hash changes → signature changes |
| runtime_kwargs | ✅ Yes | runtime_kwargs are NOT hashed; don't affect signature |
| Output file names | ✅ Yes | Output names don't affect signature |
| Comments in code | Depends | Under repo_git, tracked comment changes affect code hash. Under callable_source, only comments in the callable source matter. Under callable_module, comments in the callable's module matter. |
Merkle DAG structure: Each run's signature incorporates the signatures of its input artifacts' producing runs. This forms a directed acyclic graph where:
- Changing a parameter invalidates only downstream runs that depend on it
- Identical inputs produce cache hits across machines (given the same code version)
- Provenance validity depends on the lineage graph, not file existence
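The invalidation behavior in the list above can be illustrated with a toy pipeline. This is a sketch under stated assumptions, not Consist's code: the `signature` helper, the step names, and the `"code-v1"` placeholder are hypothetical. The point is only that a run's signature includes its upstream runs' signatures, so a change propagates downstream and nowhere else.

```python
import hashlib
import json


def signature(code_hash: str, config: dict, upstream: list[str]) -> str:
    """Hash a run's own identity together with its upstream run signatures."""
    payload = json.dumps(
        {"code": code_hash, "config": config, "inputs": sorted(upstream)},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def pipeline_signatures(clean_cfg: dict, report_cfg: dict) -> dict:
    """Toy pipeline: clean -> simulate, plus a report with no shared lineage."""
    clean = signature("code-v1", clean_cfg, [])
    simulate = signature("code-v1", {"years": 5}, [clean])  # depends on clean
    report = signature("code-v1", report_cfg, [])  # independent of clean
    return {"clean": clean, "simulate": simulate, "report": report}


before = pipeline_signatures({"threshold": 0.5}, {"title": "Q1"})
after = pipeline_signatures({"threshold": 0.9}, {"title": "Q1"})

# Names of the steps whose signatures changed after editing clean's config.
changed = sorted(name for name in before if before[name] != after[name])
```

Only `clean` and `simulate` end up in `changed`; `report` keeps its signature and would still be a cache hit.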
For detailed terminology, see Core Concepts.
Cache Modes¶
| Mode | Behavior |
|---|---|
| reuse (default) | Return cached result if signature matches |
| overwrite | Always execute, update cache with new result |
| readonly | Use cache but don't persist new results (sandbox mode) |
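The three modes differ only in when the cache is read and when it is written. A minimal sketch, assuming a plain dict stands in for the cache and a hypothetical `execute` helper stands in for the run machinery:

```python
from typing import Callable


def execute(signature: str, fn: Callable[[], object], cache: dict, mode: str = "reuse"):
    """Run fn under one of three cache modes; returns (result, was_cache_hit)."""
    if mode not in {"reuse", "overwrite", "readonly"}:
        raise ValueError(f"unknown cache mode: {mode}")
    # reuse and readonly both consult the cache; overwrite always executes.
    if mode != "overwrite" and signature in cache:
        return cache[signature], True
    result = fn()
    # readonly never persists new results, so the cache stays untouched.
    if mode != "readonly":
        cache[signature] = result
    return result, False


cache: dict = {}
first = execute("sig-1", lambda: "computed", cache)            # miss, stored
second = execute("sig-1", lambda: "recomputed", cache)         # hit, fn not called
sandbox = execute("sig-2", lambda: "scratch", cache, mode="readonly")  # not stored
```

After these calls the cache holds an entry for `sig-1` only, because the readonly call executed without persisting anything.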
For runnable examples and migration guidance from legacy run-policy kwargs, see Caching & Hydration.
Ghost Mode¶
Consist enables "Ghost Mode" — the ability to delete intermediate files while preserving provenance and recoverability. Content hashes stored in the database let Consist verify cached results, identify which upstream run produced a missing artifact, and re-execute only the steps needed to regenerate it. For the full guide including recovery patterns and best practices, see Caching & Hydration.
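The "re-execute only the steps needed" part can be pictured as a walk over recorded lineage. The sketch below covers only that planning step and leaves out hash verification. The `Step` record, the `producers` mapping, and `plan_recovery` are hypothetical names for illustration, not Consist's recovery API.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """A recorded run: which artifacts it reads and which one it writes."""
    name: str
    inputs: list[str]
    output: str


def plan_recovery(target: str, producers: dict[str, Step], on_disk: set[str]) -> list[str]:
    """Return the step names to re-run, in order, to regenerate `target`."""
    plan: list[str] = []

    def visit(artifact: str) -> None:
        if artifact in on_disk:
            return  # still present, nothing to regenerate
        step = producers[artifact]  # lineage says which run produced it
        for upstream in step.inputs:
            visit(upstream)  # regenerate missing inputs first
        if step.name not in plan:
            plan.append(step.name)

    visit(target)
    return plan


producers = {
    "clean.parquet": Step("clean", ["raw.csv"], "clean.parquet"),
    "sim.parquet": Step("simulate", ["clean.parquet"], "sim.parquet"),
    "report.html": Step("report", ["sim.parquet"], "report.html"),
}

# The intermediate clean.parquet was deleted, but sim.parquet survived.
partial_plan = plan_recovery("report.html", producers, {"raw.csv", "sim.parquet"})
# Every derived file was deleted; only the raw input remains.
full_plan = plan_recovery("report.html", producers, {"raw.csv"})
```

In the first case only `report` is replanned, because the surviving `sim.parquet` cuts off the walk before it reaches the deleted intermediate.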
Data Model¶
Consist uses two core entities, Run and Artifact, connected many-to-many through a link table:
| Entity | Purpose |
|---|---|
| Run | Execution context: model name, config, timestamps, status, parent linkage |
| Artifact | File metadata: path (as URI), content hash, driver, schema reference |
| RunArtifactLink | Connects runs to their input and output artifacts with direction metadata |
Key fields for workflow tracking:
- Run.parent_run_id — Links scenario steps to their parent scenario; used as the scenario identifier in views (see consist_scenario_id)
- Run.year — Simulation year for time-series workflows
- Run.tags — String labels for filtering (stored as JSON array)
- Artifact.hash — SHA256 content hash for deduplication and verification
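As a rough picture of how the three records relate, the sketch below models them as plain dataclasses. Only `parent_run_id`, `year`, `tags`, and `hash` come from the field list above. The other field names, the `"input"`/`"output"` direction values, and the `consumers_of` helper are assumptions, and the real models may be shaped differently.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Run:
    id: str
    model_name: str
    parent_run_id: Optional[str] = None  # links a step to its scenario
    year: Optional[int] = None
    tags: list[str] = field(default_factory=list)


@dataclass
class Artifact:
    id: str
    uri: str
    hash: str  # SHA256 content hash
    driver: str


@dataclass
class RunArtifactLink:
    run_id: str
    artifact_id: str
    direction: str  # assumed values: "input" or "output"


def consumers_of(artifact_id: str, links: list[RunArtifactLink]) -> list[str]:
    """IDs of the runs that read the given artifact."""
    return [
        link.run_id
        for link in links
        if link.artifact_id == artifact_id and link.direction == "input"
    ]


links = [
    RunArtifactLink("run-clean", "art-1", "output"),
    RunArtifactLink("run-sim-2030", "art-1", "input"),
    RunArtifactLink("run-sim-2040", "art-1", "input"),
]
readers = consumers_of("art-1", links)
```

One artifact is produced once and read by two runs here, which is the many-to-many case the link table exists for.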
Consist provides three strategies for tracking configuration: config= (hashed into the cache key, stored as a JSON snapshot), facet= (queryable in DuckDB, does not affect the cache), and identity_inputs= (large external files hashed into the cache key but not stored as content). For full usage, guardrails, and query examples, see Config, Facets, and Identity Inputs.
Dual-Write Persistence¶
Consist maintains two synchronized records for resilience:
graph TD
Tracker[Tracker] --> Context[Active Run Context]
Context --> Memo[In-Memory Model]
Memo --> Snapshot[consist.json Snapshot]
Memo --> DB[(DuckDB Database)]
Snapshot --- Source[Source of Truth]
DB --- Query[Query Engine]
Write order (safety guarantee):
- Update in-memory model
- Flush to consist.json (atomic write) ← Source of truth
- Attempt DB sync (catch errors, log warning, never crash)
JSON snapshots (consist.json per run): Portable, human-readable, version-controllable. Each run directory contains a complete record that survives database corruption.
DuckDB database: Enables fast queries across runs, artifacts, and lineage. Can be rebuilt from JSON snapshots if needed. Handles concurrent access with retry logic.
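The write order can be sketched with the standard library alone. This is an illustration of the pattern, not Consist's code: `atomic_write_json`, `persist`, and the `db_sync` callable are hypothetical, and the database is reduced to an arbitrary function that may raise.

```python
import json
import logging
import os
import tempfile
from typing import Callable

logger = logging.getLogger("dual_write_sketch")


def atomic_write_json(path: str, record: dict) -> None:
    """Write to a temp file in the target directory, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(record, handle, indent=2, sort_keys=True)
            handle.flush()
            os.fsync(handle.fileno())
        # Readers see either the old file or the new one, never a partial write.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def persist(record: dict, snapshot_path: str, db_sync: Callable[[dict], None]) -> bool:
    """Snapshot first (source of truth), then a best-effort database sync."""
    atomic_write_json(snapshot_path, record)
    try:
        db_sync(record)
        return True
    except Exception as exc:
        # The snapshot is already on disk, so a DB failure only costs a warning.
        logger.warning("DB sync failed; snapshot is intact: %s", exc)
        return False
```

Because the snapshot is written before the sync is attempted, a failing `db_sync` still leaves a complete record on disk from which the database could later be rebuilt.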
Path Virtualization¶
Absolute paths break portability. Consist stores relative URIs and resolves them at runtime. For a focused guide, see Mounts & Portability.
User logs: /mnt/data/land_use.csv
↓
Tracker detects mount: mounts={"inputs": "/mnt/data"}
↓
Stored URI: inputs://land_use.csv
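The mapping in the diagram above amounts to a prefix swap in each direction. A minimal sketch, assuming POSIX-style paths and hypothetical `virtualize` and `resolve` helpers (Consist's own resolution logic may handle more cases):

```python
from pathlib import PurePosixPath


def virtualize(path: str, mounts: dict[str, str]) -> str:
    """Replace the longest matching mount root with its scheme."""
    candidate = PurePosixPath(path)
    # Longest root first so nested mounts resolve to the most specific one.
    for name, root in sorted(mounts.items(), key=lambda kv: len(kv[1]), reverse=True):
        try:
            relative = candidate.relative_to(root)
        except ValueError:
            continue  # path is not under this root
        return f"{name}://{relative.as_posix()}"
    return path  # no mount matched; leave the path untouched


def resolve(uri: str, mounts: dict[str, str]) -> str:
    """Inverse of virtualize: expand a mount URI using this machine's roots."""
    scheme, sep, rest = uri.partition("://")
    if not sep or scheme not in mounts:
        return uri
    return str(PurePosixPath(mounts[scheme]) / rest)


stored = virtualize("/mnt/data/land_use.csv", {"inputs": "/mnt/data"})
# A different machine maps the same mount name to a different root.
elsewhere = resolve(stored, {"inputs": "/srv/shared"})
```

The stored value is `inputs://land_use.csv`, and on the second machine it resolves to `/srv/shared/land_use.csv` without the recorded URI changing.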
Run-local paths: For run-specific output directories, Consist typically stores
paths relative to the run directory (for example ./outputs/...). Historical
resolution also accepts workspace://... aliases and uses metadata stored in
Run.meta["_physical_run_dir"].
Data Virtualization¶
Consist creates hybrid views that transparently union ingested ("hot") DuckDB rows with raw files ("cold") on disk, so you can query across runs without loading everything into memory. Register a SQLModel schema with Tracker(schemas=[MySchema]) to activate a view that includes provenance columns (consist_run_id, consist_year, consist_scenario_id, consist_artifact_id) for filtering and grouping.
For N-dimensional data, MatrixViewFactory currently loads tracked Zarr
stores as lazy xarray Datasets concatenated along a run_id dimension with
year/iteration as coordinates. NetCDF metadata can still be cataloged
elsewhere in Consist, but Tracker.load_matrix(...) is currently Zarr-focused.
For schemas, ingestion, and query examples, see DLT Loader Guide and Schema Export.
Container Integration¶
Containers are treated as pure functions. The cache signature extends the
standard formula with container-specific components including image digest,
command, environment hash, backend, working directory, declared host volumes,
and backend-specific extra args.
Current caveat: because the container config includes resolved host volume
paths, cross-machine cache reuse requires those host roots to stay stable too.
Changing /shared/team_inputs to /mnt/nfs/team_inputs changes the signature
even if the in-container mount point remains /inputs.
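The caveat is easier to see with the components written out. The sketch below hashes the container-specific pieces listed in this section. The function name, the dict layout, and the example values are assumptions for illustration, not the actual signature formula.

```python
import hashlib
import json


def container_signature(
    image_digest: str,
    command: list[str],
    env: dict[str, str],
    backend: str,
    workdir: str,
    volumes: dict[str, str],
    extra_args: list[str],
) -> str:
    """Hash the container-specific components (hypothetical payload layout)."""
    env_hash = hashlib.sha256(
        json.dumps(env, sort_keys=True).encode("utf-8")
    ).hexdigest()
    payload = json.dumps(
        {
            "image_digest": image_digest,
            "command": command,
            "env_hash": env_hash,
            "backend": backend,
            "workdir": workdir,
            "volumes": volumes,  # resolved host path -> in-container mount point
            "extra_args": extra_args,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


base = dict(
    image_digest="sha256:0000",  # placeholder digest
    command=["python", "run.py"],
    env={"MODE": "prod"},
    backend="docker",
    workdir="/work",
    extra_args=[],
)
sig_shared = container_signature(volumes={"/shared/team_inputs": "/inputs"}, **base)
sig_nfs = container_signature(volumes={"/mnt/nfs/team_inputs": "/inputs"}, **base)
```

The two signatures differ even though both mount at `/inputs`, because the resolved host path is part of the hashed payload.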
Supported backends: Docker, Singularity/Apptainer. For usage details, see Container Integration Guide.
Event Hooks¶
Consist exposes on_run_complete and on_run_failed callbacks for external integrations (notifications, OpenLineage, etc.). Hook failures are logged but never crash the run. See API Reference for callback signatures and registration.
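The "logged but never crash" guarantee comes down to isolating each callback in its own try/except. A minimal sketch follows. The `HookRegistry` class and its `register` and `fire` methods are hypothetical and are not Consist's registration API; only the two event names come from the text above.

```python
import logging
from typing import Callable

logger = logging.getLogger("hooks_sketch")


class HookRegistry:
    """Hypothetical registry illustrating failure isolation for callbacks."""

    def __init__(self) -> None:
        self._hooks: dict[str, list[Callable[[dict], None]]] = {
            "on_run_complete": [],
            "on_run_failed": [],
        }

    def register(self, event: str, callback: Callable[[dict], None]) -> None:
        self._hooks[event].append(callback)

    def fire(self, event: str, run: dict) -> int:
        """Call every hook for the event; return how many succeeded."""
        succeeded = 0
        for callback in self._hooks[event]:
            try:
                callback(run)
                succeeded += 1
            except Exception:
                # A broken integration must not take the run down with it.
                logger.exception("hook failed for event %s", event)
        return succeeded


def broken_notifier(run: dict) -> None:
    raise RuntimeError("webhook unreachable")


registry = HookRegistry()
delivered: list[dict] = []
registry.register("on_run_complete", broken_notifier)
registry.register("on_run_complete", delivered.append)
ok_count = registry.fire("on_run_complete", {"id": "run-1", "status": "completed"})
```

The failing notifier is logged and skipped, and the second hook still receives the run record.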
Context Stack¶
Consist maintains a context-local stack of active trackers, allowing nested contexts and implicit tracker resolution:
import consist
from consist import use_tracker

# Assumes `tracker` is an existing Tracker and `df` is a DataFrame to log.
with use_tracker(tracker):
    with consist.scenario("baseline") as sc:
        # consist.log_artifact() finds the active tracker automatically
        with sc.step(name="simulate"):
            consist.log_dataframe(df, key="results")  # No tracker= needed
This enables clean APIs where most functions don't require explicit tracker parameters.
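One common way to build a context-local stack like this in Python is `contextvars`. The sketch below shows the pattern in isolation. It is an assumption about the general technique, not a description of Consist's internals, and `use_tracker_sketch` and `current_tracker` are hypothetical names.

```python
import contextvars
from contextlib import contextmanager

# An immutable tuple, so each context sees its own stack without shared mutation.
_tracker_stack = contextvars.ContextVar("tracker_stack", default=())


@contextmanager
def use_tracker_sketch(tracker):
    """Push a tracker for the duration of the with-block, then restore."""
    token = _tracker_stack.set(_tracker_stack.get() + (tracker,))
    try:
        yield tracker
    finally:
        _tracker_stack.reset(token)  # pops back to the previous stack


def current_tracker():
    """Resolve the innermost active tracker, or fail loudly if none is active."""
    stack = _tracker_stack.get()
    if not stack:
        raise RuntimeError("no active tracker; wrap the call in use_tracker_sketch(...)")
    return stack[-1]


resolved = []
with use_tracker_sketch("outer"):
    resolved.append(current_tracker())
    with use_tracker_sketch("inner"):
        resolved.append(current_tracker())  # nested context wins
    resolved.append(current_tracker())  # outer is restored on exit
```

Using a `ContextVar` rather than a module-level list keeps the stack separate per thread and per asyncio task, which is what "context-local" implies.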