# Core Concepts Overview
This section establishes the mental model for Consist before covering API details.
## Core Abstractions
- **Artifact**: A file with provenance metadata—its path, format, content hash (SHA256), producing run, and ingestion status.
- **Run**: A single execution with tracked inputs, configuration, outputs, status, and timing. Each run has a signature computed from code, config, and inputs that enables cache reuse.
- **Scenario**: A parent run grouping related child runs for multi-variant studies or iterative workflows.
- **Coupler**: A helper that passes artifacts between steps in a scenario, linking lineage automatically.
## How Caching Works
Consist computes a signature from code version, config, and input artifact hashes:
```mermaid
graph LR
    Code[Code Version] --> Hash[SHA256 Signature]
    Config[Configuration] --> Hash
    Inputs[Input Artifacts] --> Hash
    Hash --> Lookup{Cache Lookup}
    Lookup -->|Hit| Return[Return Cached Outputs]
    Lookup -->|Miss| Execute[Execute & Record New Run]
```
Same signature → return cached outputs. Different signature → execute and record new lineage.
On cache hits, Consist returns output artifact metadata without copying files. Load or hydrate outputs when you need bytes.
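The signature computation can be sketched in plain Python. This is an illustrative reconstruction of the idea, not Consist's actual implementation; the function and field names here are hypothetical:

```python
import hashlib
import json

def compute_signature(code_version: str, config: dict, input_hashes: list[str]) -> str:
    """Illustrative only: fold the three cache-identity components
    into one deterministic SHA-256 fingerprint."""
    payload = json.dumps(
        {
            "code": code_version,
            "config": config,                 # sort_keys makes dict ordering irrelevant
            "inputs": sorted(input_hashes),   # input ordering is irrelevant too
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()

sig_a = compute_signature("abc123", {"elasticity": 0.5}, ["h1", "h2"])
sig_b = compute_signature("abc123", {"elasticity": 0.5}, ["h2", "h1"])
sig_c = compute_signature("abc123", {"elasticity": 0.7}, ["h1", "h2"])
# sig_a == sig_b (same cache identity); sig_a != sig_c (config changed)
```

Canonicalizing the payload (sorted keys, sorted input hashes) is what makes the fingerprint stable: only a real change to code, config, or inputs produces a new signature.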
Example: In a parameter sweep testing 20 demand elasticity values, the first run executes preprocessing and the demand model. Runs 2–20 cache-hit on preprocessing (same inputs, same code) but cache-miss on the demand model (different elasticity). Consist skips 19 preprocessing executions.
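The reuse pattern in that sweep can be modeled with a plain dict standing in for the cache (a toy model of the behavior, not Consist's API):

```python
import hashlib

cache: dict[str, str] = {}
executions = {"preprocess": 0, "demand_model": 0}

def run_step(name: str, code: str, config: dict, inputs: tuple) -> str:
    # Signature over code version, config, and inputs (toy version).
    sig = hashlib.sha256(repr((code, sorted(config.items()), inputs)).encode()).hexdigest()
    if sig in cache:              # cache hit: return the prior output
        return cache[sig]
    executions[name] += 1         # cache miss: execute and record
    cache[sig] = f"{name}-output-{sig[:8]}"
    return cache[sig]

for elasticity in [round(0.1 * i, 1) for i in range(1, 21)]:
    cleaned = run_step("preprocess", "v1", {}, ("raw_trips",))
    run_step("demand_model", "v1", {"elasticity": elasticity}, (cleaned,))

# executions == {"preprocess": 1, "demand_model": 20}
```

Preprocessing sees the same signature on every iteration, so it runs once; the demand model's config differs each time, so all 20 variants execute.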
## Provenance & Lineage
**Provenance**: The complete history of a result—code version, configuration, input data, and compute environment. Consist records provenance for every tracked run.

**Lineage**: The dependency chain showing which run created an artifact, which inputs that run consumed, and which runs produced those inputs.
```mermaid
graph TD
    Raw[Raw Data] --> Step1[Run: Clean]
    Config1[Threshold=0.5] --> Step1
    Step1 --> Art1[Artifact: Cleaned]
    Art1 --> Step2[Run: Analyze]
    Config2[GroupByKey=category] --> Step2
    Step2 --> Art2[Artifact: Summary]
```
Provenance answers three questions: Can I re-run this exactly? (reproducibility), Which config produced this figure? (accountability), and Why did this change? (debugging).
Example: You published a land-use forecast. A reviewer asks which scenario produced Figure 3. Run `consist show <run_id>` to see the code version (commit SHA), config parameters, input parcel data, and execution timestamp.
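Conceptually, a lineage query is an upward walk over the run/artifact graph in the diagram above. A minimal sketch with plain dicts standing in for Consist's store (the record names are illustrative, not the library's schema):

```python
# Toy lineage records matching the diagram: each artifact knows its
# producing run; each run knows its input artifacts.
produced_by = {"cleaned": "run_clean", "summary": "run_analyze"}
run_inputs = {"run_clean": ["raw_data"], "run_analyze": ["cleaned"]}

def upstream(artifact: str) -> list[str]:
    """Walk the lineage chain upward from an artifact to its root inputs."""
    run = produced_by.get(artifact)
    if run is None:               # root input: no producing run recorded
        return [artifact]
    chain = [artifact]
    for inp in run_inputs[run]:
        chain.extend(upstream(inp))
    return chain

# upstream("summary") → ["summary", "cleaned", "raw_data"]
```

The same walk, run in reverse, answers impact questions: which downstream artifacts must be regenerated if an input changes.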
## Canonical Terms (Quick Reference)
This page keeps one-line canonical definitions. Detailed behavior and policies live in the linked specialized pages.
| Term | Definition | Deep dive |
|---|---|---|
| Signature | Fingerprint of code + config + inputs used for cache lookup. | Caching & Hydration |
| Facet | Queryable metadata subset used for filtering runs (not cache identity). | Config Management |
| Cache hit / miss | Hit reuses prior completed outputs; miss executes and records new lineage. | Caching & Hydration |
| Hydration | Recover artifact metadata/paths without copying bytes. | Caching & Hydration |
| Materialization | Ensure bytes exist in a target location (filesystem or DB path). | Data Materialization |
| Cold / hot data | Cold stays file-based; hot is ingested into DuckDB for SQL queries. | Data Materialization |
| Hybrid view | SQL view that combines ingested rows with file-backed rows. | Data Materialization |
| Ghost mode | Recovery path when files are missing but provenance/ingestion exists. | Caching & Hydration |
| Coupler | Scenario helper for passing step outputs to downstream inputs. | Decorators & Metadata |
For a full term index, see the Glossary.
## How Inputs and Outputs Are Treated
Inputs are files or values that influence computation. File inputs are hashed by content or metadata depending on the hashing strategy (full vs fast).
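The trade-off between the two strategies can be sketched as follows; this is an illustrative comparison, not Consist's exact hashing code:

```python
import hashlib
from pathlib import Path

def full_hash(path: Path) -> str:
    """Content hash: reads every byte; robust, but slow for large files."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB chunks
            h.update(chunk)
    return h.hexdigest()

def fast_hash(path: Path) -> str:
    """Metadata hash: size + mtime; cheap, but can miss an in-place edit
    that preserves both size and timestamp."""
    st = path.stat()
    return hashlib.sha256(f"{st.st_size}:{st.st_mtime_ns}".encode()).hexdigest()
```

Full hashing ties cache identity to the bytes themselves; fast hashing trades that guarantee for constant-time lookups on large inputs.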
Outputs are named artifacts registered via a function's `dict[str, Path]` return value together with declared `outputs=[...]`, or explicitly via `output_paths`. Consist stores their paths and provenance metadata for lookup and querying.
### Recommended pattern: file-based I/O
The recommended pattern is for functions to accept file paths as inputs, write output files, and return a `dict[str, Path]` mapping artifact keys to output paths. Declare the matching artifact keys in `outputs=[...]` when calling `tracker.run(...)`; Consist then logs those returned paths as artifacts.
```python
import pandas as pd
from pathlib import Path

def summarize_trips(trips_path: Path) -> dict[str, Path]:
    df = pd.read_parquet(trips_path)
    out = Path("./summary.parquet")
    df.groupby("mode")["distance_miles"].mean().to_frame().to_parquet(out)
    return {"summary": out}

result = tracker.run(
    fn=summarize_trips,
    inputs={"trips_path": trips_artifact},  # path resolved from artifact; hashed for cache identity
    outputs=["summary"],
)
```
The function is a plain Python callable: testable without a tracker, honest about its I/O, and readable without framework knowledge.
### When to use `output_paths`
If a function writes files but returns `None` (e.g., a legacy tool or subprocess wrapper), register outputs explicitly with `output_paths`:
```python
result = tracker.run(
    fn=run_legacy_model,
    inputs={"config": config_artifact},
    output_paths={"results": Path("./model_output.csv")},
)
```
### Auto-loading inputs as DataFrames
If you prefer to receive a DataFrame directly instead of a path, use `input_binding="loaded"` and Consist will load the artifact for you before calling the function. This is convenient for short scripts but hides the I/O boundary; prefer `input_binding="paths"` for pipelines where the function boundary matters.
## Next Steps
- Config Management — Understand the config vs. facet distinction and when to use each
- Caching & Hydration — Caching patterns and data recovery strategies
- Data Materialization — When to ingest data and use hybrid views
- Grouped Views — Build one view across schema-matched artifacts
- Decorators & Metadata — Defaults, templates, and schema introspection