Glossary¶
This page defines key terms used in Consist documentation. For a conceptual introduction with dependency ordering, see Core Concepts Overview.
Artifact¶
A file (CSV, Parquet, HDF5, etc.) produced or used by a run, with metadata attached to track its origin and integrity. Think of it as a "named, tracked file": Consist remembers who created it and can verify it hasn't been corrupted.
Artifacts record:

- The file path and format (CSV, Parquet, HDF5, etc.)
- Which run created or used it (for inputs)
- A content hash (SHA256) for integrity checking
- Optional ingestion status (whether it was stored in DuckDB)
Research example: When you publish results, each output file is an Artifact. You can trace any file back to the exact code version and config that produced it using the CLI. See Usage Guide for more.
See also: Run, Provenance, Ingestion
Cache Hit¶
When Consist skips execution because it finds a previous run with an identical signature. The cached results (artifact metadata and optionally file copies) are returned instantly.
Example: You run a function with config {"threshold": 0.5} on input data.csv, then re-run with the exact same config and input. The second execution is a cache hit—no computation happens.
Research example: In a parameter sweep, Consist skips preprocessing steps that haven't changed, saving hours of compute time. See the Usage Guide motivation section for realistic time-saving scenarios.
Opposite: Cache miss (run must execute)
See also: Signature, Cache Miss
Cache Miss¶
When Consist cannot find a matching cached result, so the function must execute. The outputs are recorded as a new run and new artifacts are created.
See also: Cache Hit, Signature
Canonical Hashing¶
Converting configuration data (dicts, YAML, etc.) into a single, consistent fingerprint, regardless of field order or how numbers are formatted. This ensures {"a": 1, "b": 2} and {"b": 2, "a": 1} produce the same hash, so Consist treats them as identical configurations.
Why it matters: Without canonical hashing, the same config in different orders would produce different cache keys, breaking reproducibility.
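The idea can be sketched with a sorted-keys JSON dump; this is a minimal illustration, not Consist's actual implementation (which also normalizes things like number formatting):

```python
import hashlib
import json

def canonical_hash(config: dict) -> str:
    """Hash a config dict so that key order and whitespace don't matter."""
    # Sorting keys and fixing separators yields one canonical byte string
    # for every equivalent dict, so equal configs hash identically.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

print(canonical_hash({"a": 1, "b": 2}) == canonical_hash({"b": 2, "a": 1}))  # True
```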
See also: Signature, Config
Config¶
A dictionary of parameters that affects the computation and is included in the cache signature. Example:
```python
import consist
from consist import use_tracker

with use_tracker(tracker):
    consist.run(
        fn=my_model,
        config={"year": 2030, "scenario": "baseline"},  # Part of signature
        inputs={...},
        outputs=[...],
    )
```
Changing config invalidates cache and triggers re-execution. Config is hashed (not stored as-is) to allow large nested dictionaries.
Research example: If a colleague changes a mode choice coefficient in a 50MB parameter file, Consist detects the change and automatically knows to re-run the affected demand model.
See also: Facet, Signature
Coupler¶
A Consist pattern for orchestrating multi-step workflows. A coupler passes artifacts from one step's outputs to the next step's inputs, linking them through scenario trees. Useful for complex loops and iterative computations.
Example: In a transportation model with feedback loops (trip distribution → mode choice → assignment → congestion update), couplers link each step's outputs to the next step's inputs.
See also: Scenario, Run, Trace
DLT (Data Load Tool)¶
An optional Python library for loading data into data warehouses. Consist integrates with DLT to materialize artifacts into DuckDB with provenance columns (run_id, artifact_id, etc.). Using DLT is optional; you can ingest manually or load files directly.
See also: Ingestion, Materialization
Derivable References¶
The condition where Consist can safely infer relationships between tables during maintenance operations without guessing.
For cache pruning, references are derivable when run-link tables expose both run_id and content_hash, and unscoped cache tables expose content_hash.
When references are not derivable, Consist uses safe skip/no-op behavior instead of deleting data based on uncertain joins.
See also: Run-Link Table, Unscoped Cache Table, Skip/No-op Behavior
Facet¶
A small, queryable subset of configuration (e.g., {"year": 2030, "scenario": "baseline"}) that's indexed in DuckDB so you can filter runs. Use facets when you want to ask "show me all runs where year=2030" without storing the entire 50MB config file.
Unlike identity config (which is hashed into the cache key), facets are stored queryably in the database for filtering and analysis.
Use case: Your config is a 10 MB YAML file (too large to store). You extract {"year": 2030, "parking_cost": 5.0} as facets and query runs by year/parking_cost.
Example:
```python
import consist
from consist import use_tracker

with use_tracker(tracker):
    consist.run(
        fn=my_model,
        config={"huge_model_config": ...},  # Not queryable directly (too large)
        facet={"year": 2030, "scenario": "baseline"},  # Indexed and queryable
        inputs={...},
        outputs=[...],
    )

# Later: query all 2030 runs
df = tracker.find_runs(facet_year=2030)
```
Research example: In a multi-year study, you can set facet={"year": 2030, "scenario": "transit-friendly"}. This allows you to instantly find all 2030 sensitivity tests without searching through directories.
See also: Config, Signature
Ghost Mode¶
The ability to recover artifacts that exist only in the provenance database, not as physical files. If an artifact was ingested, Consist can recover the data from DuckDB even if the original file was deleted. See Architecture: Ghost Mode for details.
See also: Ingestion, Materialization
Hydration¶
Recovering the metadata and location information about a previous run's output without copying the file bytes. On a cache hit, Consist "hydrates" the output artifact so you know where it came from and can access it, but doesn't necessarily copy it to your current run directory.
Hydration ≠ copying files. A hydrated artifact has provenance metadata but may not have file bytes copied to the new run's directory. By default, Consist recovers the information but doesn't copy files (saves disk space). You opt in to copying files when needed.
See also: Materialization, Cache Hit
Identity Config¶
The full set of configuration parameters that affect a run's cache signature; if identity config changes, the run must re-execute. Unlike facets (which are just for querying), identity config is hashed into the cache key to ensure cache invalidation when parameters change.
Example: If your config dict contains {"year": 2030, "mode_choice_coefficient": 0.5}, both values are hashed into the signature. Changing either value invalidates cache.
See also: Config, Facet, Signature
Ingestion¶
Loading artifact data into DuckDB for SQL-native analysis. Optional; you can use Consist without ingesting (just track files).
Use case: You want to query 50 Parquet files across 50 runs in SQL without loading them all into memory.
Process:
1. Artifact is created/logged by a run
2. tracker.ingest(artifact, data=df) stores the data in DuckDB
3. Later: query in SQL across all ingested data
See also: Materialization, Ghost Mode, Hybrid View
Lineage¶
The complete dependency chain showing where a result came from. Lineage tracks: which run created an artifact, which inputs that run used, which runs created those inputs, etc.
Example: consist lineage traffic_volumes shows:
```
traffic_volumes (artifact)
├── created by: traffic_simulation run
│   ├── input: assigned_trips
│   │   └── created by: assignment run
│   │       └── input: trip_tables
│   │           └── created by: mode_choice run
│   │               └── ...
```
See also: Provenance, Run, Artifact
Merkle DAG¶
A directed graph of computations in which each step's inputs are linked, by hash, to the outputs of previous steps, creating a tamper-evident record of data lineage. Like a railroad consist (the specific order of locomotives and cars), each simulation year depends on the previous year's output.
Why it matters: This structure enables Consist to detect when cached results are valid (all upstream inputs haven't changed) and to recover missing data if needed.
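The chaining can be sketched in a few lines; the names here are illustrative, not Consist's API. The key property is that each step's fingerprint folds in the fingerprints of its upstream outputs, so any upstream change ripples downstream:

```python
import hashlib

def step_hash(code_version: str, config: str, input_hashes: list[str]) -> str:
    """Fingerprint one workflow step; inputs are hashes of upstream outputs."""
    h = hashlib.sha256()
    # Sorting input hashes keeps the fingerprint independent of input order.
    for part in [code_version, config, *sorted(input_hashes)]:
        h.update(part.encode("utf-8"))
    return h.hexdigest()

# Year 2031 depends on year 2030's fingerprint, so editing the 2030
# config changes every downstream year's hash as well.
y2030 = step_hash("abc123", '{"year": 2030}', [])
y2031 = step_hash("abc123", '{"year": 2031}', [y2030])
```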
See also: Signature, Lineage, Artifact
Materialization¶
Saving the actual bytes of a data file into DuckDB (the provenance database) so it's recoverable even if the original file gets deleted. If you ingest a 10GB result file, DuckDB stores a copy so you can retrieve it later without the original file.
Materialization = copying bytes into the database
Hydration = recovering metadata without bytes
See also: Hydration, Ingestion, Ghost Mode
Provenance¶
Complete history of where a result came from: code version, configuration, input data, and compute environment. Consist records provenance automatically for every run.
Why it matters:

- Reproducibility ("Can I re-run this exactly?")
- Accountability ("Which config made this figure?")
- Debugging ("Why did this change?")
Research example: If a policy maker asks which assumptions led to a specific forecast, you can use provenance to identify the exact code version, zoning policy, and parcel data used. You can then reproduce it exactly or change one parameter to show the impact.
See also: Artifact, Lineage, Signature
Run¶
A single execution of a tracked function or workflow step. A run records:

- Input artifacts and configuration
- Execution status (completed, failed, cached)
- Output artifacts
- Timing (start/end times)
- Tags and metadata
Example:
```python
import consist
from pathlib import Path
from consist import use_tracker

with use_tracker(tracker):
    result = consist.run(
        fn=prepare_load_shapes,
        inputs={"raw_path": Path("hourly_demand.csv")},
        config={"peak_shave_threshold": 0.95},
        outputs=["processed_load"],
    )
```
This creates a Run with one input artifact, one config dict, and one output artifact.
Research example: In grid modeling, each annual simulation is a Run. You can query "what was the total compute time across all high-load scenarios?" or trace a reliability violation back to original forecast assumptions.
See also: Artifact, Scenario
Run-Link Table¶
A global_tables.* table classified as run_link because it includes run_id (and not consist_run_id).
Run-link tables are used for run-associated metadata and, when they also include content_hash, can serve as derivable reference sources for safe unscoped cache pruning.
See also: Run-Scoped Table, Unscoped Cache Table, Derivable References
Run-Scoped Table¶
A global_tables.* table classified as run_scoped because it includes consist_run_id.
Run-scoped rows are directly attributable to specific runs and are deleted by consist db purge --delete-ingested-data for selected run IDs.
See also: Run-Link Table, Unscoped Cache Table
Scenario¶
A grouping of related runs. Scenarios are useful for organizing multi-variant studies or iterative workflows.
Example: "baseline_2030" scenario contains 5 related runs:

- Year 2030, baseline policy, iteration 0
- Year 2030, baseline policy, iteration 1
- Year 2030, baseline policy, iteration 2
- ...
Important: Consist uses "scenario" differently from policy modeling jargon. In Consist, a scenario is a parent run grouping; in transportation modeling, "baseline scenario" and "growth scenario" are policy variants. Don't confuse the two.
See also: Run, Coupler, Trace
Skip/No-op Behavior¶
Conservative maintenance behavior where Consist intentionally performs no deletion or merge action when safety preconditions are not met.
Examples:
- consist db purge --prune-cache becomes a no-op for unscoped cache pruning when references are not derivable, or when --delete-ingested-data is not enabled.
- consist db merge --conflict skip skips incompatible global tables while still merging compatible data.
This behavior prioritizes correctness and data safety over aggressive cleanup.
See also: Derivable References, Unscoped Cache Table
Signature¶
A unique fingerprint of a run created by hashing together your code version, configuration parameters, and input data. Consist compares this fingerprint across executions to detect when the same code, config, and inputs have been run before.
How it works:
1. Function code is hashed (git commit SHA + modified files)
2. Config dict is hashed deterministically (canonical hashing)
3. Input file hashes are computed
4. All three are combined: signature = SHA256(code + config + inputs)
5. This signature is the cache key
Why: If you re-run with the same signature, Consist knows the result will be the same, so it returns the cached version instantly. Identical signatures = identical outputs (assuming deterministic functions).
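The recipe above can be sketched as follows. This is an illustration of the steps, not Consist's exact code (a git SHA would stand in for code_version, and config_hash would come from canonical hashing):

```python
import hashlib
from pathlib import Path

def compute_signature(code_version: str, config_hash: str,
                      input_paths: list[Path]) -> str:
    """Combine code, config, and input hashes into one cache key."""
    # Hash each input file's bytes; sorting keeps the result order-independent.
    input_hashes = [
        hashlib.sha256(p.read_bytes()).hexdigest() for p in sorted(input_paths)
    ]
    combined = code_version + config_hash + "".join(input_hashes)
    return hashlib.sha256(combined.encode("utf-8")).hexdigest()
```

Re-running with unchanged code, config, and inputs reproduces the same signature, which is exactly what makes a cache hit detectable.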
See also: Cache Hit, Config, Artifact, Canonical Hashing
Trace¶
The execution path through a multi-step workflow, showing which runs were executed, which were cache hits, and what artifacts were passed between them.
See also: Scenario, Coupler, Lineage
Unscoped Cache Table¶
A global_tables.* table classified as unscoped_cache because it has neither consist_run_id nor run_id.
These tables often contain shared cache rows keyed by values like content_hash. They are preserved by default in purge/merge/export flows unless safe derivation rules allow targeted pruning/copy behavior.
For --prune-cache, correctness assumes content_hash equivalence across participating derivable run-link and unscoped cache tables.
See also: Run-Scoped Table, Run-Link Table, Derivable References
Virtualization (Data Virtualization)¶
Querying multiple artifacts as if they were a single table, without loading all data into memory. DuckDB handles data movement lazily.
Example: Query 50 Parquet files across 50 runs:
```sql
SELECT year, mode, COUNT(*) AS trips
FROM consist_view_trips
WHERE scenario IN ('baseline', 'high_growth')
GROUP BY year, mode
```
Consist creates a virtual SQL view that queries each file as needed, rather than loading everything at once.
See also: Hybrid View, Ingestion
Hybrid View¶
A SQL view that combines:

1. Hot data: ingested artifacts stored in DuckDB
2. Cold data: raw files (Parquet, CSV) queried on-the-fly
Hybrid views let you query across runs without requiring all data to be ingested, reducing storage overhead.
See also: Ingestion, Virtualization