# Data Virtualization & Materialization Strategy
This guide explains when to keep data on disk (virtualize) versus ingesting it into DuckDB (materialize).
Canonical one-line definitions for cold/hot data, hybrid views, and materialization live in Core Concepts Overview. This page focuses on decision criteria and implementation patterns.
**Recommended path.** For standard execution flow, use `consist.run(...)`, `consist.trace(...)`, or `consist.scenario(...)`. The ingestion snippets below use the low-level lifecycle APIs (`tracker.start_run(...)`, `tracker.log_artifact(...)`) because ingestion is attached to explicit artifact logging.
## Why the Distinction Matters
Simulation workflows produce many artifacts across runs, and comparing results requires querying across those runs. But ingesting every output roughly doubles your total storage footprint. The cold/hot distinction lets you choose: keep storage lean for artifacts you rarely query, and enable SQL analysis for artifacts you compare frequently.
The tradeoff: ingested data is queryable and recoverable; cold data requires keeping original files but adds no database overhead.
## Two Storage Modes
**Cold data:** file-backed artifacts with provenance metadata but no SQL query surface until ingested.

**Hot data:** artifacts ingested into DuckDB for SQL analytics, indexing, and optional recovery when source files are missing.
## When to Ingest
Ingest when you need:
- SQL-native analytics over many runs
- Schema tracking and exportable SQLModel stubs
- Recovery of tabular data if files are missing
- Cross-run comparisons via unified SQL views
Ingest key tabular outputs you compare across runs:
- Transportation: Trip tables, skim matrices, mode choice summaries
- Electric grid: Aggregated metrics (monthly averages, distribution-level summaries)
- Urban planning: Parcel or zoning summaries, development capacity estimates
Ingestion is appropriate for outputs where (1) you need SQL queries across runs, and (2) the storage cost is acceptable. Ingestion approximately doubles storage for that artifact: for a 500 MB output compared across 100 runs, it adds roughly 50 GB to the database.
## When to Stay Cold
Keep data as cold files when:
- Storage duplication is unacceptable (multi-GB rasters, video files, binary blobs)
- Data already lives in an external system
- No cross-run SQL analysis is needed
Cold data lives only on disk; provenance metadata is tracked but data is not queryable via SQL or included in hybrid views until ingested.
## Ingesting Artifacts
**Advanced lifecycle examples.** The following snippets intentionally show explicit run lifecycle calls for ingestion control. Keep regular compute steps on the recommended path (`run`/`trace`/`scenario`).
Install the optional DLT dependency:
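The snippet below assumes the dependency is the `dlt` package on PyPI; your project may instead expose an extra such as `consist[dlt]` (name assumed here).

```bash
pip install dlt
```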
Basic usage:
```python
from consist import Tracker

tracker = Tracker(run_dir="./runs", db_path="./provenance.duckdb")

with tracker.start_run("ingest_example", model="demo"):
    # Log the CSV as an artifact, then ingest its rows into DuckDB
    artifact = tracker.log_artifact("data/people.csv", key="people", driver="csv")
    tracker.ingest(artifact)
```
Ingest explicit data payloads (dicts, DataFrames, generators):
```python
with tracker.start_run("ingest_inline", model="demo"):
    artifact = tracker.log_artifact("inputs/people.json", key="people")
    # Pass the payload directly instead of reading the logged file
    tracker.ingest(artifact, data=[{"id": 1, "name": "A"}])
```
## Strict Schema Ingestion
Provide a SQLModel schema to validate data:
```python
from sqlmodel import SQLModel, Field

class Person(SQLModel, table=True):
    __tablename__ = "people"

    id: int = Field(primary_key=True)
    name: str

with tracker.start_run("strict_ingest", model="demo"):
    # Attach the schema at both logging and ingestion time
    artifact = tracker.log_artifact("data/people.csv", key="people", schema=Person)
    tracker.ingest(artifact, schema=Person)
```
Strict mode rejects incompatible data rather than silently coercing it.
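As a hedged illustration (the exact exception raised by strict mode is an assumption):

```python
with tracker.start_run("strict_ingest_bad", model="demo"):
    artifact = tracker.log_artifact("data/people.csv", key="people", schema=Person)
    # An id that cannot validate as int is rejected rather than coerced;
    # expect a validation error instead of a silent cast.
    tracker.ingest(artifact, data=[{"id": "not-an-int", "name": "B"}], schema=Person)
```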
## Provenance Columns
Ingested tables include system columns:
| Column | Purpose |
|---|---|
| `consist_run_id` | Run that created this row |
| `consist_artifact_id` | Source artifact |
| `consist_year`, `consist_iteration` | Scenario dimensions (optional) |
These enable joins across runs and filtering by scenario or year in hybrid views.
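For instance, a minimal sketch that queries the system columns directly with DuckDB (the table name `people` follows the earlier examples; actual table naming may differ):

```python
import duckdb

conn = duckdb.connect("./provenance.duckdb")
# Count ingested rows per run via the consist_run_id system column
rows = conn.execute(
    """
    SELECT consist_run_id, COUNT(*) AS n_rows
    FROM people
    GROUP BY consist_run_id
    """
).fetchall()
```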
## Hybrid Views
Hybrid views are SQL views that union hot data (ingested into DuckDB) with cold data (read from files on disk). Query both as one logical table without requiring all data in the database.
Register a SQLModel schema to activate a view:
```python
from sqlmodel import SQLModel, Field
from consist import Tracker

class Person(SQLModel, table=True):
    __tablename__ = "people"

    person_id: int = Field(primary_key=True)
    age: int

tracker = Tracker(
    run_dir="./runs",
    db_path="./provenance.duckdb",
    schemas=[Person],  # registering the schema activates the hybrid view
)

VPerson = tracker.views.Person
```
Once registered, you can query across all runs (both ingested and cold data):
```python
import consist
from sqlmodel import select, func

# Average age per scenario year across all runs, hot and cold alike
query = (
    select(VPerson.consist_year, func.avg(VPerson.age).label("avg_age"))
    .group_by(VPerson.consist_year)
)
rows = consist.run_query(query, tracker=tracker)
```
For schema/facet-driven cross-key analysis, use Grouped Views to create one view over many artifacts that share a schema identity instead of a single `concept_key`.
## Loading Behavior & Database Fallback

`consist.load(...)` reads from the filesystem. If files are missing and the artifact was ingested, database fallback recovers the data according to the configured policy:
| Policy | Behavior |
|---|---|
| `inputs-only` (default) | Recover inside the active run, for declared inputs only |
| `always` | Recover whenever the artifact is ingested |
| `never` | Disable recovery |
Example (a hedged sketch; the argument name used to select a policy is an assumption, not confirmed API):
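```python
import consist

# Hypothetical keyword: the policy values match the table above, but the
# exact argument name is an assumption.
df = consist.load(artifact, tracker=tracker, fallback="always")
```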
## Loader Kwargs

`consist.load(...)` validates loader kwargs and raises `ValueError` for unsupported options.
| Driver | Supported kwargs |
|---|---|
| `parquet` | `columns` |
| `csv` | `columns`, `delimiter`, `header` (alias: `sep`) |
| `json` | `orient`, `dtype`, `convert_axes`, `convert_dates`, `precise_float`, `date_unit`, `encoding`, `lines`, `compression`, `typ` |
| `h5_table` | `columns`, `where`, `start`, `stop` |
Use `consist.load_relation(...)` to manage the DuckDB Relation lifecycle explicitly, or `consist.load_df(...)` to get a DataFrame with automatic cleanup.
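A hedged usage sketch (the context-manager form and keyword names are assumptions):

```python
import consist

# Assumed context-manager form for explicit Relation lifecycle management
with consist.load_relation(artifact, tracker=tracker) as rel:
    preview = rel.limit(10).df()  # standard DuckDB Relation methods

# DataFrame convenience path; cleanup happens automatically
df = consist.load_df(artifact, tracker=tracker)
```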
## Schema Export Workflow
For curated schemas with explicit PK/FK constraints: ingest tabular data, run schema export to generate a SQLModel stub, then edit and register the stub for views. See schema-export.md for details.
## HDF5 Native Support Roadmap
Currently, `h5_table` uses a staging bridge (`pandas.read_hdf(...)` plus `conn.from_df(...)`). Native HDF5 support will replace this with DuckDB's HDF5 extension for direct queries.
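Roughly, the current bridge does the equivalent of the following (paths and keys are illustrative):

```python
import duckdb
import pandas as pd

conn = duckdb.connect()
# Stage the HDF5 table through pandas, then hand it to DuckDB in memory
df = pd.read_hdf("outputs/results.h5", key="trips")
rel = conn.from_df(df)
```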
Remaining work: switch `h5_table` to `hdf5_read(...)` once the extension is stable, validate extension version pinning, handle PyTables/object-dtype edge cases, update schema capture for the native relation path, and add performance testing.
Benefits: zero-copy streaming reads, lower memory usage, consistent SQL workflows across Parquet/CSV/HDF5.
## Decision Summary
| Need | Approach | When |
|---|---|---|
| Query results across runs in SQL | Ingest | Cross-run comparisons, storage cost acceptable |
| Minimize database size | Stay cold | Large files, no cross-run SQL analysis |
| Query ingested and raw files together | Hybrid views | Mixed—some runs ingested, some not |
| Recover data if files deleted | Ingest | Need resilience to file deletion |
## See Also
- Caching & Hydration — How materialization interacts with caching
- DLT Loader Guide — Detailed ingestion patterns using DLT
- Schema Export — Generate and manage schemas for ingested data