Skip to content

Data Virtualization & Materialization Strategy

This guide explains when to keep data on disk (virtualize) versus ingesting it into DuckDB (materialize).

Canonical one-line definitions for cold/hot data, hybrid views, and materialization live in Core Concepts Overview. This page focuses on decision criteria and implementation patterns.

Recommended path

For standard execution flow, the recommended path is consist.run(...), consist.trace(...), or consist.scenario(...). The ingestion snippets below use low-level lifecycle APIs (tracker.start_run(...), tracker.log_artifact(...)) because ingestion is attached to explicit artifact logging.


Why the Distinction Matters

Simulation workflows produce many artifacts across runs. Comparing results requires querying across runs—but ingesting every output doubles your database size. The cold/hot distinction lets you choose: keep storage efficient for artifacts you rarely query, enable SQL analysis for artifacts you compare frequently.

The tradeoff: ingested data is queryable and recoverable; cold data requires keeping original files but adds no database overhead.


Two Storage Modes

Cold Data: File-backed artifacts with provenance metadata but no SQL query surface until ingested.

Hot Data: Ingested artifacts in DuckDB for SQL analytics, indexing, and optional recovery when source files are missing.


When to Ingest

Ingest when you need:

  • SQL-native analytics over many runs
  • Schema tracking and exportable SQLModel stubs
  • Recovery of tabular data if files are missing
  • Cross-run comparisons via unified SQL views

Ingest key tabular outputs you compare across runs:

  • Transportation: Trip tables, skim matrices, mode choice summaries
  • Electric grid: Aggregated metrics (monthly averages, distribution-level summaries)
  • Urban planning: Parcel or zoning summaries, development capacity estimates

Ingestion is appropriate for outputs where (1) you need SQL queries across runs, and (2) the storage cost is acceptable—ingestion approximately doubles storage for that artifact. For a 500MB output compared across 100 runs, ingestion adds ~50GB to the database.


When to Stay Cold

Keep data as cold files when:

  • Storage duplication is unacceptable (multi-GB rasters, video files, binary blobs)
  • Data already lives in an external system
  • No cross-run SQL analysis is needed

Cold data lives only on disk; provenance metadata is tracked but data is not queryable via SQL or included in hybrid views until ingested.


Ingesting Artifacts

Advanced lifecycle examples

The following snippets intentionally show explicit run lifecycle calls for ingestion control. Keep regular compute steps on the recommended path (run/trace/scenario).

Install the optional DLT dependency:

pip install "consist[ingest]"

Basic usage:

from consist import Tracker

tracker = Tracker(run_dir="./runs", db_path="./provenance.duckdb")

with tracker.start_run("ingest_example", model="demo"):
    artifact = tracker.log_artifact("data/people.csv", key="people", driver="csv")
    tracker.ingest(artifact)

Ingest explicit data payloads (dicts, DataFrames, generators):

with tracker.start_run("ingest_inline", model="demo"):
    artifact = tracker.log_artifact("inputs/people.json", key="people")
    tracker.ingest(artifact, data=[{"id": 1, "name": "A"}])

Strict Schema Ingestion

Provide a SQLModel schema to validate data:

from sqlmodel import SQLModel, Field

class Person(SQLModel, table=True):
    __tablename__ = "people"
    id: int = Field(primary_key=True)
    name: str

with tracker.start_run("strict_ingest", model="demo"):
    artifact = tracker.log_artifact("data/people.csv", key="people", schema=Person)
    tracker.ingest(artifact, schema=Person)

Strict mode rejects incompatible data rather than silently coercing it.


Provenance Columns

Ingested tables include system columns:

Column Purpose
consist_run_id Run that created this row
consist_artifact_id Source artifact
consist_year, consist_iteration Scenario dimensions (optional)

These enable joins across runs and filtering by scenario or year in hybrid views.


Hybrid Views

Hybrid views are SQL views that union hot data (ingested into DuckDB) with cold data (read from files on disk). Query both as one logical table without requiring all data in the database.

Register a SQLModel schema to activate a view:

from sqlmodel import SQLModel, Field

class Person(SQLModel, table=True):
    __tablename__ = "people"
    person_id: int = Field(primary_key=True)
    age: int

tracker = Tracker(
    run_dir="./runs",
    db_path="./provenance.duckdb",
    schemas=[Person],
)

VPerson = tracker.views.Person

Once registered, you can query across all runs (both ingested and cold data):

from sqlmodel import select, func
import consist

query = (
    select(VPerson.consist_year, func.avg(VPerson.age).label("avg_age"))
    .group_by(VPerson.consist_year)
)
rows = consist.run_query(query, tracker=tracker)

For schema/facet-driven cross-key analysis, use Grouped Views to create one view over many artifacts that share a schema identity instead of a single concept_key.


Loading Behavior & Database Fallback

consist.load(...) reads from the filesystem. If files are missing and the artifact was ingested, database fallback recovers data:

Policy Behavior
inputs-only (default) Recover inside active run for declared inputs only
always Recover whenever artifact is ingested
never Disable recovery

Example:

df = consist.load_df(artifact, tracker=tracker, db_fallback="always")

Loader Kwargs

consist.load(...) validates loader kwargs and raises ValueError for unsupported options.

Driver Supported kwargs
parquet columns
csv columns, delimiter, header (alias: sep)
json orient, dtype, convert_axes, convert_dates, precise_float, date_unit, encoding, lines, compression, typ
h5_table columns, where, start, stop

Use consist.load_relation(...) to manage DuckDB Relation lifecycle explicitly. Use consist.load_df(...) for DataFrame with automatic cleanup.


Schema Export Workflow

For curated schemas with explicit PK/FK constraints: ingest tabular data, run schema export to generate a SQLModel stub, then edit and register the stub for views. See schema-export.md for details.


HDF5 Native Support Roadmap

Currently h5_table uses a staging bridge (pandas.read_hdf(...) + conn.from_df(...)). Native HDF5 support will replace this with DuckDB's HDF5 extension for direct queries.

Remaining work: switch h5_table to hdf5_read(...) once stable, validate extension version pinning, handle PyTables/object dtype edge cases, update schema capture for native relation path, add performance testing.

Benefits: zero-copy streaming reads, lower memory usage, consistent SQL workflows across Parquet/CSV/HDF5.


Decision Summary

Need Approach When
Query results across runs in SQL Ingest Cross-run comparisons, storage cost acceptable
Minimize database size Stay cold Large files, no cross-run SQL analysis
Query ingested and raw files together Hybrid views Mixed—some runs ingested, some not
Recover data if files deleted Ingest Need resilience to file deletion

See Also