DLT Loader

DLT features require the optional ingest extra: pip install "consist[ingest]".

Consist dlt (Data Load Tool) Integration Module

This module provides the integration layer between Consist and the dlt library, facilitating the robust and efficient ingestion of artifact data into the DuckDB database. It is responsible for materializing various data formats (e.g., Pandas DataFrames, Parquet, CSV, Zarr metadata) and ensuring that Consist's system-level provenance columns (such as consist_run_id, consist_artifact_id) are correctly injected into the data.

Key functionalities include:

- Dynamic Schema Extension: User-defined SQLModel schemas are dynamically extended with Consist's provenance-tracking system columns.
- Flexible Ingestion Strategies: Supports different data ingestion mechanisms, including vectorized loading (for Pandas DataFrames, PyArrow tables) and streaming for large datasets.
- Format-Specific Handlers: Contains specialized functions for processing and preparing data from common file formats like Parquet and CSV, and for extracting structural metadata from Zarr archives.
- dlt Pipeline Integration: Leverages the dlt pipeline for robust data loading, automatic schema inference, and optional strict validation, ensuring data quality and consistency.
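The provenance-column injection described above can be sketched as a generator wrapper over a row stream. The helper name and its exact shape here are illustrative assumptions, not the module's actual implementation (which also applies the same columns to DataFrames and Arrow tables):

```python
from typing import Any, Dict, Iterable, Iterator


def inject_provenance(
    rows: Iterable[Dict[str, Any]],
    run_id: str,
    artifact_id: str,
    year: int,
    iteration: int,
) -> Iterator[Dict[str, Any]]:
    """Yield each row with Consist's system columns added (illustrative sketch)."""
    for row in rows:
        yield {
            **row,
            "consist_run_id": run_id,
            "consist_artifact_id": artifact_id,
            "consist_year": year,
            "consist_iteration": iteration,
        }


rows = [{"value": 1}, {"value": 2}]
out = list(inject_provenance(rows, "run-1", "art-1", 2024, 0))
# each output row now carries the four provenance columns alongside its data
```

Because the wrapper is a generator, it composes with the streaming ingestion path without materializing the full dataset in memory.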

ingest_artifact(artifact, run_context, db_path, data_iterable=None, schema_model=None, lock_retries=20, lock_base_sleep_seconds=0.1, lock_max_sleep_seconds=2.0)

Ingests artifact data into a DuckDB database using the dlt (Data Load Tool) library.

This function supports various data sources (file paths, Pandas DataFrames, iterables of dicts) and automatically injects Consist's provenance system columns (consist_run_id, consist_artifact_id, consist_year, consist_iteration) into the data. It leverages dlt for robust schema handling, including inference and optional strict validation based on a provided SQLModel.

Parameters:

- artifact (Artifact, required): The Consist Artifact object representing the data to be ingested. Its driver information is used to determine the appropriate data handler.
- run_context (Run, required): The Run object providing the context (ID, year, iteration) for provenance tracking.
- db_path (str, required): The file system path to the DuckDB database where the data will be loaded.
- data_iterable (Optional[Union[Iterable[Any], str, DataFrame]], default None): The data to ingest. Can be:
  - A file path (str) to a Parquet, CSV, HDF5, JSON, or Zarr file.
  - A Pandas DataFrame (treated as a single batch).
  - An iterable (e.g., list of dicts, generator) where each item represents a row.
  If None, the data is read directly from the artifact's URI.
- schema_model (Optional[Type[SQLModel]], default None): An optional SQLModel class that defines the expected schema for the data. If provided, dlt will use it for strict validation and schema management. If None, dlt will infer the schema.

Returns:

- Tuple[LoadInfo, str]: A tuple containing:
  - dlt.LoadInfo: An object providing detailed information about the data loading process.
  - str: The actual normalized table name where the data was loaded in the database.

Raises:

- ValueError: If no data is provided for ingestion, if the artifact driver is unsupported, or if a schema_model is provided and a schema contract violation occurs (e.g., new columns found in strict mode).
- ImportError: If a required library for a specific driver (e.g., pyarrow for Parquet, tables for HDF5, xarray/zarr for Zarr) is not installed.
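The lock_retries, lock_base_sleep_seconds, and lock_max_sleep_seconds parameters in the signature suggest a capped exponential backoff when the DuckDB file is held by another writer. A minimal sketch of such a retry loop, under the assumption that the schedule is exponential with a cap (Consist's actual retry logic and the exception type it catches may differ):

```python
import time


def retry_with_backoff(fn, retries=20, base_sleep=0.1, max_sleep=2.0):
    """Call fn(), retrying on lock errors with capped exponential backoff.

    Illustrative sketch only; OSError stands in for a database-lock error.
    """
    for attempt in range(retries):
        try:
            return fn()
        except OSError:
            if attempt == retries - 1:
                raise  # out of retries: surface the lock error
            time.sleep(min(base_sleep * (2 ** attempt), max_sleep))


# With the defaults, the sleep schedule runs 0.1, 0.2, 0.4, 0.8, 1.6,
# then stays capped at 2.0 seconds for the remaining attempts.
```

Capping the sleep keeps worst-case retry latency bounded while still backing off quickly under contention from concurrent pipeline runs.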