DLT Loader

DLT features require the optional ingest extra: pip install "consist[ingest]".

Consist dlt (Data Load Tool) Integration Module

This module provides the integration layer between Consist and the dlt library, facilitating the robust and efficient ingestion of artifact data into the DuckDB database. It is responsible for materializing various data formats (e.g., Pandas DataFrames, Parquet, CSV, Zarr metadata) and ensuring that Consist's system-level provenance columns (such as consist_run_id, consist_artifact_id) are correctly injected into the data.

Key functionalities include:

- Dynamic Schema Extension: User-defined SQLModel schemas are dynamically extended with Consist's provenance-tracking system columns.
- Flexible Ingestion Strategies: Supports different data ingestion mechanisms, including vectorized loading (for Pandas DataFrames, PyArrow tables) and streaming for large datasets.
- Format-Specific Handlers: Contains specialized functions for processing and preparing data from common file formats like Parquet and CSV, and for extracting structural metadata from Zarr archives.
- dlt Pipeline Integration: Leverages the dlt pipeline for robust data loading, automatic schema inference, and optional strict validation, ensuring data quality and consistency.
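The provenance-column injection described above can be sketched as a generator wrapper over a row stream. The helper name and its exact shape here are illustrative assumptions, not the module's actual implementation (which also applies the same columns to DataFrames and Arrow tables):

```python
from typing import Any, Dict, Iterable, Iterator


def inject_provenance(
    rows: Iterable[Dict[str, Any]],
    run_id: str,
    artifact_id: str,
    year: int,
    iteration: int,
) -> Iterator[Dict[str, Any]]:
    """Yield each row with Consist's system columns added (illustrative sketch)."""
    for row in rows:
        yield {
            **row,
            "consist_run_id": run_id,
            "consist_artifact_id": artifact_id,
            "consist_year": year,
            "consist_iteration": iteration,
        }


rows = [{"value": 1}, {"value": 2}]
out = list(inject_provenance(rows, "run-1", "art-1", 2024, 0))
# each output row now carries the four provenance columns alongside its data
```

Because the wrapper is a generator, it composes with the streaming ingestion path without materializing the full dataset in memory.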

ingest_artifact(artifact, run_context, db_path, data_iterable=None, schema_model=None, lock_retries=20, lock_base_sleep_seconds=0.1, lock_max_sleep_seconds=2.0)

Ingests artifact data into a DuckDB database using the dlt (Data Load Tool) library.

This function supports various data sources (file paths, Pandas DataFrames, iterables of dicts) and automatically injects Consist's provenance system columns (consist_run_id, consist_artifact_id, consist_year, consist_iteration) into the data. It leverages dlt for robust schema handling, including inference and optional strict validation based on a provided SQLModel.

Parameters:

- artifact (Artifact, required): The Consist Artifact object representing the data to be ingested. Its driver information is used to determine the appropriate data handler.
- run_context (Run, required): The Run object providing the context (ID, year, iteration) for provenance tracking.
- db_path (str, required): The file system path to the DuckDB database where the data will be loaded.
- data_iterable (Optional[Union[Iterable[Any], str, DataFrame]], default None): The data to ingest. Can be:
  - A file path (str) to a Parquet, CSV, HDF5, JSON, or Zarr file.
  - A Pandas DataFrame (treated as a single batch).
  - An iterable (e.g., list of dicts, generator) where each item represents a row.
  If None, the data is read directly from the artifact's URI.
- schema_model (Optional[Type[SQLModel]], default None): An optional SQLModel class that defines the expected schema for the data. If provided, dlt will use it for strict validation and schema management. If None, dlt will infer the schema.

Returns:

- Tuple[LoadInfo, str]: A tuple containing:
  - dlt.LoadInfo: An object providing detailed information about the data loading process.
  - str: The actual normalized table name where the data was loaded in the database.

Raises:

- ValueError: If no data is provided for ingestion, if the artifact driver is unsupported, or if a schema_model is provided and a schema contract violation occurs (e.g., new columns found in strict mode).
- ImportError: If a required library for a specific driver (e.g., pyarrow for Parquet, tables for HDF5, xarray/zarr for Zarr) is not installed.
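The lock_retries, lock_base_sleep_seconds, and lock_max_sleep_seconds parameters in the signature suggest a capped exponential backoff when the DuckDB file is held by another writer. A minimal sketch of such a retry loop, under the assumption that the schedule is exponential with a cap (Consist's actual retry logic and the exception type it catches may differ):

```python
import time


def retry_with_backoff(fn, retries=20, base_sleep=0.1, max_sleep=2.0):
    """Call fn(), retrying on lock errors with capped exponential backoff.

    Illustrative sketch only; OSError stands in for a database-lock error.
    """
    for attempt in range(retries):
        try:
            return fn()
        except OSError:
            if attempt == retries - 1:
                raise  # out of retries: surface the lock error
            time.sleep(min(base_sleep * (2 ** attempt), max_sleep))


# With the defaults, the sleep schedule runs 0.1, 0.2, 0.4, 0.8, 1.6,
# then stays capped at 2.0 seconds for the remaining attempts.
```

Capping the sleep keeps worst-case retry latency bounded while still backing off quickly under contention from concurrent pipeline runs.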