DLT Loader¶
Requires the optional `ingest` extra before using DLT features: `pip install "consist[ingest]"`.
Consist dlt (Data Load Tool) Integration Module
This module provides the integration layer between Consist and the dlt library,
facilitating the robust and efficient ingestion of artifact data into the DuckDB database.
It is responsible for materializing various data formats (e.g., Pandas DataFrames,
Parquet, CSV, Zarr metadata) and ensuring that Consist's system-level provenance columns
(such as consist_run_id, consist_artifact_id) are correctly injected into the data.
Key functionalities include:
- Dynamic Schema Extension: User-defined SQLModel schemas are dynamically extended
with Consist's provenance-tracking system columns.
- Flexible Ingestion Strategies: Supports different data ingestion mechanisms,
including vectorized loading (for Pandas DataFrames, PyArrow tables) and streaming
for large datasets.
- Format-Specific Handlers: Contains specialized functions for processing and
preparing data from common file formats like Parquet, CSV, and extracting
structural metadata from Zarr archives.
- dlt Pipeline Integration: Leverages the dlt pipeline for robust data loading,
automatic schema inference, and optional strict validation, ensuring data quality
and consistency.
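The provenance-column injection described above can be sketched for the streaming case as follows. This is an illustrative sketch only: the helper name and call pattern are assumptions, not Consist's actual API.

```python
# Illustrative sketch only: the helper name and call pattern are assumptions,
# not Consist's actual API. It shows how streamed rows can gain the
# provenance system columns before dlt loads them into DuckDB.
from typing import Any, Dict, Iterable, Iterator


def with_provenance(
    rows: Iterable[Dict[str, Any]],
    run_id: str,
    artifact_id: str,
    year: int,
    iteration: int,
) -> Iterator[Dict[str, Any]]:
    """Yield each row with Consist's system columns appended."""
    for row in rows:
        yield {
            **row,
            "consist_run_id": run_id,
            "consist_artifact_id": artifact_id,
            "consist_year": year,
            "consist_iteration": iteration,
        }


rows = [{"zone": "A", "pop": 100}, {"zone": "B", "pop": 250}]
tagged = list(with_provenance(rows, "run-1", "art-7", 2030, 0))
# Each row now carries the four provenance columns alongside its own fields.
```

Because the rows are produced lazily, this pattern works equally well for large datasets that should not be held in memory at once.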
ingest_artifact(artifact, run_context, db_path, data_iterable=None, schema_model=None, lock_retries=20, lock_base_sleep_seconds=0.1, lock_max_sleep_seconds=2.0)¶
Ingests artifact data into a DuckDB database using the dlt (Data Load Tool) library.
This function supports various data sources (file paths, Pandas DataFrames, iterables of dicts)
and automatically injects Consist's provenance system columns (consist_run_id,
consist_artifact_id, consist_year, consist_iteration) into the data. It leverages
dlt for robust schema handling, including inference and optional strict validation
based on a provided SQLModel.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `artifact` | `Artifact` | The Consist `Artifact` whose data is being ingested. | required |
| `run_context` | `Run` | The Consist `Run` supplying the provenance values (run ID, year, iteration) injected into the data. | required |
| `db_path` | `str` | The file system path to the DuckDB database where the data will be loaded. | required |
| `data_iterable` | `Optional[Union[Iterable[Any], str, DataFrame]]` | The data to ingest. Can be: a file path (`str`) to a Parquet, CSV, HDF5, JSON, or Zarr file; a Pandas DataFrame (treated as a single batch); or an iterable (e.g., a list of dicts or a generator) where each item represents a row. If `None`, the data is read from the artifact itself. | `None` |
| `schema_model` | `Optional[Type[SQLModel]]` | An optional `SQLModel` class; when provided, it is extended with Consist's system columns and used for strict validation during ingestion. | `None` |
| `lock_retries` | `int` | Number of attempts to acquire the DuckDB database before giving up. | `20` |
| `lock_base_sleep_seconds` | `float` | Initial sleep between lock retries. | `0.1` |
| `lock_max_sleep_seconds` | `float` | Upper bound on the sleep between lock retries. | `2.0` |
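When `data_iterable` is a file path, the artifact's format determines how it is read. As a rough illustration of that extension-based dispatch, here is a minimal sketch; the mapping and function name are assumptions, not Consist's actual implementation.

```python
# Illustrative sketch of extension-based driver dispatch; the mapping and
# function name are assumptions, not Consist's actual implementation.
from pathlib import Path

_DRIVERS = {
    ".parquet": "parquet",
    ".csv": "csv",
    ".h5": "hdf5",
    ".json": "json",
    ".zarr": "zarr",
}


def pick_driver(path: str) -> str:
    """Map a file path to an ingestion driver, or raise ValueError."""
    suffix = Path(path).suffix.lower()
    try:
        return _DRIVERS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported artifact driver for {path!r}") from None
```

An unrecognized extension raises `ValueError`, mirroring the unsupported-driver case documented under Raises below.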
Returns:

| Type | Description |
|---|---|
| `Tuple[LoadInfo, str]` | A tuple containing the dlt `LoadInfo` for the completed load and the name of the destination table. |
Raises:

| Type | Description |
|---|---|
| `ValueError` | If no data is provided for ingestion, or if the artifact driver is unsupported. |
| `ImportError` | If a required library for a specific driver (e.g., HDF5 or Zarr support) is not installed. |
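The `lock_retries`, `lock_base_sleep_seconds`, and `lock_max_sleep_seconds` parameters suggest a capped exponential backoff while waiting for the DuckDB file lock. A minimal sketch of such a schedule, assuming doubling-with-cap (Consist's actual retry policy may differ):

```python
# Illustrative sketch of a capped exponential backoff schedule for the
# DuckDB file lock; Consist's actual retry policy may differ.
from typing import List


def backoff_sleeps(retries: int, base: float, max_sleep: float) -> List[float]:
    """Sleep durations per retry: doubling from `base`, capped at `max_sleep`."""
    return [min(base * (2 ** attempt), max_sleep) for attempt in range(retries)]
```

With the documented defaults (`lock_retries=20`, base 0.1 s, cap 2.0 s), the wait ramps up over the first few attempts and then holds at 2.0 s per retry, which keeps contention from concurrent writers bounded without hammering the database file.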