Identity Manager
Manage the cryptographic identity and Merkle-tree state of simulation workflows.
The IdentityManager is responsible for generating deterministic signatures for Runs and Artifacts, forming the core of Consist's reproducibility engine. By synthesizing code version (Git), configuration parameters, and input provenance into composite SHA256 hashes, it ensures that any divergence in computational logic or data state results in a unique identity.
The primary run signature is defined by the following composition: H_run = SHA256( H_code + H_config + H_inputs )
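The composition above can be sketched in a few lines. This is a minimal illustration, assuming the three component hashes are hex strings and that `+` denotes plain string concatenation; the name `run_signature_sketch` is hypothetical and the joining scheme is not confirmed to match Consist's internals.

```python
import hashlib


def run_signature_sketch(code_hash: str, config_hash: str, input_hash: str) -> str:
    """Combine three component hashes into one SHA256 hex digest."""
    # Assumption: '+' in the formula means string concatenation.
    combined = code_hash + config_hash + input_hash
    return hashlib.sha256(combined.encode("utf-8")).hexdigest()


baseline = run_signature_sketch("c0de", "c0nf", "1npu7")
changed = run_signature_sketch("c0de", "c0nf-edited", "1npu7")
print(baseline != changed)  # any change to a component yields a new signature
```

Because the signature doubles as a cache key, a change to any one component is enough to force a cache miss.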
canonical_json_str(obj)
Return a stable JSON string for hashing/IDs.
Uses _clean_structure to normalize types and then dumps with deterministic
key ordering and compact separators.
canonical_json_sha256(obj)
SHA256 hex digest of canonical_json_str(obj).
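As a rough illustration of what deterministic key ordering and compact separators mean in practice, the sketch below uses only the standard library. The names `canonical_json_sketch` and `canonical_sha256_sketch` are hypothetical, and the `_clean_structure` normalization step is deliberately left out, so this is not a drop-in equivalent.

```python
import hashlib
import json


def canonical_json_sketch(obj) -> str:
    # sort_keys fixes the key order; compact separators remove optional whitespace.
    return json.dumps(obj, sort_keys=True, separators=(",", ":"))


def canonical_sha256_sketch(obj) -> str:
    return hashlib.sha256(canonical_json_sketch(obj).encode("utf-8")).hexdigest()


a = {"b": 1, "a": [1, 2]}
b = {"a": [1, 2], "b": 1}  # same content, different insertion order
print(canonical_json_sketch(a))  # {"a":[1,2],"b":1}
print(canonical_sha256_sketch(a) == canonical_sha256_sketch(b))  # True
```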
normalize_json(obj)
Normalize Python structures into JSON-friendly types.
This mirrors the canonical hashing cleanup but preserves the full structure without excluding any keys.
calculate_run_signature(code_hash, config_hash, input_hash)
Computes the final cryptographic signature (cache key) for a run.
get_code_version()
Retrieves the global 'Code Identity' using the Git Commit SHA.
This uses GitPython directly to avoid subprocess overhead and parsing fragility.
compute_callable_hash(func, strategy='module', extra_deps=None)
Computes a hash for a specific Python function/callable.
This allows for granular caching (ignoring global repo changes) by focusing on the relevant code.
Strategies:
- 'source': Hashes ONLY the function's source code (via inspect.getsource). Use this for pure functions with no external dependencies.
- 'module': Hashes the entire file (.py) where the function is defined. This is the robust "in-between": it captures helper functions and constants in the same file, but ignores changes in unrelated files.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| func | Callable | The function to hash. | required |
| strategy | str | The hashing strategy ("source" or "module"). | "module" |
| extra_deps | List[str] | List of additional file paths (relative to project root) that this function depends on. Their content will be mixed into the hash. | None |
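The difference between the two strategies is easiest to see when only a module-level constant changes. The sketch below is an approximation under stated assumptions: `callable_hash_sketch` and `load_function` are hypothetical helpers, `extra_deps` is omitted, and two throwaway modules are written to a temporary directory so the example is self-contained.

```python
import hashlib
import importlib.util
import inspect
import tempfile
from pathlib import Path


def callable_hash_sketch(func, strategy: str = "module") -> str:
    """Hash a callable by its own source or by the file that defines it."""
    if strategy == "source":
        payload = inspect.getsource(func).encode("utf-8")
    elif strategy == "module":
        payload = Path(inspect.getsourcefile(func)).read_bytes()
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return hashlib.sha256(payload).hexdigest()


def load_function(path: Path, name: str):
    """Import a throwaway module from disk and return one of its functions."""
    spec = importlib.util.spec_from_file_location(path.stem, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return getattr(module, name)


with tempfile.TemporaryDirectory() as tmp:
    v1 = Path(tmp) / "model_v1.py"
    v2 = Path(tmp) / "model_v2.py"
    body = "def step(x):\n    return x * RATE\n"
    v1.write_text("RATE = 2\n" + body)
    v2.write_text("RATE = 3\n" + body)  # only the module-level constant differs

    f1 = load_function(v1, "step")
    f2 = load_function(v2, "step")
    same_source = callable_hash_sketch(f1, "source") == callable_hash_sketch(f2, "source")
    same_module = callable_hash_sketch(f1, "module") == callable_hash_sketch(f2, "module")

print(same_source, same_module)  # True False
```

Here the 'source' hash misses the change to `RATE`, which is why it suits only functions with no external dependencies, while the 'module' hash picks it up.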
compute_config_hash(config, exclude_keys=None)
Generate a deterministic cryptographic hash of a configuration structure.
This method implements canonical configuration hashing by normalizing Python dictionaries, lists, and Pydantic models into a stable state. It explicitly addresses the 'NumPy Problem' by converting numerical primitives into native Python types and ensures order-independence through recursive key sorting.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| config | Dict[str, Any] | The configuration dictionary or Pydantic model to hash. | required |
| exclude_keys | Optional[List[str]] | A collection of keys to be omitted from the identity calculation (e.g., non-deterministic timestamps or local file paths). | None |
Returns:
| Type | Description |
|---|---|
| str | A SHA256 hex digest representing the canonical configuration identity. |
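The normalization described above can be approximated as follows. This is a sketch, not the library's code: `clean_sketch` and `config_hash_sketch` are hypothetical names, `FakeNumpyScalar` stands in for a NumPy scalar so the example runs without NumPy, and applying `exclude_keys` at every nesting level is an assumption.

```python
import hashlib
import json
from typing import Any, Iterable, Optional


def clean_sketch(value: Any, exclude: frozenset) -> Any:
    """Recursively convert a structure into plain JSON-friendly Python types."""
    if hasattr(value, "model_dump"):  # Pydantic v2 models expose model_dump()
        value = value.model_dump()
    if isinstance(value, dict):
        # Assumption: excluded keys are dropped at every nesting level.
        return {str(k): clean_sketch(v, exclude) for k, v in value.items() if k not in exclude}
    if isinstance(value, (list, tuple)):
        return [clean_sketch(v, exclude) for v in value]
    if hasattr(value, "tolist"):  # NumPy arrays and scalars expose tolist()
        return clean_sketch(value.tolist(), exclude)
    return value


def config_hash_sketch(config: Any, exclude_keys: Optional[Iterable[str]] = None) -> str:
    cleaned = clean_sketch(config, frozenset(exclude_keys or ()))
    text = json.dumps(cleaned, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class FakeNumpyScalar:
    """Stand-in for a NumPy scalar so the sketch runs without NumPy installed."""

    def __init__(self, value):
        self._value = value

    def tolist(self):
        return self._value


h1 = config_hash_sketch(
    {"rate": 0.5, "steps": FakeNumpyScalar(10), "started_at": "09:00"},
    exclude_keys=["started_at"],
)
h2 = config_hash_sketch(
    {"steps": 10, "started_at": "17:30", "rate": 0.5},
    exclude_keys=["started_at"],
)
print(h1 == h2)  # True: key order, scalar type, and excluded keys do not matter
```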
compute_run_config_hash(*, config, model, year=None, iteration=None, cache_epoch=None, cache_version=None)
Compute a config hash for a run, mixing in identity-relevant run fields.
Tracker persists config for human inspection, but caching identity needs to
include some run context fields that are frequently semantically relevant,
such as year, iteration, and cache versioning.
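One way to picture the mixing is to wrap the config and the run fields in a single envelope before hashing. The envelope layout and the name `run_config_hash_sketch` are assumptions made for illustration; the actual field names and structure used by Consist are not documented here.

```python
import hashlib
import json


def run_config_hash_sketch(*, config, model, year=None, iteration=None,
                           cache_epoch=None, cache_version=None) -> str:
    # Hypothetical envelope: config plus the identity-relevant run fields.
    envelope = {
        "config": config,
        "model": model,
        "year": year,
        "iteration": iteration,
        "cache_epoch": cache_epoch,
        "cache_version": cache_version,
    }
    text = json.dumps(envelope, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


cfg = {"rate": 0.5}
h_2030 = run_config_hash_sketch(config=cfg, model="demand", year=2030)
h_2035 = run_config_hash_sketch(config=cfg, model="demand", year=2035)
print(h_2030 != h_2035)  # True: same config, different year, different identity
```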
compute_input_hash(inputs, path_resolver=None, signature_lookup=None)
Synthesize a deterministic hash representing the aggregate state of all input artifacts.
This calculation is fundamental to the construction of the Merkle DAG. It incorporates the unique identities of all inputs to ensure that upstream data mutations correctly invalidate downstream caches.
The identity of an input is determined by its provenance:
1. Managed Artifacts: If the artifact was produced by a previous Consist run, its identity is derived from the producing run's cryptographic signature.
2. Exogenous Files: If the input is a raw file, its identity is established through a physical content or metadata hash of the filesystem object.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| inputs | List[Artifact] | A collection of Artifact instances representing the run's dependencies. | required |
| path_resolver | Optional[Callable[[str], str]] | A function to resolve virtualized URIs to absolute filesystem paths, required for hashing exogenous files. | None |
| signature_lookup | Optional[Callable[[str], Optional[str]]] | A function to retrieve the run signatures of producing runs, facilitating Merkle-link construction. | None |
Returns:
| Type | Description |
|---|---|
| str | A SHA256 hex digest representing the combined input identity. |
Raises:
| Type | Description |
|---|---|
| ValueError | If an exogenous file requires hashing but no path_resolver is provided. |
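The two provenance cases can be sketched as below. `ArtifactSketch` is a hypothetical stand-in for `Artifact` (its `uri` and `run_id` fields are assumptions), and sorting the per-input identities before combining them is a design choice of this sketch rather than documented behavior.

```python
import hashlib
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, List, Optional


@dataclass
class ArtifactSketch:
    """Hypothetical stand-in for Artifact; the real class has its own fields."""
    uri: str
    run_id: Optional[str] = None  # set when a previous run produced the artifact


def input_hash_sketch(
    inputs: List[ArtifactSketch],
    path_resolver: Optional[Callable[[str], str]] = None,
    signature_lookup: Optional[Callable[[str], Optional[str]]] = None,
) -> str:
    identities = []
    for art in inputs:
        signature = None
        if art.run_id is not None and signature_lookup is not None:
            signature = signature_lookup(art.run_id)
        if signature is not None:
            # Managed artifact: identity is the producing run's signature.
            identities.append(f"{art.uri}:run:{signature}")
            continue
        if path_resolver is None:
            raise ValueError(f"path_resolver is required to hash {art.uri!r}")
        # Exogenous file: identity comes from the file's content.
        content = Path(path_resolver(art.uri)).read_bytes()
        identities.append(f"{art.uri}:file:{hashlib.sha256(content).hexdigest()}")
    # Sorting makes the result independent of input order (an assumption here).
    return hashlib.sha256("\n".join(sorted(identities)).encode("utf-8")).hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "zones.csv"
    raw.write_text("zone,pop\n1,100\n")
    inputs = [ArtifactSketch("inputs://zones.csv"),
              ArtifactSketch("runs://skims", run_id="r1")]

    def resolve(uri: str) -> str:
        return str(raw)

    before = input_hash_sketch(inputs, resolve, {"r1": "sig-a"}.get)
    after = input_hash_sketch(inputs, resolve, {"r1": "sig-b"}.get)

print(before != after)  # True: a new upstream signature invalidates the input hash
```

Linking to the producing run's signature instead of rehashing its output is what lets a change far upstream propagate to every downstream cache key.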
compute_file_checksum(file_path)
Computes a cryptographic identifier for a given file or directory based on the configured hashing strategy.
This method is critical for establishing the unique identity of raw file-based inputs to a Consist run. It supports two main strategies: 'full' (content-based) and 'fast' (metadata-based), and handles both single files and directories.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| file_path | str | The absolute path to the file or directory for which to compute the checksum. | required |
Returns:
| Type | Description |
|---|---|
| str | A SHA256 hex digest representing the checksum or identity of the file/directory. |
Raises:
| Type | Description |
|---|---|
| FileNotFoundError | If the specified |
Warns:
| Type | Description |
|---|---|
| UserWarning | If 'full' content hashing is performed on a directory, as this can be computationally expensive for large directories. |
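The trade-off between the two strategies can be illustrated for a single file. In this sketch the name `file_checksum_sketch` and the choice of size plus modification time as the 'fast' metadata are assumptions, and directory handling is left out.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def file_checksum_sketch(file_path: str, strategy: str = "full") -> str:
    path = Path(file_path)
    if not path.is_file():
        raise FileNotFoundError(file_path)
    digest = hashlib.sha256()
    if strategy == "fast":
        # Metadata only: cheap, but blind to edits that keep size and mtime.
        stat = path.stat()
        digest.update(f"{stat.st_size}:{stat.st_mtime_ns}".encode("utf-8"))
    else:
        # Content: read in chunks so large files need not fit in memory.
        with path.open("rb") as handle:
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "data.bin"
    target.write_bytes(b"abc")
    original = target.stat()
    full_before = file_checksum_sketch(str(target), "full")
    fast_before = file_checksum_sketch(str(target), "fast")

    target.write_bytes(b"xyz")  # same size, new content
    os.utime(target, ns=(original.st_atime_ns, original.st_mtime_ns))  # restore mtime

    full_after = file_checksum_sketch(str(target), "full")
    fast_after = file_checksum_sketch(str(target), "fast")

print(full_before != full_after)  # True: content hashing sees the edit
print(fast_before == fast_after)  # True: metadata hashing does not
```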
label_for_hash_input(path)
Create a stable, human-friendly label for a hash input path.
This is used when recording inputs that are represented only by their
hash (e.g., "hash-only" config inputs). The method prefers a path that
is relative to project_root for readability and portability, and
falls back to the original string if it cannot be made relative.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | Union[str, Path] | A file or directory path used as a hash input. | required |
Returns:
| Type | Description |
|---|---|
| str | A string label suitable for logs and provenance records. |
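The prefer-relative, fall-back-to-original behavior might look like the sketch below. `label_sketch` is a hypothetical free function that takes `project_root` as an explicit argument, and the use of POSIX-style separators in the label is an assumption.

```python
from pathlib import Path
from typing import Union


def label_sketch(path: Union[str, Path], project_root: Union[str, Path]) -> str:
    try:
        relative = Path(path).resolve().relative_to(Path(project_root).resolve())
        return relative.as_posix()
    except ValueError:
        # Path lies outside project_root: fall back to the original string.
        return str(path)


inside = label_sketch("project/configs/base.yaml", "project")
outside = label_sketch("/elsewhere/x.yaml", "project")
print(inside)   # configs/base.yaml
print(outside)  # /elsewhere/x.yaml
```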
digest_path(path, *, ignore_dotfiles=True, allowlist=None, hashing_strategy_override=None)
Digest a file or directory with optional filtering.
- Files: delegated to compute_file_checksum (honors hashing_strategy).
- Directories: deterministic digest over relative paths + (content or metadata).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | Union[str, Path] | The filesystem path to be digested. If a directory is provided, the method computes an aggregate identity across all contained files based on the active hashing strategy. | required |
| ignore_dotfiles | bool | If True, ignore any file whose relative path includes a component starting with '.'. | True |
| allowlist | Optional[List[str]] | If provided, only include files whose relative path matches at least one glob pattern. | None |
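A directory digest with both filters can be sketched as follows. The name `digest_dir_sketch` is hypothetical, the sketch always hashes file content (the metadata variant is omitted), and matching with `fnmatch` is an assumption about how the glob patterns are interpreted.

```python
import fnmatch
import hashlib
import tempfile
from pathlib import Path
from typing import List, Optional


def digest_dir_sketch(root: Path, *, ignore_dotfiles: bool = True,
                      allowlist: Optional[List[str]] = None) -> str:
    digest = hashlib.sha256()
    files = [p for p in root.rglob("*") if p.is_file()]
    # Sort by relative POSIX path so traversal order never affects the digest.
    for file in sorted(files, key=lambda p: p.relative_to(root).as_posix()):
        rel = file.relative_to(root)
        if ignore_dotfiles and any(part.startswith(".") for part in rel.parts):
            continue
        # Note: fnmatch's '*' also matches '/', so '*.csv' matches nested files.
        if allowlist is not None and not any(
            fnmatch.fnmatch(rel.as_posix(), pattern) for pattern in allowlist
        ):
            continue
        digest.update(rel.as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(hashlib.sha256(file.read_bytes()).digest())
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.csv").write_text("x\n1\n")
    (root / "notes.txt").write_text("draft\n")
    (root / ".cache").mkdir()
    (root / ".cache" / "tmp.bin").write_bytes(b"\x00")

    base = digest_dir_sketch(root)
    csv_only = digest_dir_sketch(root, allowlist=["*.csv"])

    (root / ".cache" / "tmp.bin").write_bytes(b"\x01")  # hidden file changes
    hidden_change_ignored = digest_dir_sketch(root) == base

    (root / "notes.txt").write_text("final\n")  # non-CSV file changes
    allowlist_unaffected = digest_dir_sketch(root, allowlist=["*.csv"]) == csv_only
    full_digest_changed = digest_dir_sketch(root) != base

print(hidden_change_ignored, allowlist_unaffected, full_digest_changed)  # True True True
```

Including the relative path in the digest means that renaming a file changes the directory's identity even when its content does not.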
compute_hash_inputs_digests(hash_inputs, *, ignore_dotfiles=True, allowlist=None)
Compute digests for external "hash-only" config inputs (files or directories).
Items may be:
- A path (str/Path): label derived from project-relative path when possible.
- A (label, path) tuple: explicit label.
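The two accepted item shapes can be normalized into (label, path) pairs before digesting. This sketch is illustrative only: `normalize_hash_inputs_sketch` is a hypothetical helper, and for bare paths it simply reuses the path string as the label instead of deriving a project-relative one.

```python
from pathlib import Path
from typing import List, Tuple, Union

HashInput = Union[str, Path, Tuple[str, Union[str, Path]]]


def normalize_hash_inputs_sketch(items: List[HashInput]) -> List[Tuple[str, Path]]:
    pairs = []
    for item in items:
        if isinstance(item, tuple):
            label, path = item  # explicit label supplied by the caller
        else:
            label, path = Path(item).as_posix(), item  # simplified derived label
        pairs.append((label, Path(path)))
    return pairs


pairs = normalize_hash_inputs_sketch(["configs/base.yaml", ("network", "data/net")])
print([label for label, _ in pairs])  # ['configs/base.yaml', 'network']
```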