
Identity Manager

Manage the cryptographic identity and Merkle-tree state of simulation workflows.

The IdentityManager is responsible for generating deterministic signatures for Runs and Artifacts, forming the core of Consist's reproducibility engine. By synthesizing code version (Git), configuration parameters, and input provenance into composite SHA256 hashes, it ensures that any divergence in computational logic or data state results in a unique identity.

The primary run signature is defined by the following composition: H_run = SHA256( H_code + H_config + H_inputs )
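As a minimal sketch of this composition: the helper name `run_signature` is hypothetical, and it assumes the three component hashes are hex strings joined by plain concatenation. The exact delimiter and encoding Consist uses are not stated here.

```python
import hashlib


def run_signature(code_hash: str, config_hash: str, input_hash: str) -> str:
    """Combine three component hashes into one SHA256 hex digest.

    Assumption: components are concatenated as UTF-8 text with no delimiter.
    """
    combined = code_hash + config_hash + input_hash
    return hashlib.sha256(combined.encode("utf-8")).hexdigest()


base = run_signature("abc", "def", "123")
changed = run_signature("abc", "def", "124")  # only the input hash differs
print(base != changed)
```

A change to any one component yields a different signature, which is what lets the signature serve as a cache key.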

canonical_json_str(obj)

Return a stable JSON string for hashing/IDs.

Uses _clean_structure to normalize types and then dumps with deterministic key ordering and compact separators.

canonical_json_sha256(obj)

SHA256 hex digest of canonical_json_str(obj).
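The two helpers above can be approximated with the standard library alone. This sketch skips the _clean_structure normalization step and uses hypothetical names (`canonical_json`, `canonical_sha256`). It only illustrates the sorted keys and compact separators.

```python
import hashlib
import json


def canonical_json(obj) -> str:
    """Dump obj with sorted keys and compact separators (no whitespace)."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":"))


def canonical_sha256(obj) -> str:
    """SHA256 hex digest of the canonical JSON string."""
    return hashlib.sha256(canonical_json(obj).encode("utf-8")).hexdigest()


a = canonical_json({"b": 1, "a": [1, 2]})
b = canonical_json({"a": [1, 2], "b": 1})
print(a)  # {"a":[1,2],"b":1}
digest = canonical_sha256({"b": 1, "a": [1, 2]})
```

Because the keys are sorted before dumping, dictionaries that differ only in insertion order produce the same string and therefore the same digest.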

normalize_json(obj)

Normalize Python structures into JSON-friendly types.

This mirrors the canonical hashing cleanup but preserves the full structure without excluding any keys.
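One way such a normalization could look is sketched below. The specific rules (tuples to lists, sets to sorted lists, paths to POSIX strings, keys to strings) are assumptions chosen for illustration, not a description of what Consist actually converts.

```python
from pathlib import Path
from typing import Any


def normalize(obj: Any) -> Any:
    """Recursively convert common Python types into JSON-friendly ones."""
    if isinstance(obj, dict):
        return {str(k): normalize(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [normalize(v) for v in obj]
    if isinstance(obj, (set, frozenset)):
        # Sort by repr so mixed-type sets still get a stable order.
        return sorted((normalize(v) for v in obj), key=repr)
    if isinstance(obj, Path):
        return obj.as_posix()
    return obj


result = normalize({"dims": (2, 3), "tags": {"b", "a"}, 1: Path("data") / "x.csv"})
print(result)
```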

calculate_run_signature(code_hash, config_hash, input_hash)

Computes the final cryptographic signature (cache key) for a run.

get_code_version()

Retrieves the global 'Code Identity' using the Git Commit SHA.

This uses GitPython directly to avoid subprocess overhead and parsing fragility.
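A hedged sketch of reading the commit SHA through GitPython follows. The function name `code_version` and the "unknown" fallback are assumptions. How Consist handles a missing repository or a dirty working tree is not described here.

```python
def code_version(repo_path: str = ".") -> str:
    """Return the HEAD commit SHA, or "unknown" if it cannot be determined."""
    try:
        import git  # GitPython; imported lazily so the sketch degrades gracefully

        repo = git.Repo(repo_path, search_parent_directories=True)
        return repo.head.commit.hexsha
    except Exception:
        # Assumption: use a sentinel when GitPython, git, or a repository is missing.
        return "unknown"


print(code_version())
```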

compute_callable_hash(func, strategy='module', extra_deps=None)

Computes a hash for a specific Python function/callable.

This enables granular caching: the hash tracks only the code relevant to the function, so unrelated changes elsewhere in the repository do not invalidate its cache.

Strategies:

  • 'source': Hashes ONLY the function's source code (via inspect.getsource). Use this for pure functions with no external dependencies.
  • 'module': Hashes the entire file (.py) where the function is defined. This is the robust "in-between": it captures helper functions and constants in the same file, but ignores changes in unrelated files.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| func | Callable | The function to hash. | required |
| strategy | str | The hashing strategy ("source" or "module"). | "module" |
| extra_deps | List[str] | List of additional file paths (relative to project root) that this function depends on. Their content will be mixed into the hash. | None |
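The difference between the two strategies can be sketched with the inspect module. The name `callable_hash` is hypothetical, and extra_deps handling is omitted.

```python
import hashlib
import inspect
import textwrap
from pathlib import Path
from typing import Callable


def callable_hash(func: Callable, strategy: str = "module") -> str:
    """Hash a callable by its own source ("source") or its defining file ("module")."""
    if strategy == "source":
        payload = inspect.getsource(func).encode("utf-8")
    elif strategy == "module":
        payload = Path(inspect.getsourcefile(func)).read_bytes()
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return hashlib.sha256(payload).hexdigest()


# textwrap.dedent is a pure-Python stdlib function, so its source is available.
source_hash = callable_hash(textwrap.dedent, "source")
module_hash = callable_hash(textwrap.dedent, "module")
print(source_hash != module_hash)
```

The "source" digest changes only when the function body changes, while the "module" digest also changes when any other code in the same file is edited.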

compute_config_hash(config, exclude_keys=None)

Generate a deterministic cryptographic hash of a configuration structure.

This method implements canonical configuration hashing by normalizing Python dictionaries, lists, and Pydantic models into a stable canonical form. It explicitly addresses the 'NumPy Problem' (NumPy scalars such as np.int64 are not JSON-serializable by default) by converting NumPy numeric values into native Python types, and it ensures order-independence through recursive key sorting.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| config | Dict[str, Any] | The configuration dictionary or Pydantic model to hash. | required |
| exclude_keys | Optional[List[str]] | A collection of keys to be omitted from the identity calculation (e.g., non-deterministic timestamps or local file paths). | None |

Returns:

| Type | Description |
| --- | --- |
| str | A SHA256 hex digest representing the canonical configuration identity. |
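A rough sketch of this kind of config hashing is shown below. The names `config_hash` and `_clean` are hypothetical, Pydantic models are not handled, and applying exclude_keys at every nesting level is an assumption. NumPy values are detected by their `.tolist()` method so the sketch does not need NumPy installed.

```python
import hashlib
import json
from typing import Any, Iterable, Optional


def _clean(obj: Any, exclude: frozenset) -> Any:
    """Drop excluded keys and coerce values into plain JSON types."""
    if isinstance(obj, dict):
        return {str(k): _clean(v, exclude) for k, v in obj.items() if k not in exclude}
    if isinstance(obj, (list, tuple)):
        return [_clean(v, exclude) for v in obj]
    if hasattr(obj, "tolist"):
        # NumPy scalars and arrays expose .tolist(), which yields native Python values.
        return _clean(obj.tolist(), exclude)
    return obj


def config_hash(config: dict, exclude_keys: Optional[Iterable[str]] = None) -> str:
    """SHA256 hex digest of the cleaned config, dumped with sorted keys."""
    cleaned = _clean(config, frozenset(exclude_keys or ()))
    text = json.dumps(cleaned, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


with_timestamp = config_hash({"rate": 0.1, "started_at": "12:00"}, exclude_keys=["started_at"])
without_timestamp = config_hash({"rate": 0.1})
print(with_timestamp == without_timestamp)
```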

compute_run_config_hash(*, config, model, year=None, iteration=None, cache_epoch=None, cache_version=None)

Compute a config hash for a run, mixing in identity-relevant run fields.

The Tracker persists config for human inspection, but the caching identity must also include run-context fields that often change what a run means, such as year, iteration, and cache versioning.
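One plausible way to mix these fields into the hash is sketched here. The payload layout and the name `run_config_hash` are assumptions, and the config is assumed to be already JSON-serializable.

```python
import hashlib
import json
from typing import Any, Optional


def run_config_hash(*, config: dict, model: str, year: Optional[int] = None,
                    iteration: Optional[int] = None, cache_epoch: Any = None,
                    cache_version: Any = None) -> str:
    """Hash the config together with run-context fields that affect identity."""
    payload = {
        "config": config,
        "model": model,
        "year": year,
        "iteration": iteration,
        "cache_epoch": cache_epoch,
        "cache_version": cache_version,
    }
    text = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


h2020 = run_config_hash(config={"rate": 0.1}, model="demand", year=2020)
h2021 = run_config_hash(config={"rate": 0.1}, model="demand", year=2021)
print(h2020 != h2021)
```

With this layout, two runs that share a config but differ in year or iteration no longer share a cache entry.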

compute_input_hash(inputs, path_resolver=None, signature_lookup=None)

Synthesize a deterministic hash representing the aggregate state of all input artifacts.

This calculation is fundamental to the construction of the Merkle DAG. It incorporates the unique identities of all inputs to ensure that upstream data mutations correctly invalidate downstream caches.

The identity of an input is determined by its provenance:

1. Managed Artifacts: If the artifact was produced by a previous Consist run, its identity is derived from the producing run's cryptographic signature.
2. Exogenous Files: If the input is a raw file, its identity is established through a physical content or metadata hash of the filesystem object.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| inputs | List[Artifact] | A collection of Artifact instances representing the run's dependencies. | required |
| path_resolver | Optional[Callable[[str], str]] | A function to resolve virtualized URIs to absolute filesystem paths, required for hashing exogenous files. | None |
| signature_lookup | Optional[Callable[[str], Optional[str]]] | A function to retrieve the run signatures of producing runs, facilitating Merkle-link construction. | None |

Returns:

| Type | Description |
| --- | --- |
| str | A SHA256 hex digest representing the combined input identity. |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If an exogenous file requires hashing but no path_resolver is provided. |
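The provenance rule described above can be sketched as follows. `InputArtifact` is a hypothetical stand-in for Consist's Artifact, the `uri` and `run_id` field names are assumptions, and sorting the parts to make the result order-independent is also an assumption.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class InputArtifact:
    """Hypothetical stand-in for Artifact, with only the fields used here."""
    uri: str
    run_id: Optional[str] = None  # set when a previous run produced the artifact


def input_hash(
    inputs: List[InputArtifact],
    path_resolver: Optional[Callable[[str], str]] = None,
    signature_lookup: Optional[Callable[[str], Optional[str]]] = None,
) -> str:
    parts = []
    for art in inputs:
        signature = None
        if art.run_id is not None and signature_lookup is not None:
            signature = signature_lookup(art.run_id)
        if signature is not None:
            # Managed artifact: link to the producing run's signature.
            parts.append(f"{art.uri}:run:{signature}")
            continue
        if path_resolver is None:
            raise ValueError(f"path_resolver is required to hash {art.uri!r}")
        # Exogenous file: hash the bytes found at the resolved path.
        with open(path_resolver(art.uri), "rb") as fh:
            digest = hashlib.sha256(fh.read()).hexdigest()
        parts.append(f"{art.uri}:file:{digest}")
    # Assumption: sort so the result does not depend on input order.
    return hashlib.sha256("\n".join(sorted(parts)).encode("utf-8")).hexdigest()


managed = InputArtifact(uri="inputs://population.csv", run_id="run-1")
h1 = input_hash([managed], signature_lookup=lambda run_id: "sig-1")
h2 = input_hash([managed], signature_lookup=lambda run_id: "sig-2")
print(h1 != h2)
```

When the upstream run's signature changes, the downstream input hash changes with it, which is the Merkle-style invalidation the text describes.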

compute_file_checksum(file_path)

Computes a cryptographic identifier for a given file or directory based on the configured hashing strategy.

This method is critical for establishing the unique identity of raw file-based inputs to a Consist run. It supports two main strategies: 'full' (content-based) and 'fast' (metadata-based), and handles both single files and directories.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| file_path | str | The absolute path to the file or directory for which to compute the checksum. | required |

Returns:

| Type | Description |
| --- | --- |
| str | A SHA256 hex digest representing the checksum or identity of the file/directory. |

Raises:

| Type | Description |
| --- | --- |
| FileNotFoundError | If the specified file_path does not exist on the filesystem. |

Warns:

| Type | Description |
| --- | --- |
| UserWarning | If 'full' content hashing is performed on a directory, as this can be computationally expensive for large directories. |
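To illustrate the two strategies for a single file, here is a minimal sketch. The name `file_checksum` is hypothetical, directories are not handled, and using size plus modification time for 'fast' is an assumption about which metadata is hashed.

```python
import hashlib
import os
import tempfile


def file_checksum(file_path: str, strategy: str = "full") -> str:
    """Return a SHA256 hex digest of a file's content ("full") or metadata ("fast")."""
    if not os.path.exists(file_path):
        raise FileNotFoundError(file_path)
    hasher = hashlib.sha256()
    if strategy == "fast":
        # Assumption: size and modification time stand in for the content.
        stat = os.stat(file_path)
        hasher.update(f"{stat.st_size}:{stat.st_mtime_ns}".encode("utf-8"))
    else:
        with open(file_path, "rb") as fh:
            # Read in 1 MiB chunks so large files are not loaded into memory at once.
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                hasher.update(chunk)
    return hasher.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "data.txt")
    with open(sample, "wb") as fh:
        fh.write(b"hello")
    full = file_checksum(sample, "full")
    fast = file_checksum(sample, "fast")
```

The 'fast' variant never reads the file body, so it is cheap for large inputs but can miss an edit that preserves both size and modification time.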

label_for_hash_input(path)

Create a stable, human-friendly label for a hash input path.

This is used when recording inputs that are represented only by their hash (e.g., "hash-only" config inputs). The method prefers a path that is relative to project_root for readability and portability, and falls back to the original string if it cannot be made relative.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| path | Union[str, Path] | A file or directory path used as a hash input. | required |

Returns:

| Type | Description |
| --- | --- |
| str | A string label suitable for logs and provenance records. |
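The relative-path preference with fallback might look like the sketch below. Here project_root is passed as an argument for self-containment (in the text it is a property of the manager), and the name `label_for` is hypothetical.

```python
from pathlib import Path
from typing import Union


def label_for(path: Union[str, Path], project_root: Union[str, Path]) -> str:
    """Prefer a project-relative label; fall back to the original string."""
    try:
        return Path(path).resolve().relative_to(Path(project_root).resolve()).as_posix()
    except ValueError:
        # Raised when path is not located under project_root.
        return str(path)


inside = label_for("/proj/configs/base.yaml", "/proj")
outside = label_for("/elsewhere/other.yaml", "/proj")
print(inside, outside)
```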

digest_path(path, *, ignore_dotfiles=True, allowlist=None, hashing_strategy_override=None)

Digest a file or directory with optional filtering.

  • Files: delegated to compute_file_checksum (honors hashing_strategy).
  • Directories: deterministic digest over relative paths + (content or metadata).

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| path | Union[str, Path] | The filesystem path to be digested. If a directory is provided, the method computes an aggregate identity across all contained files based on the active hashing strategy. | required |
| ignore_dotfiles | bool | If True, ignore any file whose relative path includes a component starting with '.'. | True |
| allowlist | Optional[List[str]] | If provided, only include files whose relative path matches at least one glob pattern. | None |
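A sketch of the directory case with both filters is given below. It always hashes file contents (the metadata variant is omitted), uses fnmatch for the glob patterns, and the name `digest_dir` is hypothetical. The real matching rules may differ.

```python
import fnmatch
import hashlib
import tempfile
from pathlib import Path
from typing import List, Optional


def digest_dir(root: str, ignore_dotfiles: bool = True,
               allowlist: Optional[List[str]] = None) -> str:
    """Digest a directory over sorted relative paths and file contents."""
    base = Path(root)
    hasher = hashlib.sha256()
    for file in sorted(p for p in base.rglob("*") if p.is_file()):
        rel = file.relative_to(base)
        if ignore_dotfiles and any(part.startswith(".") for part in rel.parts):
            continue
        if allowlist is not None and not any(
            fnmatch.fnmatch(rel.as_posix(), pattern) for pattern in allowlist
        ):
            continue
        # Mix in the relative path, then a fixed-length digest of the content.
        hasher.update(rel.as_posix().encode("utf-8"))
        hasher.update(hashlib.sha256(file.read_bytes()).digest())
    return hasher.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("alpha")
    before = digest_dir(tmp)
    Path(tmp, ".cache").write_text("scratch")
    after = digest_dir(tmp)  # the dotfile is skipped, so the digest is unchanged
    with_dotfiles = digest_dir(tmp, ignore_dotfiles=False)
    only_yaml = digest_dir(tmp, allowlist=["*.yaml"])
```

Sorting the files before hashing keeps the digest independent of filesystem traversal order.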

compute_hash_inputs_digests(hash_inputs, *, ignore_dotfiles=True, allowlist=None)

Compute digests for external "hash-only" config inputs (files or directories).

Items may be:

  • A path (str/Path): label derived from project-relative path when possible.
  • A (label, path) tuple: explicit label.
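The two accepted item shapes can be normalized into (label, path) pairs before digesting. This sketch uses a hypothetical helper `split_hash_inputs` and simply reuses the path string as the label for bare paths, rather than deriving a project-relative label.

```python
from pathlib import Path
from typing import Iterable, List, Tuple, Union

HashInput = Union[str, Path, Tuple[str, Union[str, Path]]]


def split_hash_inputs(hash_inputs: Iterable[HashInput]) -> List[Tuple[str, Path]]:
    """Normalize each item to a (label, path) pair."""
    pairs = []
    for item in hash_inputs:
        if isinstance(item, tuple):
            label, path = item  # explicit label supplied by the caller
        else:
            label, path = str(item), item  # assumption: the path string doubles as the label
        pairs.append((label, Path(path)))
    return pairs


pairs = split_hash_inputs(["configs/base.yaml", ("network", "data/network")])
print(pairs)
```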