Mechanism

Provenance tracing is the training-time recording mechanism by which the semantic execution substrate, operating within the training loop, captures a structured trail for every training iteration. The disclosure positions the substrate at the boundary between the forward-pass loss computation and the backward-pass gradient application, where it evaluates each training batch, or each training example within a batch, for admissibility before the example's contribution is permitted to affect the model's parameters. Each admission, each rejection, and each depth-profile modulation that the substrate renders is written to a training provenance log. The log is the training-time analog of the lineage field maintained for semantic agents, for inference processes, and for discovery traversals elsewhere in the platform: it transforms training from an opaque optimization procedure, in which the relationship between training data and model behavior is unknowable after the fact, into an auditable, traceable, governance-verifiable process.

The substrate does not alter the mathematical machinery of gradient computation or optimizer updates. It governs which gradient signals reach which layers and with what magnitude, and it records what it governed. Every influence pathway between training content and the model's layer structure is therefore reconstructable from the log, because the depth profile applied to each example, the per-layer contribution weight that actually reached each block, and the policy object that authorized the example are all written down at the moment the decision is made.
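The gating described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the disclosed implementation: gradients are plain Python lists grouped by layer block, the depth profile is one scalar weight per block, and the names GateRecord and gate_and_record are hypothetical. The gradient values are only scaled, and the record is written at the moment the scaling is applied.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class GateRecord:
    example_id: str
    policy_id: str               # policy object that authorized the example
    depth_profile: List[float]   # per-block contribution weights applied
    realized_norms: List[float]  # gradient L2 norm that reached each block


def gate_and_record(log, example_id, policy_id, block_grads, depth_profile):
    """Scale each block's gradient by its weight and log what was applied."""
    if len(block_grads) != len(depth_profile):
        raise ValueError("need one contribution weight per layer block")
    # The gradient values themselves are computed elsewhere; only their
    # per-block magnitude is modulated here.
    gated = [[w * g for g in grads] for grads, w in zip(block_grads, depth_profile)]
    norms = [math.sqrt(sum(g * g for g in grads)) for grads in gated]
    log.append(GateRecord(example_id, policy_id, list(depth_profile), norms))
    return gated


# Hypothetical example: two layer blocks, the second fully suppressed.
log = []
gated = gate_and_record(log, "ex-1", "policy-A", [[3.0, 4.0], [1.0, 2.0]], [1.0, 0.0])
```

A block given a zero weight receives a zero gradient, and the log entry shows both the weight that was requested and the magnitude that actually arrived.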

What the Log Records

For each training batch or training example, the training provenance log records a defined set of structured data elements. These comprise: an entropy band classification indicating the semantic complexity and information density of the content as determined by the platform's entropy extraction pipeline; a slope position indicating the content's position within the platform's trust-slope hierarchy; a depth aggregation profile, namely the per-block contribution weight vector that was applied to the example's gradient signal; a per-layer contribution weight recording the actual gradient magnitude that reached each layer block after depth-selective modulation, accounting for any dynamic adjustment made by the profile adaptation engine; a governance record identifying the policy object that authorized the example's admission and the policy object that determined its depth profile; a content provenance record identifying the source, acquisition pathway, chain of custody, and semantic metadata of the content; and an admissibility determination record indicating whether the example was admitted, rejected, or admitted with a modified depth profile, together with the reason for any modification or rejection.

Because the depth profile and the realized contribution weights are part of every entry, the log preserves not only whether a given piece of content was used but how deeply it was integrated and where in the model's layer structure its influence was permitted to reach.
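The recorded elements listed above map naturally onto a structured record. The sketch below is one possible shape, assuming string labels for the entropy band and slope position and string identifiers for policy objects; the field names, the type choices, and the validation rules are illustrative assumptions rather than anything the disclosure specifies.

```python
from dataclasses import dataclass
from typing import Dict, List

DETERMINATIONS = {"admitted", "rejected", "admitted_modified"}


@dataclass(frozen=True)
class ContentProvenance:
    source: str
    acquisition_pathway: str
    chain_of_custody: List[str]
    semantic_metadata: Dict[str, str]


@dataclass(frozen=True)
class ProvenanceEntry:
    entropy_band: str                 # assumed to be a label such as "mid"
    slope_position: str               # assumed label within the trust-slope hierarchy
    depth_profile: List[float]        # per-block weights applied to the gradient
    layer_contributions: List[float]  # magnitude that actually reached each block
    admission_policy_id: str          # policy that authorized admission
    depth_policy_id: str              # policy that determined the depth profile
    content: ContentProvenance
    determination: str
    reason: str = ""

    def __post_init__(self):
        if self.determination not in DETERMINATIONS:
            raise ValueError(f"unknown determination: {self.determination}")
        # A rejection or a modified profile must carry its reason.
        if self.determination != "admitted" and not self.reason:
            raise ValueError("rejection or modification requires a reason")
        if len(self.depth_profile) != len(self.layer_contributions):
            raise ValueError("profile and contributions must cover the same blocks")
```

Keeping the requested profile and the realized contributions as separate fields is what lets a later query distinguish what policy asked for from what the profile adaptation engine actually delivered.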

Append-Only, Tamper-Resistant Structure

The training provenance log is structured as a chronologically ordered, append-only record. Each entry is timestamped, sequentially numbered, and annotated with the training epoch, iteration, and batch index at which the entry was generated. The append-only structure makes the log tamper-resistant: entries cannot be retroactively modified, deleted, or reordered without producing detectable inconsistencies in the sequential numbering and timestamp sequence. The log may be periodically sealed using the cryptographic sealing infrastructure disclosed in the cross-referenced governance disclosure, producing tamper-evident checkpoints that enable third-party verification of the log's integrity. The disclosure does not specify a particular ledger structure or external anchoring scheme beyond this sealing; the integrity guarantee derives from the append-only ordering and the optional sealing rather than from any specific cryptographic construction.
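A minimal sketch of such a log follows. Because the disclosure does not specify a sealing construction, the SHA-256 digest used in seal is purely an assumed stand-in for the cross-referenced sealing infrastructure, and the class and function names are hypothetical. The consistency check uses only the sequence numbers and timestamps that the text names.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class LogEntry:
    seq: int
    timestamp: float
    epoch: int
    iteration: int
    batch_index: int
    payload: dict


class ProvenanceLog:
    """Exposes append and read only; there is no update or delete method."""

    def __init__(self):
        self._entries = []

    def append(self, timestamp, epoch, iteration, batch_index, payload):
        if self._entries and timestamp < self._entries[-1].timestamp:
            raise ValueError("timestamp precedes the previous entry")
        entry = LogEntry(len(self._entries), timestamp, epoch,
                         iteration, batch_index, dict(payload))
        self._entries.append(entry)
        return entry

    def entries(self):
        return tuple(self._entries)


def find_inconsistencies(entries):
    """Report gaps or reordering in sequence numbers and timestamps."""
    problems = []
    for pos, entry in enumerate(entries):
        if entry.seq != pos:
            problems.append(f"position {pos}: unexpected sequence number {entry.seq}")
        if pos > 0 and entry.timestamp < entries[pos - 1].timestamp:
            problems.append(f"position {pos}: timestamp goes backwards")
    return problems


def seal(entries):
    """Assumed checkpoint: a SHA-256 digest over the serialized entries."""
    blob = json.dumps([asdict(e) for e in entries], sort_keys=True)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()
```

A third party holding an earlier seal can recompute the digest over the same prefix of the log and compare the two values; deleting or reordering entries is caught by find_inconsistencies even without a seal.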

Forward and Reverse Provenance Queries

The log supports post-training provenance queries that reconstruct influence pathways between training content and model capabilities, and the disclosure defines exactly two query forms. A forward query begins with a training example or a class of training content and traces the depth profile, contribution weights, and governance decisions that governed that content's integration into the model, producing a record of which layer blocks were influenced by the content and with what magnitude. A reverse query begins with a model behavior or capability observed at inference time and traces backward through the log to identify the training content whose depth profiles encompassed the layer blocks that are active during the observed behavior.

The disclosure is explicit about the limit of the reverse query: it does not definitively attribute a model behavior to specific training content, because the non-linear dynamics of gradient-based optimization preclude exact attribution. What the reverse query produces is the set of training content that was structurally permitted to influence the relevant layer blocks, a bounded attribution set that is substantially narrower than the full training corpus. The narrowing is a direct consequence of depth-selective aggregation: content excluded from a block by a zero contribution weight cannot have influenced that block, so the reverse query can rule it out.
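The two query forms and the narrowing effect can be illustrated with a toy log. The sketch assumes the log has been reduced to a mapping from a content identifier to its recorded per-block contribution weights; the function names and the sample data are hypothetical. Consistent with the limit stated above, the reverse query returns the set of content permitted to reach the active blocks, not an attribution.

```python
from typing import Dict, Iterable, List, Set


def forward_query(log: Dict[str, List[float]], content_id: str) -> Dict[int, float]:
    """Layer blocks the content influenced, with the recorded weight for each."""
    weights = log.get(content_id, [])
    return {block: w for block, w in enumerate(weights) if w != 0.0}


def reverse_query(log: Dict[str, List[float]], active_blocks: Iterable[int]) -> Set[str]:
    """Content structurally permitted to influence any of the active blocks."""
    active = set(active_blocks)
    return {
        content_id
        for content_id, weights in log.items()
        if any(weights[b] != 0.0 for b in active if b < len(weights))
    }


# Hypothetical four-block model with three pieces of training content.
toy_log = {
    "news-feed": [1.0, 0.5, 0.0, 0.0],
    "open-license": [1.0, 1.0, 1.0, 1.0],
    "restricted": [0.3, 0.0, 0.0, 0.0],
}
```

For a behavior whose active blocks are 2 and 3, the reverse query on this toy log returns only "open-license": the other two items carry zero weights on those blocks and are ruled out.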

Content Anchoring of Provenance Records

The content provenance record within the log operates in conjunction with content anchoring, a mechanism by which content derives computable identity from its own structural entropy rather than from externally attached metadata, watermarks, or registry entries. When training content enters the substrate, the substrate evaluates the content's structural entropy signature to determine whether the content has a verifiable anchored identity, namely an identity derived from the content's own structural properties that persists across format conversions, transformations, and platform boundaries. Content with a verified anchored identity receives enriched provenance records that include the anchor identity, enabling later reverse queries to trace model capabilities back to specific anchored content regardless of how the content was acquired or transformed before entering the training pipeline. Content without a verified anchored identity is flagged in the log as provenance-incomplete, and governance policy may restrict the depth profile for such content to shallow layers, preventing deep integration of content whose origin and chain of custody cannot be structurally verified.
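The policy consequence of a missing anchor can be sketched as follows. How an anchor identity is derived from structural entropy is not described here, so the sketch takes the result of that verification as an input (anchor_id, or None when verification failed). The function name, the shallow_blocks cutoff, and the restrict_unanchored switch are assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple


def apply_anchor_policy(
    depth_profile: List[float],
    anchor_id: Optional[str],
    shallow_blocks: int,
    restrict_unanchored: bool = True,
) -> Tuple[List[float], Dict[str, object]]:
    """Return the depth profile to use and the anchoring part of the record."""
    if anchor_id is not None:
        # Verified anchor: the provenance record is enriched with the identity.
        return list(depth_profile), {"anchor_id": anchor_id, "provenance_incomplete": False}

    record = {"anchor_id": None, "provenance_incomplete": True}
    if not restrict_unanchored:
        # Policy may choose to flag the content without restricting it.
        return list(depth_profile), record

    # Confine unanchored content to the first `shallow_blocks` blocks.
    restricted = [w if block < shallow_blocks else 0.0
                  for block, w in enumerate(depth_profile)]
    return restricted, record
```

The restriction is optional in the sketch because the text says only that governance policy may restrict such content, not that it must.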

Provenance-Driven Memorization Detection

The provenance log and the depth-selective aggregation records together enable a training-level memorization detection mechanism. When the rights-grade governance layer, an external content identification service, or a human reviewer flags model output at inference time as exhibiting high similarity to a known training artifact, the memorization detection module initiates a reverse provenance query. The query identifies the training examples that correspond to the flagged artifact and retrieves their depth-aggregation profiles, per-layer contribution weights, and governance records. From these the system can determine which layer blocks the similar content was permitted to influence, the magnitude that reached each block, the entropy band and policy scope under which the content was admitted, and whether it was admitted under a suppressed or a full-depth profile.

The module produces a structured assessment that classifies the similarity into one of three categories named in the disclosure. Shallow memorization means the content was trained with a suppressed depth profile confining its influence to shallower layers, consistent with proper governance of time-limited or rights-restricted content. Deep memorization means the content was trained with a full-depth or deep-weighted profile, which may reflect a policy-compliant deep integration of freely licensed content or, alternatively, a governance failure in which depth-restricted content was inadvertently trained at full depth. Absent memorization means the log contains no record of the similar content, indicating the similarity is not a consequence of direct training on the artifact. The assessment is reported to the rights-grade governance layer so that the inference substrate can incorporate training-time provenance into its admissibility determination.
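The three-way classification can be sketched as a small function over the depth profiles retrieved by the reverse query. The boundary between shallow and deep blocks is not given in the text, so deep_start is an assumed parameter, and treating any nonzero weight at or beyond it as deep integration is a simplifying assumption.

```python
from typing import List, Sequence


def classify_memorization(matching_profiles: Sequence[List[float]], deep_start: int) -> str:
    """Classify a flagged similarity as 'absent', 'shallow', or 'deep'.

    matching_profiles: depth profiles of the logged examples that correspond
    to the flagged artifact (empty if the log has no such record).
    deep_start: index of the first block counted as deep (an assumption).
    """
    if not matching_profiles:
        return "absent"
    for profile in matching_profiles:
        # Any influence permitted at or beyond deep_start counts as deep.
        if any(w != 0.0 for w in profile[deep_start:]):
            return "deep"
    return "shallow"
```

A "deep" result does not by itself say whether the integration was policy-compliant; as noted above, that depends on whether the governing policy permitted full depth, which requires comparing the result against the governance record.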

Compliance Auditing and Inference-Time Integration

The log enables compliance auditing against content governance requirements. When a content owner inquires whether their content was used in training, the log provides a definitive answer: either the content was present, in which case its provenance record, depth profile, and contribution weights are available, or it was absent, in which case the log confirms its absence. When a regulatory authority requires evidence that restricted content was not deeply integrated, the log provides the depth profile records showing the contribution weights applied to the content, demonstrating that its gradient signal was confined to the layers and magnitudes specified by the governing policy. When a governance auditor requires evidence that depth restrictions were applied correctly, the log provides the admissibility determination records, the policy objects consulted, and the hierarchical resolution logic applied for each example.
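Two of these audit questions reduce to simple lookups, sketched below under assumptions: the log is taken to be a mapping from content identifier to its recorded entry, the governing policy is represented as a per-block ceiling on contribution weight, and both function names are hypothetical.

```python
from typing import Dict, List


def content_usage_report(log: Dict[str, dict], content_id: str) -> dict:
    """Answer a content owner's question: was this content used, and how?"""
    entry = log.get(content_id)
    if entry is None:
        return {"present": False}
    return {"present": True, **entry}


def audit_depth_compliance(realized: List[float], ceiling: List[float]) -> List[int]:
    """Return the blocks where the realized weight exceeded the policy ceiling."""
    if len(realized) != len(ceiling):
        raise ValueError("realized weights and policy ceiling must cover the same blocks")
    return [block for block, (r, c) in enumerate(zip(realized, ceiling)) if r > c]
```

An empty list from audit_depth_compliance is the evidence a regulator would look for: no block received more of the content's gradient signal than the policy allowed.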

The provenance records are also made available to the inference-time substrate. When the inference substrate evaluates a candidate transition, it may query the log to determine the training-time governance profile of the knowledge grounding that transition: the policy objects that governed the content's admission, the depth profile applied, the content provenance record, and any temporal validity constraints. If the grounding content was admitted under a policy that has since expired, the substrate may apply heightened scrutiny or reject the transition; if the policy was revoked by the content owner, the substrate may reject and raise a governance alert; if the content was admitted under a suppressed depth profile indicating a rights restriction, the substrate may require an attribution annotation. In the governed fine-tuning case, the fine-tuning corpus is recorded as a governed fine-tuning provenance record structurally distinct from the pre-training provenance, so that, for a challenged output, it can be determined whether the output was produced by parameter regions primarily influenced by pre-training content or by fine-tuning content.
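The inference-time branching can be sketched as a decision function over the governance profile returned by the log. The text leaves each response optional ("may"), so the sketch fixes one choice per case as an assumption: expired policies lead to heightened scrutiny rather than rejection, and revocation takes precedence over expiry. The dictionary keys and status strings are hypothetical.

```python
def evaluate_grounding(profile: dict) -> dict:
    """Decide how to treat a candidate transition given its grounding profile.

    profile keys (assumed): 'policy_status' in {'active', 'expired', 'revoked'}
    and 'suppressed_depth' (True if admitted under a suppressed depth profile).
    """
    status = profile["policy_status"]
    if status not in {"active", "expired", "revoked"}:
        raise ValueError(f"unknown policy status: {status}")

    if status == "revoked":
        # Revocation by the content owner: reject and raise a governance alert.
        return {"action": "reject", "alert": True, "require_attribution": False}

    # A suppressed depth profile signals a rights restriction on the content.
    attribution = bool(profile["suppressed_depth"])
    if status == "expired":
        return {"action": "heightened_scrutiny", "alert": False,
                "require_attribution": attribution}
    return {"action": "allow", "alert": False, "require_attribution": attribution}
```

A deployment that prefers to reject on expiry would change only the expired branch; the rest of the decision logic is unaffected.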

Disclosure Scope

The cognition filing (U.S. Application No. 19/647,395 and its international counterpart) discloses the training provenance log together with the following: its per-example recorded elements (entropy band classification, slope position, depth aggregation profile, per-layer contribution weight, governance record, content provenance record, and admissibility determination record); its chronologically ordered, append-only, and optionally sealed structure; its forward and reverse provenance queries; the content anchoring of provenance records; the three-category memorization classification (shallow, deep, absent) driven by reverse query; the compliance-audit interfaces; and the supply of training-time governance profiles to the inference-time substrate and to governed fine-tuning provenance. This article describes that disclosed mechanism. The scope extends to embodiments in which the log is realized over different storage representations and to deployment topologies in which the same provenance infrastructure spans the training loop and the inference loop, provided each training decision is recorded as a governed, queryable event and each influence pathway is reconstructable as the bounded set of content structurally permitted to reach the relevant layer blocks.