Mechanism

Depth-selective gradient routing is the mechanism by which a training example's gradient contribution is integrated into specific layers of a model based on the semantic properties of the content that produced the gradient. The disclosure positions the semantic execution substrate at the boundary between the forward-pass loss computation and the backward-pass gradient application. Gradients are computed as in conventional training, but the gradient signal is modulated, gated, or selectively routed across model depth based on an admissibility determination produced by the substrate. The substrate does not alter the mathematical machinery of gradient computation or optimizer updates; it governs which gradient signals reach which layers and with what magnitude, based on the semantic properties of the training content.

The routing is driven by a training depth profile, which is a structured data object comprising a per-layer or per-block contribution weight vector. For a model comprising L layers or B blocks, the depth profile specifies a weight value for each layer or block, where the weight governs the magnitude of the gradient signal from the associated training example that is permitted to influence the parameters of that layer or block. A weight value of one permits the full gradient signal to reach the layer. A weight value of zero prevents any gradient signal from the example from reaching the layer. A weight value between zero and one attenuates the gradient signal by the specified factor, and a weight value greater than one amplifies it. The per-layer weight vector collectively defines the depth-aggregation profile: the shape of the example's contribution across the model's depth dimension.
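The weight semantics above can be illustrated with a minimal sketch. The names `DepthProfile` and `route_gradient` are hypothetical, and gradients are represented as plain lists of floats per layer rather than as tensors; the disclosure does not prescribe this representation.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class DepthProfile:
    """Hypothetical container: one contribution weight per layer or block.

    Index 0 is taken to be the shallowest layer (an assumption of this sketch).
    """
    weights: Tuple[float, ...]


def route_gradient(layer_grads: List[List[float]],
                   profile: DepthProfile) -> List[List[float]]:
    """Scale one example's per-layer gradients by the profile weights.

    1.0 passes the gradient unchanged, 0.0 blocks it, values between
    0 and 1 attenuate it, and values above 1 amplify it.
    """
    if len(layer_grads) != len(profile.weights):
        raise ValueError("profile must have one weight per layer")
    return [[w * g for g in grads]
            for w, grads in zip(profile.weights, layer_grads)]


profile = DepthProfile((1.0, 0.5, 0.0, 2.0))
example_grads = [[2.0, -4.0]] * 4  # identical gradient at each of 4 layers
print(route_gradient(example_grads, profile))
```

In this sketch the third layer receives no gradient from the example, while the fourth receives twice the computed gradient.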

Entropy-Band-Indexed Depth Profiles

Each entropy band recognized by the platform's entropy extraction pipeline is associated with a training depth profile that governs how content from that band is weighted across the layers of the model. Training examples with low semantic entropy, that is, content already well represented in the model's existing knowledge, receive shallow depth profiles. Training examples with high semantic entropy, that is, content introducing novel semantic structure, receive deep depth profiles. The depth profile for high-entropy content specifies elevated contribution weights for the model's deeper layers, where multi-step abstraction and novel pattern synthesis occur. The depth profile for low-entropy content specifies elevated weights for the shallower layers and attenuated or zero weights for the deeper layers, preventing routine knowledge from consuming deep representational capacity.
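A band-indexed lookup might be sketched as follows. The band names, the cut points on a normalized entropy score, and every weight value are illustrative assumptions for a four-block model; the disclosure specifies only that low-entropy content favors shallow layers and high-entropy content favors deep ones.

```python
# Hypothetical band-to-profile table for a 4-block model (index 0 = shallowest).
# All numbers are illustrative, not values from the disclosure.
BAND_PROFILES = {
    "low": (1.0, 0.5, 0.0, 0.0),    # shallow profile: deep blocks zeroed
    "mid": (0.5, 1.0, 1.0, 0.5),    # assumed intermediate band
    "high": (0.25, 0.5, 1.0, 1.0),  # deep profile: deep blocks elevated
}


def classify_band(entropy: float, low_cut: float = 0.3,
                  high_cut: float = 0.7) -> str:
    """Map a normalized entropy score in [0, 1] to a band name.

    The cut points are assumptions made for this sketch.
    """
    if entropy < low_cut:
        return "low"
    if entropy >= high_cut:
        return "high"
    return "mid"


def profile_for(entropy: float) -> tuple:
    """Return the depth profile associated with the example's entropy band."""
    return BAND_PROFILES[classify_band(entropy)]


print(profile_for(0.1))  # shallow profile
print(profile_for(0.9))  # deep profile
```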

The association between entropy bands and depth profiles is not fixed for the duration of training. It is derived from the platform's slope-band structure and is adapted as the model's internal entropy distribution evolves. During early training, when the model's representations are undifferentiated, the depth profiles may be broad, with examples from all bands contributing to all layers at approximately uniform weighting. As training progresses and the model's representations stratify, the depth profiles narrow. A profile adaptation engine monitors the model's internal entropy distribution at defined checkpoints, evaluating the entropy characteristics of the layer-wise representations and adjusting the depth profiles to maintain alignment between the entropy band structure of the corpus and the entropy structure of the model's internal representations.
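The broad-to-narrow progression can be pictured with a simple interpolation. This is only a sketch: the text does not give the adaptation rule, so the scalar `stratification` score and the linear blend below are assumptions standing in for whatever the profile adaptation engine actually computes at each checkpoint.

```python
def adapt_profile(target: tuple, stratification: float) -> tuple:
    """Blend a uniform profile toward a band's target profile.

    stratification is a hypothetical score in [0, 1]: 0.0 stands for
    undifferentiated early-training representations (uniform weights),
    1.0 for fully stratified representations (the target profile applies).
    """
    if not 0.0 <= stratification <= 1.0:
        raise ValueError("stratification must lie in [0, 1]")
    uniform = 1.0
    return tuple((1.0 - stratification) * uniform + stratification * w
                 for w in target)


low_band_target = (1.0, 0.5, 0.0, 0.0)
for s in (0.0, 0.5, 1.0):
    print(s, adapt_profile(low_band_target, s))
```

Early in training the low-entropy band still contributes to every block; as the score rises, its deep-block weights fall toward zero.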

Three Routing Techniques

The depth-selective aggregation mechanism is implemented through one or more of three complementary techniques: gated residual connections, attention-based depth selection, and layer-specific scaling factors. Each technique achieves the same functional result, modulating the contribution of a training example's gradient signal to specific layers according to the depth profile, but operates through a different structural mechanism, enabling adaptation to different model architectures.

The gated residual connection technique augments each residual connection with a gating coefficient derived from the example's depth profile. The coefficient multiplies the gradient signal flowing through the residual connection, attenuating or amplifying it according to the per-layer weight. A coefficient of zero at a layer prevents the example's gradient from influencing that layer through the residual pathway; a coefficient of one permits full flow. The gated residual connections are applied during the backward pass only, so the forward pass proceeds with standard residual connections and the model's inference behavior is unaffected.
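One possible reading of backward-only gating is sketched below on a toy scalar residual stack with hand-written backpropagation. The layer form `h + w * h`, the function names, and the choice to apply the gate to the skip-path gradient leaving each layer are all assumptions of this sketch; the forward pass never consults the gates.

```python
def forward(x: float, weights: list) -> list:
    """Standard residual stack: h_next = h + w * h. Gates are not used here."""
    activations = [x]
    for w in weights:
        h = activations[-1]
        activations.append(h + w * h)
    return activations


def backward(activations: list, weights: list, gates: list,
             upstream: float = 1.0) -> list:
    """Manual backward pass with a gate on each skip-path gradient.

    Returns the gradient of the output with respect to each weight.
    """
    grads_w = [0.0] * len(weights)
    g = upstream  # gradient arriving at the output of the top layer
    for l in reversed(range(len(weights))):
        h = activations[l]
        grads_w[l] = g * h                  # gradient for this layer's weight
        g = gates[l] * g + weights[l] * g   # gated skip path + branch path
    return grads_w


weights = [0.5, 0.5]
acts = forward(1.0, weights)
print(acts[-1])                             # forward output, independent of gates
print(backward(acts, weights, [1.0, 1.0]))  # ungated gradients
print(backward(acts, weights, [1.0, 0.0]))  # skip path of the top layer closed
```

Closing the top layer's gate leaves the forward output unchanged and removes only the portion of the lower layer's gradient that would have travelled along the residual pathway.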

The attention-based depth selection technique modulates the attention mechanism in transformer architectures. The attention computation at each layer receives the depth-profile weight for the current example at the current layer, which scales the gradient flowing through the attention weights and value projections during the backward pass without altering the forward-pass attention computation.

The layer-specific scaling factor technique is model-architecture-agnostic: it applies a scalar multiplier to the gradient signal at each layer boundary during the backward pass, multiplying the gradient by the depth-profile weight before the gradient is accumulated into the layer's buffer. Because it requires only the ability to intercept and scale the gradient at each layer boundary, the scaling factor technique applies to any gradient-trained architecture, including convolutional, recurrent, and mixture-of-experts networks.

Block Granularity and Optimizer Interaction

In large networks, the depth-selective aggregation mechanism operates at block-level granularity rather than at individual-layer granularity. In networks comprising many layers, per-layer depth profiles would require unwieldy weight vectors and impose impractical evaluation overhead. Layers are instead grouped into blocks, each typically a contiguous sequence of layers performing a coherent computational function, and the depth profile specifies per-block aggregation weights that govern the gradient flow to all layers within the block uniformly. A block may comprise a single transformer layer, a group of residual layers, a single attention head, or any other architecturally meaningful grouping. Block-level granularity balances the expressiveness of depth-selective control against the tractability of profile evaluation and gradient modulation.
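The uniform application of a block weight to every layer in the block can be sketched as a simple expansion. The function name and the representation of blocks as a list of layer counts are assumptions made for illustration.

```python
def expand_block_weights(block_sizes: list, block_weights: list) -> list:
    """Expand per-block weights to per-layer weights.

    block_sizes[i] is the number of contiguous layers in block i; every
    layer in a block receives that block's weight.
    """
    if len(block_sizes) != len(block_weights):
        raise ValueError("need exactly one weight per block")
    per_layer = []
    for size, weight in zip(block_sizes, block_weights):
        per_layer.extend([weight] * size)
    return per_layer


# Six layers grouped into three blocks of 2, 3, and 1 layers.
print(expand_block_weights([2, 3, 1], [1.0, 0.5, 0.0]))
```

Here a three-entry profile governs six layers, which is the tractability benefit the block grouping is meant to provide.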

The mechanism interacts with conventional gradient accumulation as follows. In conventional mini-batch training, the gradient for each parameter is computed for each example and accumulated, typically by summation or averaging, across the batch before the optimizer applies the update. In depth-selective training, the gradient for each example is first scaled by the depth-profile weight at each block before being accumulated into the block's gradient buffer. This per-example scaling occurs after the per-example gradient computation and before the batch-level accumulation, ensuring each example's contribution to each block is individually governed by its profile. The mechanism is compatible with standard optimizers including stochastic gradient descent, Adam, AdamW, and their variants: it does not alter the optimizer's update rule, only the gradient signal the optimizer receives. The optimizer operates on the modulated gradient buffer as though it were a standard buffer, applying its learning rate, momentum, and weight decay computations without modification.
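The ordering described here, per-example scaling first and batch accumulation second, can be sketched in a few lines. The function names are hypothetical, averaging is chosen over summation arbitrarily, and plain SGD stands in for any optimizer that consumes the buffer unmodified.

```python
def accumulate_batch(per_example_grads: list, per_example_profiles: list) -> list:
    """Scale each example's block gradients by its profile, then average.

    per_example_grads[i][b] is example i's gradient (list of floats) for block b.
    per_example_profiles[i][b] is example i's weight for block b.
    """
    n_examples = len(per_example_grads)
    n_blocks = len(per_example_grads[0])
    buffers = [[0.0] * len(per_example_grads[0][b]) for b in range(n_blocks)]
    for grads, profile in zip(per_example_grads, per_example_profiles):
        for b in range(n_blocks):
            for k, g in enumerate(grads[b]):
                buffers[b][k] += profile[b] * g  # scale before accumulating
    return [[v / n_examples for v in buf] for buf in buffers]


def sgd_step(params: list, buffers: list, lr: float) -> list:
    """Unmodified SGD update applied to the modulated gradient buffers."""
    return [[p - lr * g for p, g in zip(ps, gs)]
            for ps, gs in zip(params, buffers)]


grads = [[[2.0], [2.0]],   # example 0: gradient for block 0 and block 1
         [[4.0], [4.0]]]   # example 1
profiles = [(1.0, 0.0),    # example 0 is confined to block 0
            (1.0, 1.0)]    # example 1 reaches both blocks
buffers = accumulate_batch(grads, profiles)
print(buffers)
print(sgd_step([[1.0], [1.0]], buffers, lr=0.5))
```

The optimizer step has no knowledge of the profiles; it sees only buffers in which example 0 never contributed to the second block.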

Policy-Governed Retention and Structural Prevention

The aggregation mechanism is integrated with the platform's content governance to implement policy-governed knowledge retention and suppression at the architectural level. Content admitted under time-limited licensing agreements is trained with a suppressed depth profile, in which the contribution weights for the deeper layers are set to zero or near-zero, confining the example's influence to the shallower layers. The rationale is that content whose license may expire or be revoked should not be encoded in the deep layers, where removal would require invasive procedures such as full retraining; confining it to shallow layers makes its influence structurally separable from the model's deep knowledge. Content from the governed exclusion corpus is structurally prevented from any integration through a zero-weight depth profile, which sets the contribution weight to zero at every layer. The exclusion is recorded in the training provenance log as a governed event.

The disclosure distinguishes this from post-hoc unlearning, which operates after training by approximating the influence of content on the parameters and applying corrective updates. Post-hoc unlearning is inherently approximate because the influence of any single example is diffused across many parameters through the non-linear dynamics of optimization. Depth-selective routing instead implements structural prevention: content whose governance profile restricts deep integration is prevented from deep integration at training time, before the gradient reaches the deep layers. There is no need to unlearn what was never deeply learned. The prevention is exact rather than approximate, since a zero weight at a given block means no gradient from the example reaches that block's parameters, and it is deterministic, auditable, and reversible by changing the depth profile. The same policy objects that govern agent behavior and inference-time admissibility are consulted during training: when a policy expires or is revoked, the substrate applies a zero-weight profile and records the event. Where multiple policies apply to a single example, the substrate resolves the applicable profile by applying the most restrictive policy.
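Policy resolution at training time might look like the sketch below. The `Policy` record, the event format, and in particular the measure of restrictiveness (smallest total weight) are assumptions; the text says only that the most restrictive policy applies and that an expired or revoked policy yields a zero-weight profile and a logged event.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Policy:
    """Hypothetical policy object carrying the depth profile it permits."""
    name: str
    profile: Tuple[float, ...]
    active: bool = True  # False once the policy has expired or been revoked


def resolve_profile(policies: List[Policy], n_blocks: int,
                    log: list) -> Tuple[float, ...]:
    """Pick the depth profile for one training example."""
    for policy in policies:
        if not policy.active:
            # Expired or revoked: structural prevention at every block.
            log.append({"event": "zero_weight_applied", "policy": policy.name})
            return (0.0,) * n_blocks
    if not policies:
        # Assumption: ungoverned content trains at full depth.
        return (1.0,) * n_blocks
    # Assumption: "most restrictive" means the smallest total weight.
    return min(policies, key=lambda p: sum(p.profile)).profile


log = []
licensed = Policy("time_limited_license", (1.0, 0.5, 0.0, 0.0))
general = Policy("general_use", (1.0, 1.0, 1.0, 1.0))
print(resolve_profile([general, licensed], 4, log))  # suppressed profile wins

revoked = Policy("time_limited_license", (1.0, 0.5, 0.0, 0.0), active=False)
print(resolve_profile([general, revoked], 4, log))   # zero-weight profile
print(log)
```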

Differential Privacy by Architectural Confinement

The aggregation mechanism is applied to implement per-content privacy guarantees that are more targeted than global differential privacy. In conventional differential privacy for machine learning, Gaussian or Laplace noise is added uniformly to all gradient signals regardless of the sensitivity of the content that produced them. The noise is calibrated to the worst-case privacy requirement across the entire corpus, which degrades accuracy for content that does not require protection.

The depth-selective privacy mechanism routes privacy-sensitive content's gradient contributions primarily to shallow layers, where representations are generic, distributed, and inherently less memorizable, while suppressing contribution to deep layers, where representations are specific, localized, and more susceptible to memorization. The depth profile for privacy-sensitive content specifies high gating coefficients at shallow blocks and low or zero coefficients at deep blocks, confining the content's influence to the model's generic representational capacity. The guarantee is structural rather than statistical: the content is protected not by noise injection but by architectural confinement, and the model cannot memorize what it was not permitted to encode in memorizable layers. Because the per-content guarantee is independent of the privacy requirements of other content, non-sensitive content may be trained with full-depth profiles, eliminating the accuracy-privacy tradeoff inherent in global differential privacy.

Provenance and Memorization Detection

The substrate records a depth-aggregation profile and a per-layer contribution weight for each training example in the training provenance log, alongside the entropy band classification, slope position, governance record, and content provenance record. The per-layer contribution weight records the actual gradient magnitude that reached each block after depth-selective modulation, accounting for any dynamic adjustments made by the profile adaptation engine. These records enable training-level memorization detection. When model output at inference time is flagged as exhibiting high similarity to a known training artifact, a reverse provenance query retrieves the depth-aggregation profiles and contribution weights of the corresponding examples and classifies the similarity as one of three cases: shallow memorization, where a suppressed profile confined influence to shallow layers; deep memorization, where a full-depth or deep-weighted profile permitted influence to reach the deeper layers; or absent memorization, where the log contains no record of the content.
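The three-way classification can be sketched as a lookup against the provenance log. The log layout (a dictionary keyed by artifact identifier), the `deep_start` index marking where the deep blocks begin, and the zero threshold are all assumptions of this sketch.

```python
def classify_memorization(provenance_log: dict, artifact_id: str,
                          deep_start: int, threshold: float = 0.0) -> str:
    """Classify a flagged similarity as 'shallow', 'deep', or 'absent'.

    provenance_log maps an artifact id to a record whose "contribution"
    entry lists the gradient magnitude that reached each block.
    Blocks at index deep_start and above are treated as deep.
    """
    record = provenance_log.get(artifact_id)
    if record is None:
        return "absent"  # no record of the content in the log
    deep_contributions = record["contribution"][deep_start:]
    if any(c > threshold for c in deep_contributions):
        return "deep"    # some gradient reached the deeper blocks
    return "shallow"     # influence was confined to shallow blocks


log = {
    "licensed-article": {"contribution": [0.8, 0.4, 0.0, 0.0]},
    "research-paper": {"contribution": [0.2, 0.5, 0.9, 1.0]},
}
for artifact in ("licensed-article", "research-paper", "unknown-text"):
    print(artifact, classify_memorization(log, artifact, deep_start=2))
```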

The depth profile records also support liability allocation. Because pre-training and fine-tuning content are integrated through distinct depth profiles, their contributions occupy distinguishable layer-block regions, enabling an attribution system to resolve which parameter regions were activated during a challenged output and which provenance chain those regions belong to.

Disclosure Scope

Depth-selective gradient routing is disclosed in U.S. Application No. 19/647,395 and its international counterpart. The disclosed mechanism comprises the training depth profile as a per-layer or per-block contribution weight vector; the entropy-band-indexed association adapted by the profile adaptation engine; the three implementation techniques of gated residual connections, attention-based depth selection, and layer-specific scaling factors; the block-level per-example scaling that precedes batch accumulation; the suppressed and zero-weight profiles that effect policy-governed retention and structural prevention; the differential-privacy-by-confinement application; and the provenance records that support memorization detection and attribution. This article describes that disclosed mechanism. The scope extends to network architectures not enumerated that use gradient-based optimization, to optimizer variants whose update rule operates on the modulated gradient buffer without modification, and to embodiments in pre-training, fine-tuning, and on-device adaptation contexts, provided the depth at which each example's gradient contribution is integrated is governed by the semantic properties of the content that produced the gradient.