Inference-Time Semantic Budget
by Nick Clark | Published March 27, 2026
Every inference call carries an explicit semantic budget along four dimensions: tokens consumed, fan-out across parallel branches, depth of recursive expansion, and elapsed time. Exceeding any dimension results in non-execution, not degraded execution: from the caller's point of view, the inference simply does not occur. The budget for each call is derived from the agent's capability tier, binding what an agent is permitted to think about to who that agent is permitted to be.
Mechanism
The semantic budget is a four-dimensional resource envelope attached to every inference call as a first-class argument. The four dimensions are token count, fan-out, depth, and time. Token count caps the cumulative size of prompts, intermediate reasoning, and outputs measured in the model's native tokenization. Fan-out caps the number of parallel candidate branches the agent may explore at any single decision point. Depth caps the recursive nesting permitted within a single call, including tool invocations that themselves invoke inference. Time caps the wall-clock duration the call may consume from initiation to commit.
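The envelope described above can be pictured as a small immutable value passed with every call. The following is a minimal sketch, not the disclosed implementation; the names `Envelope`, `fits_within`, and `infer`, and all numeric values, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Envelope:
    """Four-dimensional budget attached to one inference call (hypothetical type)."""
    tokens: int    # cumulative prompt, intermediate reasoning, and output tokens
    fan_out: int   # parallel candidate branches at any single decision point
    depth: int     # recursive nesting, including tool-invoked inference
    time_ms: int   # wall-clock duration from initiation to commit

    def fits_within(self, other: "Envelope") -> bool:
        """True when no dimension of self exceeds the same dimension of other."""
        return (
            self.tokens <= other.tokens
            and self.fan_out <= other.fan_out
            and self.depth <= other.depth
            and self.time_ms <= other.time_ms
        )


def infer(prompt: str, budget: Envelope) -> str:
    """Placeholder call site: the budget is a required, explicit argument."""
    return f"would run {prompt!r} under {budget}"


# Illustrative values only.
requested = Envelope(tokens=2_000, fan_out=4, depth=2, time_ms=1_500)
allocation = Envelope(tokens=8_000, fan_out=4, depth=3, time_ms=5_000)
```

Freezing the dataclass reflects the idea that a call cannot widen its own envelope once it has been issued.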
The budget is checked at admission and again at every internal expansion point. At admission, the runtime compares the requested or implied envelope against the caller's current allocation; if any dimension exceeds the allocation the call is refused before any model work begins. During execution, each token generated, each fan-out branch instantiated, each level of recursive descent, and each elapsed millisecond is debited from the running totals. The first dimension to be exhausted terminates the call. Termination from budget exhaustion is structurally distinct from termination by completion: the runtime emits a budget-exceeded event with the offending dimension, the partial state is discarded, and no result is returned to the caller.
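The two checkpoints described above, admission and per-expansion debiting, might be sketched as follows. This is an illustrative sketch under stated assumptions: `admit`, `Meter`, `AdmissionRefused`, and `BudgetExceeded` are hypothetical names, and every dimension is treated as a simple running total for brevity.

```python
from dataclasses import dataclass, field

DIMENSIONS = ("tokens", "fan_out", "depth", "time_ms")


class AdmissionRefused(Exception):
    """Raised before any model work when the request exceeds the allocation."""
    def __init__(self, dimension):
        super().__init__(f"refused at admission: {dimension}")
        self.dimension = dimension


class BudgetExceeded(Exception):
    """Raised during execution; carries the offending dimension."""
    def __init__(self, dimension):
        super().__init__(f"budget exceeded: {dimension}")
        self.dimension = dimension


@dataclass
class Meter:
    caps: dict
    used: dict = field(default_factory=lambda: dict.fromkeys(DIMENSIONS, 0))

    def debit(self, dimension, amount=1):
        """Debit one dimension; the first dimension to overrun ends the call."""
        self.used[dimension] += amount
        if self.used[dimension] > self.caps[dimension]:
            raise BudgetExceeded(dimension)


def admit(requested, allocation):
    """Compare the requested envelope to the allocation before any work begins."""
    for dim in DIMENSIONS:
        if requested[dim] > allocation[dim]:
            raise AdmissionRefused(dim)
    return Meter(caps=dict(requested))
```

A runtime built along these lines would catch `BudgetExceeded` at the call boundary, discard the partial state, and emit the event rather than return anything to the caller.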
This distinction is the central structural property of the mechanism. Conventional resource limiters truncate output, return best-effort partial results, or silently degrade quality when limits are approached. The semantic budget treats exhaustion as non-execution: the inference either runs to completion within its envelope or it does not run at all, from the caller's point of view. The semantic state of the agent is not contaminated by partial reasoning, no commitments are recorded, and the lineage shows a refused call rather than a fulfilled call with degraded output. Callers that need fault tolerance must request a larger envelope or decompose the work into smaller calls; they cannot rely on an undersized call to produce something usable.
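One way to picture the all-or-nothing property is to run the call against a scratch copy of the agent's state and commit only on completion. The sketch below is an assumption about how such isolation could be arranged, not a description of the disclosed runtime; `run_all_or_nothing` and the `notes` field are hypothetical, and only the token dimension is modelled.

```python
import copy


class BudgetExceeded(Exception):
    """Signals that the token cap was overrun during execution."""


def run_all_or_nothing(state, steps, token_cap):
    """Apply (cost, note) steps to a scratch copy of state.

    Returns (new_state, notes) when every step fits within token_cap,
    or (state, None) when the cap is overrun. The original state is
    never mutated, so an exhausted call leaves no trace in it.
    """
    scratch = copy.deepcopy(state)
    used = 0
    try:
        for cost, note in steps:
            used += cost
            if used > token_cap:
                raise BudgetExceeded()
            scratch.setdefault("notes", []).append(note)
    except BudgetExceeded:
        return state, None  # the partial scratch state is discarded
    return scratch, list(scratch.get("notes", []))
```

A caller that receives `None` has no truncated result to fall back on; it must reissue under a larger envelope or decompose the work.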
Each call's envelope is derived from the capability tier assigned to the calling agent or sub-agent. Capability tiers are declared in policy and bind a principal to a maximum permitted envelope across all four dimensions. A request that exceeds the principal's tier is refused at admission, regardless of whether the underlying infrastructure has spare capacity. This is what binds budget to identity: the limit is not a function of load or pricing but of who the principal is permitted to be at the moment of the call.
Operating Parameters
The capability-tier table is the primary operator-facing parameter. Each tier declares a token cap, a fan-out cap, a depth cap, and a time cap. Tiers are partially ordered by dominance: one tier is higher than another only if its cap is at least as large on every dimension, so an upgrade is unambiguous and a downgrade cannot widen any dimension. The table is held in the policy reference and is subject to the same admission, audit, and change-control rules as any other policy artefact.
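The dominance property of the tier table lends itself to a mechanical check at policy-load time. The following sketch assumes a hypothetical three-tier table; the tier names and cap values are invented for illustration and do not come from the disclosure.

```python
# Hypothetical tier table; names and values are illustrative only.
TIERS = {
    "basic":    {"tokens": 2_000,  "fan_out": 2, "depth": 1, "time_ms": 2_000},
    "standard": {"tokens": 16_000, "fan_out": 4, "depth": 3, "time_ms": 10_000},
    "elevated": {"tokens": 64_000, "fan_out": 8, "depth": 5, "time_ms": 60_000},
}


def dominates(higher, lower):
    """True when higher is at least as large as lower on every dimension."""
    return all(higher[dim] >= lower[dim] for dim in lower)


def validate_chain(table, order):
    """Check that each tier in order dominates the tier before it."""
    return all(
        dominates(table[upper], table[lower])
        for lower, upper in zip(order, order[1:])
    )
```

A table in which a nominally higher tier is narrower on even one dimension fails the check, which is the condition that would make an upgrade ambiguous.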
A delegation rule specifies how an agent's envelope is split when it invokes a sub-agent or tool that itself performs inference. The default is strict subtraction: the parent's remaining envelope is reduced by the child's envelope at the moment of delegation, and the child cannot exceed the parent's allocation along any dimension. Alternative rules, such as proportional reservation or named pool sharing, are permitted and are declared in the same policy artefact, with each delegation event logged so that an auditor can reconstruct who paid for which inference.
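The default strict-subtraction rule can be sketched as a single function that checks the child's request against the parent's remaining envelope, subtracts it, and logs the event. This is an illustrative reading; `delegate`, the log shape, and the decision to subtract all four dimensions uniformly are assumptions.

```python
def delegate(parent_remaining, child_request, log, parent_id, child_id):
    """Reserve child_request out of parent_remaining (strict subtraction).

    The check runs over every dimension before anything is subtracted,
    so a rejected delegation leaves the parent's envelope unchanged.
    """
    for dim, amount in child_request.items():
        if amount > parent_remaining[dim]:
            raise ValueError(f"child exceeds parent allocation on {dim}")
    for dim, amount in child_request.items():
        parent_remaining[dim] -= amount
    # Each delegation is logged so an auditor can reconstruct who paid.
    log.append({"payer": parent_id, "payee": child_id, "envelope": dict(child_request)})
    return dict(child_request)
```

Proportional reservation or named pool sharing would replace the subtraction step while keeping the same check and the same log entry.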
A refund policy governs whether an aborted call's partial consumption is returned to the caller's pool. The default is no refund: tokens read, branches expanded, and time elapsed are debited even when the call ends in a budget-exceeded event, because allowing refunds would create an incentive to probe the boundary. Operators may declare a refund policy that returns a fraction of the unused portion of each dimension when the abort is caused by an external interrupt rather than by budget exhaustion, but the refund must itself be deterministic and auditable.
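A deterministic refund rule might look like the sketch below. It takes one possible reading of the policy, in which the refundable quantity is the reserved-but-unused portion of each dimension; the function name, the cause labels, and the use of exact fractions with floor rounding are all assumptions made for illustration.

```python
import math
from fractions import Fraction


def refund(reserved, used, cause, fraction=Fraction(0)):
    """Return the per-dimension amount credited back to the caller's pool.

    Budget exhaustion never refunds. An external interrupt refunds a
    declared fraction of the unused portion, rounded down so that the
    result is an exact, replayable integer.
    """
    if cause == "budget_exceeded":
        return {dim: 0 for dim in reserved}
    if cause == "external_interrupt":
        return {
            dim: math.floor((reserved[dim] - used[dim]) * fraction)
            for dim in reserved
        }
    raise ValueError(f"unknown abort cause: {cause}")
```

Exact rational arithmetic avoids floating-point drift, which keeps two replays of the same abort in agreement.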
A measurement contract declares precisely how each dimension is counted: which tokens count, when fan-out is debited, what constitutes a depth increment, and which clock supplies the elapsed-time reading. The contract must be deterministic with respect to input so that two replays of the same call produce the same debits. Operators select a measurement contract at deployment and may not change it without a corresponding policy version bump, which guarantees that historical lineage records remain interpretable.
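The replay property can be illustrated with a contract object that names its version, its token-counting rule, and an injected clock. Everything in this sketch is hypothetical: the whitespace tokenizer stands in for a model's native tokenization, and the scripted clock stands in for recorded clock readings replayed from lineage.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class MeasurementContract:
    version: str                        # bumped whenever the counting rules change
    count_tokens: Callable[[str], int]  # which tokens count, and how


def scripted_clock(readings: Iterable[int]) -> Callable[[], int]:
    """Clock that replays recorded millisecond readings in order."""
    it = iter(readings)
    return lambda: next(it)


def measure(contract, prompt, output, clock):
    """Compute the debits for one call under the given contract and clock."""
    start = clock()
    tokens = contract.count_tokens(prompt) + contract.count_tokens(output)
    end = clock()
    return {"contract": contract.version, "tokens": tokens, "time_ms": end - start}


# Whitespace splitting is a stand-in tokenizer, used here only for illustration.
contract_v1 = MeasurementContract(version="v1", count_tokens=lambda text: len(text.split()))
```

Stamping each debit record with the contract version is what lets a later reader interpret historical records after the counting rules have changed.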
Alternative Embodiments
A single-tenant embodiment binds capability tiers directly to user identities, with each user assigned a tier on the basis of authentication, subscription level, or trust score. Inference calls inherit the caller's tier, sub-agent delegations subtract from the caller's envelope, and budget exhaustion produces an explicit non-execution event surfaced to the user.
A multi-tenant embodiment groups identities into roles or organizations, with tiers attached to the role rather than to the individual. The mechanism is unchanged at the call site; only the tier-resolution step differs. This embodiment is suited to enterprise deployments where budget governance is a function of seat licensing rather than per-user provisioning.
An adversarial-isolation embodiment assigns sharply restricted tiers to inputs sourced from untrusted channels, even when the agent's own identity would otherwise license a wider envelope. The reduction is structural: the agent processes the input under the lower tier and is incapable of widening the envelope mid-call to accommodate adversarial complexity. This embodiment is suited to deployments where prompt injection or content amplification is a known threat.
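The structural narrowing in this embodiment can be sketched as a per-dimension minimum between the agent's tier and a cap attached to the input channel, returned as a read-only view so the call cannot widen it. The function name and all cap values are hypothetical.

```python
from types import MappingProxyType


def effective_envelope(agent_tier, channel_cap):
    """Narrow the agent's tier to the channel's cap on every dimension.

    The result is a read-only mapping, so code running inside the call
    cannot raise any cap in response to the input it is processing.
    """
    narrowed = {dim: min(agent_tier[dim], channel_cap[dim]) for dim in agent_tier}
    return MappingProxyType(narrowed)


# Illustrative values only.
agent_tier = {"tokens": 16_000, "fan_out": 4, "depth": 3, "time_ms": 10_000}
untrusted_channel = {"tokens": 1_000, "fan_out": 1, "depth": 1, "time_ms": 20_000}
envelope = effective_envelope(agent_tier, untrusted_channel)
```

Taking the minimum rather than substituting the channel cap outright means an untrusted channel can only narrow the envelope, never widen it beyond the agent's own tier.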
A real-time embodiment makes time the dominant dimension by setting it well below the natural exhaustion point of the other three. The inference must complete within a hard deadline that is enforced at the runtime level, with the other dimensions serving primarily as defense in depth. This embodiment is suited to control loops and to interactive systems where latency itself is a safety property.
A degenerate embodiment fixes each of the four caps at the same constant value in every tier, producing a system in which budget governance is uniform. This is structurally identical to the general mechanism with its tier surface collapsed and is admissible when an operator's risk posture is uniform across principals.
Enforcement Semantics
The non-execution semantics deserve elaboration because they are easy to misread as ordinary timeouts. When a call exceeds any of its four dimensions, the runtime treats the call as if it had never been admitted from the perspective of the agent's semantic state. No partial output reaches the caller, no intermediate reasoning is committed to memory, and no downstream system that depends on the agent's outputs receives a degraded signal. The lineage records the corresponding event (a refused-call event at admission, a budget-exceeded event during execution) with the offending dimension, any recorded debits, and the call's full input context, so that an operator can decide whether to reissue the call under a wider envelope, decompose it into smaller sub-calls, or accept the non-execution as a valid outcome.
Refused calls are first-class outcomes of the inference pipeline. A caller that needs a result must be prepared to handle a refusal; conversely, a caller that does not handle refusals will simply lack the output, never an unsafe approximation of one. This forces a discipline on system designers: graceful degradation under load is an explicit application-layer concern, decomposed into smaller calls that each fit within tier, rather than an implicit runtime behaviour that might silently produce truncated reasoning.
The runtime distinguishes structurally between three terminal call states. A completed call returns its result and records a completion event with its final debits. A refused call records a refused-call event before any model work begins, when admission detects that the request exceeds tier. An exhausted call records a budget-exceeded event with the offending dimension during execution, after some debits have accrued but before any result is returned. Audit consumers can filter on these three event types to answer different questions: refusals indicate a tier-policy mismatch, exhaustions indicate an envelope-policy mismatch, and completions provide the baseline against which the other two are evaluated.
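The three terminal states and the audit filter over them can be sketched with an enumeration and a counter. The enum member names, the event dictionary shape, and the sample events are hypothetical; only the three event types themselves come from the text.

```python
from collections import Counter
from enum import Enum


class Terminal(Enum):
    COMPLETED = "completion"
    REFUSED = "refused-call"        # stopped at admission, before model work
    EXHAUSTED = "budget-exceeded"   # stopped during execution, after some debits


def classify(admitted, offending_dimension):
    """Map a call's history onto exactly one terminal state."""
    if not admitted:
        return Terminal.REFUSED
    if offending_dimension is not None:
        return Terminal.EXHAUSTED
    return Terminal.COMPLETED


def summarize(events):
    """Count events per terminal state for an audit consumer."""
    return Counter(event["state"] for event in events)


# Illustrative audit log.
events = [
    {"call": "c1", "state": classify(True, None)},
    {"call": "c2", "state": classify(False, "depth")},
    {"call": "c3", "state": classify(True, "time_ms")},
    {"call": "c4", "state": classify(True, None)},
]
```

A high count of refusals relative to completions would point an operator at the tier policy, while a high count of exhaustions would point at the envelopes being requested.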
Composition
The semantic budget composes with affect-admissibility through shared multipliers: when affect-admissibility raises the gate to its elevated regime, fan-out and depth multipliers are applied to the active budget, which both narrows exploration and accelerates exhaustion. The two mechanisms cooperate so that affective stress and budget pressure produce consistent rather than conflicting effects on inference behaviour.
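The shared-multiplier composition might be sketched as below. The regime label, the multiplier values of one half, and the floor rounding are assumptions made purely for illustration; the text specifies only that fan-out and depth multipliers are applied when the gate is in its elevated regime.

```python
import math
from fractions import Fraction

# Hypothetical multipliers for the elevated regime; values are illustrative.
ELEVATED_MULTIPLIERS = {"fan_out": Fraction(1, 2), "depth": Fraction(1, 2)}


def apply_regime(budget, regime):
    """Return the active budget after regime multipliers are applied.

    Only fan-out and depth are scaled; tokens and time pass through
    unchanged. Results are floored so caps remain integers.
    """
    if regime != "elevated":
        return dict(budget)
    return {
        dim: math.floor(value * ELEVATED_MULTIPLIERS.get(dim, 1))
        for dim, value in budget.items()
    }
```

With multipliers below one, the same workload reaches its fan-out and depth caps sooner, which is the sense in which the elevated regime both narrows exploration and accelerates exhaustion.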
It composes with the dream state by sharing the same energy accounting unit. A dream session debits the same budget pool as waking inference, which prevents an agent from rehearsing under richer conditions than it could ever execute under, and which allows operators to balance dreaming and waking effort within a single tier rather than maintaining parallel limits.
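A single pool debited by both modes can be sketched as follows. The class name, the mode labels, and the restriction to the token dimension are simplifying assumptions for illustration.

```python
class SharedPool:
    """One budget pool debited by both waking and dream inference.

    Only the token dimension is modelled, for brevity.
    """

    def __init__(self, tokens):
        self.remaining = tokens
        self.ledger = []  # (mode, tokens) pairs, in debit order

    def debit(self, mode, tokens):
        if mode not in ("waking", "dream"):
            raise ValueError(f"unknown mode: {mode}")
        if tokens > self.remaining:
            raise RuntimeError(f"pool exhausted during {mode} inference")
        self.remaining -= tokens
        self.ledger.append((mode, tokens))

    def spent_by(self, mode):
        """Total tokens debited under one mode."""
        return sum(amount for m, amount in self.ledger if m == mode)
```

Because both modes draw on the same balance, effort spent dreaming is directly visible as effort unavailable for waking inference within the tier.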
It composes with the lineage system by attaching the envelope, the running debits, and the exhaustion event, if any, to every call record. This makes budget consumption a first-class auditable artefact: a reviewer can answer not only what an agent decided but how much semantic resource it was permitted to spend deciding it, and whether any decision was foreclosed by budget exhaustion that an operator might want to revisit.
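A call record carrying the envelope, the running debits, and any exhaustion event could be shaped as in the sketch below. The field names and the JSON serialization are hypothetical choices made for illustration.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class CallRecord:
    call_id: str
    envelope: dict   # the caps the call was permitted
    debits: dict     # what the call actually consumed
    outcome: str     # "completion", "refused-call", or "budget-exceeded"
    offending_dimension: Optional[str] = None


def to_lineage_line(record):
    """Serialize with sorted keys so identical records produce identical text."""
    return json.dumps(asdict(record), sort_keys=True)


def foreclosed(records):
    """List calls whose decision was cut off by budget exhaustion."""
    return [r.call_id for r in records if r.outcome == "budget-exceeded"]
```

The `foreclosed` query corresponds to the reviewer's question of which decisions were cut off by exhaustion and might merit a reissue under a wider envelope.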
Prior-Art Distinction
Rate limiting, token quotas, timeouts, and recursion caps are individually conventional. The semantic budget is not a claim over any one of these. It is a claim over the structural combination of four dimensions enforced together as a single envelope, with exhaustion treated as non-execution rather than as degraded execution, and with the envelope tied to a capability tier that is itself a policy-declared identity attribute.
Conventional resource limiters allow partial results, retry-after responses, or silent quality degradation. They typically govern only one or two dimensions independently, treat budget as an infrastructure or pricing concern rather than as an identity-bound capability, and lack a deterministic non-execution semantics. The semantic-budget mechanism distinguishes itself by enforcing all four dimensions as a single admission decision, by binding the envelope to a capability tier, and by guaranteeing that an over-budget call leaves the agent's semantic state untouched.
Disclosure Scope
This article discloses the inference-time semantic budget as a structural mechanism of the cognition patent's inference-control surface. The disclosure is independent of the specific model family, tokenizer, or runtime employed and applies to any inference pipeline whose admission and expansion points can be instrumented to enforce a four-dimensional envelope. The disclosure covers single-tenant, multi-tenant, adversarial-isolation, real-time, and degenerate-uniform embodiments, and the tier surface described is illustrative rather than enumerative.
Implementations that permit partial results when any dimension is exhausted, that decouple the envelope from a policy-declared capability tier, or that govern fewer than the four declared dimensions fall outside the scope of the mechanism as disclosed. Implementations that enforce all four dimensions, treat exhaustion as non-execution, and bind the envelope to capability tier remain within scope regardless of the underlying inference engine.