Multimodal Anti-Gaming Substrate

by Nick Clark | Published March 27, 2026

Skill gating resists gaming through three structural properties combined within a single substrate: asymmetric feedback that returns more information on legitimate progress than on probing, evidence accumulation that demands diverse cross-modal corroboration before a skill unlocks, and structural sandboxing that prevents the unlock condition from being read or written outside the gate's authority. Metrics-only optimization, surface mimicry, and shortcut exploitation cannot satisfy the unlock predicate because the predicate is not expressible as a function of any single observable channel.


Mechanism

The anti-gaming substrate sits beneath every skill gate and mediates the relationship between the candidate's behavior and the gate's unlock state. Three structurally distinct properties act in concert. The first is asymmetric feedback. When a candidate's action contributes legitimate evidence toward a skill's unlock predicate, the gate returns informative feedback that names the dimension of progress and bounds the residual evidence still required. When a candidate's action is consistent with a probing strategy that searches for the unlock condition without genuinely producing the underlying competency, the gate returns only a uniform null acknowledgment. The information gradient that an attacker can extract by repeated probing is bounded near zero, while a legitimate user receives a strong learning signal. This asymmetry is enforced at the gate boundary and is independent of any policy the model running above it may apply.
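The asymmetry described above can be sketched in a few lines. This is only an illustration under stated assumptions: the `Feedback` record, the `PROBING_CUTOFF` value, and the `gate_feedback` function are hypothetical names, and the hard cutoff stands in for whatever classifier or rule set the gate actually uses.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical cutoff above which an action is treated as probing (assumption).
PROBING_CUTOFF = 0.5


@dataclass(frozen=True)
class Feedback:
    acknowledged: bool
    dimension: Optional[str] = None   # named dimension of progress, if any
    residual: Optional[float] = None  # bound on the evidence still required


# Every probing-like action receives this same object, so responses to
# repeated probes are indistinguishable from one another.
NULL_ACK = Feedback(acknowledged=True)


def gate_feedback(probing_likelihood: float, dimension: str,
                  accumulated: float, required: float) -> Feedback:
    """Return informative feedback for legitimate evidence, else a uniform null."""
    if probing_likelihood >= PROBING_CUTOFF or accumulated <= 0.0:
        return NULL_ACK
    residual = max(0.0, required - accumulated)
    return Feedback(acknowledged=True, dimension=dimension, residual=residual)
```

A legitimate contribution learns which dimension advanced and how much evidence remains; a probe learns nothing beyond the fact that the gate received it.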

The second property is evidence accumulation across heterogeneous modalities. Each skill is associated with a manifest that enumerates the evidence classes required for unlock, drawn from distinct modalities such as procedural execution traces, declarative articulation, cross-domain transfer, temporal stability across separated sessions, and interaction with adversarial counterexamples. The unlock predicate is a conjunction over these classes with minimum thresholds and minimum diversity, so satisfying any single class to arbitrary depth cannot substitute for the missing classes. The substrate maintains a per-candidate evidence ledger that is append-only and signed at the gate boundary, so prior evidence cannot be retroactively replayed to satisfy a freshly required class.
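A minimal sketch of such a conjunctive predicate follows. The class names, the `TRIVIAL` floor, and the function signature are assumptions made for illustration; the text does not prescribe a concrete representation.

```python
from typing import Dict

# Hypothetical floor below which a class counts as contributing nothing.
TRIVIAL = 0.05


def unlock_predicate(strengths: Dict[str, float],
                     thresholds: Dict[str, float],
                     min_diversity: int) -> bool:
    """Conjunction over evidence classes with a diversity precondition."""
    # Count classes that contribute non-trivial evidence.
    contributing = sum(1 for c in thresholds if strengths.get(c, 0.0) > TRIVIAL)
    if contributing < min_diversity:
        return False
    # Every required class must meet its own minimum threshold.
    return all(strengths.get(c, 0.0) >= t for c, t in thresholds.items())
```

Because the result is a conjunction, an arbitrarily large score in one class (for example `{"procedural": 100.0}`) leaves the predicate false whenever any other required class is missing.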

Each evidence class is itself defined by a typed schema in the manifest, specifying the canonical form an evidence contribution must take, the verification function that scores its strength, and the decay function that reduces its weight as time passes without corroboration. The verification function is deterministic and audited, so an entry's score is reproducible from its raw form. The decay function ensures that stale evidence cannot indefinitely sustain an unlock state without continued engagement, which is itself an anti-gaming property: a one-time burst of authentic activity cannot purchase permanent access to a skill whose exercise should depend on sustained competency. The evidence ledger is therefore not merely a record but a living substrate whose state evolves under both new contributions and the passage of time.
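The text leaves the form of the decay function open. As one possible instance, the sketch below assumes exponential decay with a per-class half-life; `decayed_weight`, `class_strength`, and the half-life parameter are hypothetical.

```python
from typing import Iterable, Tuple


def decayed_weight(score: float, age_days: float, half_life_days: float) -> float:
    """Weight of one ledger entry after age_days without corroboration."""
    return score * 0.5 ** (age_days / half_life_days)


def class_strength(entries: Iterable[Tuple[float, float]],
                   half_life_days: float) -> float:
    """Sum the decayed weights of (score, age_days) entries for one class."""
    return sum(decayed_weight(s, a, half_life_days) for s, a in entries)
```

Under this assumption, an entry scored 1.0 is worth 0.5 after one half-life and 0.25 after two, so a single burst of activity eventually falls below any fixed threshold unless new evidence arrives.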

The third property is structural sandboxing of the unlock state. The unlock predicate, the evidence ledger, and the gate's internal counters are held inside a sandbox whose interface to the surrounding agent exposes only the verdict (locked, accumulating, unlocked) and the redacted progress vector. Internal state is unreadable and unwritable from outside, so an attacker who fully controls the calling model cannot inspect the predicate to plan a minimal-cost satisfaction strategy, nor write to the ledger to fabricate evidence. The sandbox is enforced by the runtime's privilege boundary; bypassing it requires compromising the runtime itself, which is outside the threat model the gate addresses but inside the threat model addressed by integrity-aware inference and environment binding.

Operating Parameters

The asymmetry coefficient governs how steeply feedback informativeness drops as the gate's classifier estimates the action's probing-likelihood. A high coefficient produces sharp drop-offs that strongly resist optimization but increase false negatives on legitimate but unusual learning paths; a lower coefficient softens this trade-off. The coefficient is a policy parameter and may be tuned per skill, with safety-critical skills receiving sharper asymmetry than exploratory ones.
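To make the trade-off concrete, the sketch below assumes one simple functional form, a power law in the estimated legitimacy of the action. The form and the name `informativeness` are illustrative assumptions, not a specification of the gate.

```python
def informativeness(probing_likelihood: float, coefficient: float) -> float:
    """Fraction of feedback detail released, in [0, 1].

    A larger coefficient makes the drop-off sharper as the estimated
    probing-likelihood rises.
    """
    p = min(1.0, max(0.0, probing_likelihood))  # clamp to a valid probability
    return (1.0 - p) ** coefficient
```

At a probing-likelihood of 0.5, a coefficient of 1 releases half of the feedback detail while a coefficient of 4 releases about 6 percent, which illustrates why sharper settings also penalize legitimate but unusual learning paths that the classifier scores as ambiguous.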

The evidence diversity threshold sets the minimum number of distinct modalities that must contribute non-trivial evidence before unlock is even considered. Per-modality thresholds set the minimum strength within each class. Both are declared in the skill manifest and signed by the policy authority, so they cannot be negotiated by the candidate or the calling model. The temporal stability window enforces a minimum elapsed duration and a minimum number of independent sessions over which evidence must persist, defeating burst attacks that compress a synthetic mastery performance into a single sitting.
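The temporal stability check can be sketched as follows, assuming each piece of evidence is recorded as a (session identifier, timestamp) pair. The function name and parameters are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Sequence, Tuple


def temporally_stable(observations: Sequence[Tuple[str, datetime]],
                      min_span: timedelta, min_sessions: int) -> bool:
    """Require evidence spread over enough sessions and enough elapsed time."""
    if not observations:
        return False
    sessions = {session_id for session_id, _ in observations}
    times = [t for _, t in observations]
    return (len(sessions) >= min_sessions
            and max(times) - min(times) >= min_span)
```

A burst of many observations inside one session fails both conditions, regardless of how strong each observation is.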

Sandbox parameters govern which agent roles may receive verdict notifications, which may read the redacted progress vector, and which may query the manifest. A least-privilege default exposes only the verdict to the cognitive role that consumes the unlocked skill, withholds progress information from the candidate-controlled surface entirely, and exposes manifests only to the governance role.
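One way to express the least-privilege default is a role-to-permission table. The role names and the `Access` flags below are assumptions chosen to mirror the paragraph above; in particular, whether the governance role also receives verdicts is not stated in the text, so the sketch grants it manifest access only.

```python
from enum import Flag, auto


class Access(Flag):
    NONE = 0
    VERDICT = auto()   # locked / accumulating / unlocked
    PROGRESS = auto()  # redacted progress vector
    MANIFEST = auto()  # skill manifest queries


# Hypothetical least-privilege default.
DEFAULT_POLICY = {
    "cognitive": Access.VERDICT,    # consumes the unlocked skill
    "candidate": Access.NONE,       # candidate-controlled surface sees nothing
    "governance": Access.MANIFEST,  # only role allowed to query manifests
}


def allowed(role: str, request: Access, policy=DEFAULT_POLICY) -> bool:
    """Unknown roles fall through to no access at all."""
    return bool(policy.get(role, Access.NONE) & request)
```

Defaulting unknown roles to `Access.NONE` keeps the table fail-closed when a new role is introduced without an explicit policy entry.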

Alternative Embodiments

In a single-process embodiment, the substrate is implemented as a kernel module of the cognitive runtime, with the sandbox enforced by language-level capability isolation. In a multi-process embodiment, the substrate runs as a separate process with the sandbox enforced by operating-system boundaries, and the verdict channel is a typed IPC. In a distributed embodiment, the substrate is hosted on a remote attestation service, and the gate boundary is an authenticated network protocol; this fits multi-tenant platforms where many agents draw from a shared skill catalog.

In a fully deterministic embodiment, the probing-likelihood classifier is replaced by rule-based action typing drawn from the policy reference, eliminating the classifier as an attack surface at the cost of expressive precision. In a learned-classifier embodiment, the classifier is itself audited by a meta-evidence channel that monitors its calibration; drift in the classifier triggers re-attestation of any skills unlocked under the drifting regime.

A graduated embodiment exposes partial unlock states for skills with sub-skills, where each sub-skill has its own manifest and the parent skill's unlock predicate is a structural function of its children. A binary embodiment exposes only locked and unlocked, suited to skills whose granularity is naturally indivisible.
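As a sketch of the graduated case, the code below assumes the simplest structural function, in which the parent unlocks only when every child has unlocked. Other functions (for example, a quorum of children) would fit the same shape; `parent_verdict` is a hypothetical name.

```python
from enum import Enum
from typing import Sequence


class Verdict(Enum):
    LOCKED = "locked"
    ACCUMULATING = "accumulating"
    UNLOCKED = "unlocked"


def parent_verdict(children: Sequence[Verdict]) -> Verdict:
    """Derive a parent skill's verdict from its sub-skills' verdicts."""
    if children and all(v is Verdict.UNLOCKED for v in children):
        return Verdict.UNLOCKED
    # Any progress in any child surfaces as a partial state on the parent.
    if any(v is not Verdict.LOCKED for v in children):
        return Verdict.ACCUMULATING
    return Verdict.LOCKED
```

A binary embodiment corresponds to dropping the `ACCUMULATING` state from the exposed interface.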

Composition

The anti-gaming substrate composes with the narrative-personality field to provide identity-anchored evidence: persistent personality coherence across sessions counts as one evidence class, and impersonation breaks that class. It composes with integrity-aware inference, which guarantees that the model calls producing candidate evidence are themselves provenanced, so a compromised model cannot inject fabricated evidence into the ledger. It composes with the trust slope mechanism: a degrading slope tightens asymmetry and raises diversity thresholds, making unlocks harder to achieve during periods of suspect behavior.

Composition with policy-reference loading ensures that manifest changes propagate atomically: a manifest update that adds a required evidence class causes any in-flight unlocks under the prior manifest to be re-evaluated against the new requirements, with a configurable grace period during which already-unlocked skills remain available pending evidence top-up. This prevents both surprise revocation and silent grandfathering of skills whose evidence basis no longer matches policy.
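The re-evaluation of an already-unlocked skill after a manifest update might look like the sketch below. The function name, the set-based representation of evidence classes, and the returned pair are assumptions for illustration only.

```python
from datetime import datetime, timedelta
from typing import Iterable, Set, Tuple


def reevaluate_unlocked(satisfied: Iterable[str], required: Iterable[str],
                        updated_at: datetime, now: datetime,
                        grace: timedelta) -> Tuple[bool, Set[str]]:
    """Return (skill still available, evidence classes needing top-up)."""
    missing = set(required) - set(satisfied)
    if not missing:
        return True, missing          # evidence basis still matches policy
    in_grace = now - updated_at <= grace
    return in_grace, missing          # available only until the grace period ends
```

Returning the missing classes alongside the availability flag lets the gate report what must be topped up without exposing the rest of the manifest.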

The substrate also composes with downstream skill-bounded action selection. A skill that is unlocked can be revoked by the same substrate when retrospective audit shows that the evidence ledger was satisfied through means later classified as gaming; revocation is itself a logged transition, and any actions taken under the revoked skill are flagged for review.
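Revocation as a logged transition can be sketched with a small record type. `SkillRecord`, its fields, and `revoke` are hypothetical, and a real ledger would be append-only and signed rather than an in-memory list.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SkillRecord:
    name: str
    state: str = "unlocked"
    actions: List[str] = field(default_factory=list)      # actions taken under the skill
    transitions: List[Tuple[str, str, str]] = field(default_factory=list)
    flagged: List[str] = field(default_factory=list)      # actions queued for review


def revoke(record: SkillRecord, reason: str) -> SkillRecord:
    """Lock the skill, log the transition, and flag prior actions for review."""
    record.transitions.append((record.state, "locked", reason))
    record.state = "locked"
    record.flagged.extend(record.actions)
    return record
```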

Distinction Over Prior Art

Reward modeling and reinforcement-learning-from-human-feedback approaches address gaming by penalizing detected exploitation patterns post hoc, but they expose the optimization target as a measurable scalar that sufficiently capable optimizers can game. Capability evaluation suites assess static benchmarks whose unlock conditions are knowable to the candidate, inviting benchmark-specific overfitting. Multi-factor authentication systems combine credentials but do not enforce evidence diversity across cognitive modalities, nor do they sandbox the verdict from the calling agent.

The substrate's distinction is the structural combination of asymmetric information return, conjunctive cross-modal evidence with diversity and temporal-stability requirements, and an unlock state that is sandboxed from the optimizer entirely. No single property is novel in isolation; the structural combination, applied at the skill-gate boundary of an autonomous cognitive system, is.

Threat Model and Resistance Properties

The substrate is designed against three principal attacker classes. The metrics-only optimizer attempts to maximize an inferred unlock signal by repeated probing of the gate, treating the unlock condition as a black-box reward. The asymmetric feedback property collapses this attacker's information channel: the gate releases meaningful gradient only along legitimate evidence dimensions, so the optimizer's gradient estimates over probing actions converge to zero. Without a usable gradient, the optimizer reduces to brute-force enumeration over a search space whose effective volume is set by the conjunctive cross-modal predicate and grows at least exponentially in the number of evidence classes.
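A rough lower bound on the brute-force cost can be worked out under simplifying assumptions that are not part of the disclosure: there are k evidence classes, a blind attempt satisfies class i with probability p_i, the classes are independent, and every p_i is at most some p < 1.

```latex
\Pr[\text{unlock per attempt}] \;=\; \prod_{i=1}^{k} p_i \;\le\; p^{k},
\qquad
\mathbb{E}[\text{attempts until unlock}] \;=\; \frac{1}{\prod_{i=1}^{k} p_i} \;\ge\; p^{-k}.
```

For example, with p = 0.1 and k = 5 the expected number of blind attempts is at least 10^5, and each additional class multiplies that figure by at least ten. Temporal-stability requirements raise the cost further, because attempts cannot be compressed into a single session.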

The surface mimic attempts to imitate observed unlocks by reproducing externally visible behaviors of legitimate users. Cross-modal evidence accumulation defeats this attacker because the modalities the substrate measures include features not visible on the surface: temporal stability across sessions the mimic did not participate in, cross-domain transfer to scenarios not previously enumerated, and adversarial-counterexample interactions whose correct handling requires actual competency rather than memorized responses. The mimic's coverage of the evidence manifold is shallow even when wide.

The structural attacker attempts to read or write the unlock state directly, bypassing the evidence requirement. The sandbox boundary makes this attacker's capability dependent on compromising the runtime's privilege isolation, not the gate logic. Runtime compromise is the threat addressed by integrity-aware inference and environment binding, which compose with this substrate as defense-in-depth. The substrate's resistance properties hold under the assumption that the runtime is intact; when that assumption fails, upstream mechanisms detect the compromise and revoke all unlocks issued under the affected runtime.

Hybrid attackers combine these strategies, and the substrate's properties combine accordingly. An attacker who optimizes against asymmetric feedback while mimicking surface behavior still cannot satisfy cross-modal evidence diversity; an attacker who satisfies surface modalities but cannot penetrate the sandbox cannot read the predicate to plan minimal-cost satisfaction. Each property closes a route that the others leave open, so the combined defense is stronger than any of the properties applied alone.

Disclosure Scope

This disclosure covers any skill-gating system in which unlock is conditioned on a conjunction of cross-modal evidence classes with diversity and temporal-stability requirements, in which feedback to the candidate is information-asymmetric with respect to probing-likelihood, and in which the unlock predicate and evidence ledger are held in a sandbox unreadable and unwritable from the candidate-controlled surface. The disclosure extends to embodiments that vary the implementation of the sandbox boundary, the form of the probing-likelihood classifier, the cardinality and identity of the evidence modalities, and the integration with upstream provenance and trust mechanisms.

The disclosure further extends to applications in companion AI competency progression, autonomous-agent capability authorization, regulated-domain certification of AI operators, and educational systems in which mastery claims must be defended against optimization-pressure gaming.

Invented by Nick Clark. Founding investors: Anonymous, Devin Wilkie.