Skill Regression Detection and Capability Revocation

by Nick Clark | Published March 27, 2026

Skill regression detection compares a rolling window of recent demonstrated performance against the historical baseline that originally qualified the skill for its current tier. When the rolling comparison shows statistically significant degradation against tier-bound thresholds, the architecture automatically downgrades the affected capability — restricting the action surface that the skill unlocks until performance recovers or a re-qualification cycle completes. The mechanism treats skill ratings not as a one-time admission decision but as a continuously revisable contract between the agent and the governance substrate.


Mechanism

The regression detector is structured as a deterministic comparator operating over two time windows of the agent's demonstrated-skill ledger. The reference window captures the performance distribution that originally qualified the skill at its current tier; the rolling window captures the most recent N evaluation episodes. Each episode in the ledger is a credentialed observation containing the task signature, the outcome score, the substrate identifier, and the policy revision under which the action was attempted. Because every episode is recorded in the agent's lineage, the comparator operates on data that any auditor can replay.
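The episode structure and window selection can be sketched directly. The field names below mirror the prose; the dataclass, the helper, and the window length are illustrative rather than canonical.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Episode:
        task_signature: str    # task family the episode exercised
        outcome_score: float   # scored outcome of the episode
        substrate_id: str      # substrate the action executed on
        policy_revision: str   # policy revision in force at the time

    def rolling_window(ledger: list[Episode], n: int) -> list[Episode]:
        """Most recent N evaluation episodes, in ledger order."""
        return ledger[-n:]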

On each scheduled evaluation, the comparator computes a tier-specific test statistic — typically a difference of medians paired with a dispersion estimate — and checks whether the rolling distribution has drifted past the threshold the policy reference associates with the skill's current tier. If the threshold is crossed, the architecture emits a regression event, which is itself a credentialed observation. The event triggers the capability-revocation handler, which writes a new tier assignment to the canonical skill registry and propagates the change to every downstream gate that consults the registry before admitting an action. The agent does not have to opt into the downgrade; the gate simply stops admitting the actions that the higher tier authorized.
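A minimal sketch of the comparator and the revocation handler, assuming a difference-of-medians statistic and a dict-backed registry keyed by skill identifier; the dispersion pairing mentioned above is sketched separately under Operating Parameters, and the names here are illustrative.

    import statistics

    def regression_detected(reference: list[float], rolling: list[float],
                            threshold: float) -> bool:
        """Deterministic check: has the rolling median drifted below the
        reference median by more than the tier-bound threshold?"""
        return statistics.median(reference) - statistics.median(rolling) > threshold

    def revoke(registry: dict[str, int], skill_id: str) -> None:
        """Capability-revocation handler: write the lower tier to the
        canonical registry consulted by every downstream gate."""
        registry[skill_id] = max(registry[skill_id] - 1, 0)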

The downgrade is not a permanent verdict. The same comparator that identified the regression also defines the recovery condition: a forward window in which performance returns above a re-qualification threshold for a minimum number of episodes. Until that condition is met, the lower tier holds. The mechanism therefore behaves like a hysteresis loop — it costs less to lose a tier than to regain it — which is the structurally correct asymmetry for a system that must avoid silent capability inflation while still permitting recovery.
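The recovery condition amounts to a stricter forward-window check, sketched below under the assumption that outcome scores are directly comparable to the re-qualification threshold.

    def requalified(forward_scores: list[float], requal_threshold: float,
                    min_episodes: int) -> bool:
        """Hysteresis: the lower tier holds until at least `min_episodes`
        recent episodes sit at or above the re-qualification threshold,
        which is stricter than the regression threshold."""
        recent = forward_scores[-min_episodes:]
        return len(recent) >= min_episodes and all(
            score >= requal_threshold for score in recent)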

Operating Parameters

The rolling window length, the reference window selection rule, the test statistic family, the regression threshold, and the re-qualification threshold are all governance-credentialed parameters bound to the skill's tier. A high-tier skill — one that authorizes a broader or higher-consequence action surface — uses tighter thresholds, shorter rolling windows, and stricter re-qualification criteria, because the cost of a missed regression (a false negative) is greater. A low-tier skill tolerates wider drift before triggering a downgrade, because the cost of an unwarranted downgrade (a false positive) exceeds the cost of brief continued use under marginal performance.
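One way to express the tier binding is a per-tier parameter table. The asymmetry (tighter thresholds and shorter windows at higher tiers) follows the prose; the specific numbers are placeholders, not values from the disclosure.

    # Illustrative tier-bound parameters; higher tiers get shorter rolling
    # windows, tighter regression thresholds, and stricter re-qualification.
    TIER_PARAMS = {
        3: {"rolling_n": 20,  "regression_threshold": 0.05,
            "requal_threshold": 0.95, "requal_episodes": 30},
        2: {"rolling_n": 50,  "regression_threshold": 0.10,
            "requal_threshold": 0.90, "requal_episodes": 20},
        1: {"rolling_n": 100, "regression_threshold": 0.20,
            "requal_threshold": 0.85, "requal_episodes": 10},
    }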

Threshold curves are expressed against a tier-specific noise floor estimated from the reference window's dispersion, so that a skill with naturally noisy outcomes is not penalised for normal variance. The noise floor itself is recalculated whenever the reference window is rotated, which keeps the comparator calibrated against the agent's current operating regime rather than against stale early-life statistics. Each parameter change is itself a credentialed policy revision, recorded so that any downgrade event can be reconstructed against the exact threshold curve in force at the moment of detection.
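Normalising drift against the reference window's dispersion might look like the following sketch, which expresses the test statistic in noise-floor units rather than raw score units.

    import statistics

    def normalized_drift(reference: list[float], rolling: list[float]) -> float:
        """Drift in units of the reference window's dispersion, so a skill
        with naturally noisy outcomes is not penalised for ordinary variance."""
        noise_floor = statistics.pstdev(reference) or 1e-9  # guard against zero dispersion
        return (statistics.median(reference) - statistics.median(rolling)) / noise_floor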

Evaluation cadence is also a parameter. Some skills are evaluated continuously after every episode; others on a scheduled batch; others only when a triggering observation — for example, a downstream consumer flagging an anomalous result — opens an evaluation window. The cadence is bound to the skill, not to the agent, so a single agent can host skills with mixed evaluation regimes without creating ambiguity about when a downgrade applies.
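Because cadence is bound to the skill, a plain per-skill mapping is enough to express mixed regimes on one agent; the skill names below are hypothetical.

    from enum import Enum

    class Cadence(Enum):
        PER_EPISODE = "per_episode"  # evaluate after every episode
        BATCHED = "batched"          # evaluate on a schedule
        TRIGGERED = "triggered"      # evaluate when a flagged observation opens a window

    # Cadence is a property of the skill, not of the hosting agent.
    SKILL_CADENCE = {
        "route_planning": Cadence.PER_EPISODE,   # hypothetical skill names
        "report_summarisation": Cadence.BATCHED,
    }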

Alternative Embodiments

Several alternative embodiments fall within the disclosure. In the first, the rolling comparator is replaced by a sequential-test variant — a CUSUM or Bayesian change-point detector — which trades fixed-window batch decisions for incremental updates that can flag a regression mid-window. The structural contract with the registry is unchanged; only the test statistic differs. In the second, the comparator runs against a synthetic reference distribution generated from the policy reference rather than from a recorded reference window, which is useful when a skill has been newly admitted and lacks sufficient historical episodes to anchor a comparison.
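A one-sided CUSUM variant of the comparator, for illustration only: mu0 is the reference mean, k the allowance, and h the decision threshold, and the registry contract is exactly as described for the fixed-window comparator.

    def cusum_regression(scores: list[float], mu0: float, k: float, h: float) -> bool:
        """One-sided CUSUM for a downward shift in outcome scores; can flag a
        regression mid-window instead of waiting for a full batch."""
        s = 0.0
        for x in scores:
            s = max(0.0, s + (mu0 - x) - k)  # accumulate evidence of decline
            if s > h:
                return True
        return False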

A third embodiment supports multi-resolution downgrades. Rather than a single boolean tier transition, the registry encodes a graded tier — for example, the skill remains at tier T but with a reduced action subset bound to the affected task signatures. This is appropriate when the regression is concentrated in a specific task family rather than spread uniformly across the skill's surface. A fourth embodiment couples the regression detector to an external evaluator credential — for instance, a domain-certified third-party scoring service — which submits its scores into the ledger as credentialed observations so that the comparator can incorporate independent evidence alongside self-reported outcomes.
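A graded registry entry for the multi-resolution embodiment might carry the restricted task signatures alongside the tier; the structure and names here are illustrative.

    # The skill keeps its tier but loses the action subset bound to the
    # regressing task family.
    entry = {
        "skill": "triage_routing",                # hypothetical skill name
        "tier": 3,
        "restricted_signatures": {"paediatric"},  # where the regression concentrated
    }

    def admissible(entry: dict, task_signature: str, required_tier: int) -> bool:
        return (entry["tier"] >= required_tier
                and task_signature not in entry["restricted_signatures"])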

A fifth embodiment defers the downgrade through a probationary state: when a regression event fires, the skill is tagged probationary rather than immediately downgraded, and the next K episodes are run under enhanced supervision (additional validation engines, paired-substrate checks, or human-in-the-loop sampling). If probation completes without further regression evidence, the original tier is restored; if probation confirms the regression, the downgrade is committed. The probationary embodiment is preferred in domains where the cost of an unjustified downgrade — for example, switching off a clinically validated capability — is comparable to the cost of operating briefly under a marginal one.
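The probationary embodiment reduces to a small state machine over the skill's registry status; the state and event names below are illustrative.

    def next_state(state: str, event: str) -> str:
        """Probationary handling: a regression event tags the skill rather
        than downgrading it outright, and the next K supervised episodes
        either restore the tier or commit the downgrade."""
        transitions = {
            ("qualified", "regression_event"): "probationary",
            ("probationary", "probation_clean"): "qualified",        # tier restored
            ("probationary", "regression_confirmed"): "downgraded",  # downgrade committed
        }
        return transitions.get((state, event), state)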

Composition

Regression detection composes naturally with the wider skill-gating apparatus described elsewhere in the cognition patent. The same demonstrated-skill ledger that the gating mechanism consults to decide initial admissibility is the substrate over which the regression comparator operates; there is no parallel data path. The capability registry that downstream gates query is the canonical site at which both initial admission and subsequent downgrades take effect, so the rest of the architecture does not have to distinguish between the two events — it simply consults the current tier.

The mechanism also composes with the LLM proposal pipeline. Where an LLM contributes proposals into agent state via the mutation, validation, and arbitration engines, the validation engine consults the skill registry to determine which classes of proposal are admissible at the current tier. A regression-triggered downgrade therefore tightens the validation surface automatically — the LLM may continue to generate proposals, but the gate will reject those that exceed the post-downgrade tier. No coordination between the regression detector and the validation engine is required beyond the shared registry. This is the same structural pattern by which capability-permission separation, operator-intent admissibility, and tier-weighted fusion all integrate without bespoke wiring: each module reads and writes credentialed observations against a canonical store, and the architecture's coherence is a property of that store, not of any pairwise integration.
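The validation-engine check reduces to a registry lookup. Assuming integer tiers and a dict-backed registry, a sketch is:

    def admit_proposal(registry: dict[str, int], skill_id: str,
                       required_tier: int) -> bool:
        """A proposal is admissible only if the skill's current tier in the
        canonical registry meets the tier the proposal requires; a
        regression-triggered downgrade tightens this check automatically."""
        return registry.get(skill_id, 0) >= required_tier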

Prior-Art Distinction

Existing approaches to language-model and agent monitoring are largely advisory. Telemetry dashboards surface drift metrics for human review; A/B and shadow-mode evaluators compare candidate models offline; reinforcement-learning systems incorporate reward shaping but typically do not bind reward signals to admissibility decisions enforced by an external gate. None of these approaches exhibit the structural feature claimed here: a deterministic comparator whose output is a credentialed observation that automatically rewrites a canonical capability tier, which downstream gates consult before admitting actions, with tier-bound thresholds calibrated against the policy reference under which the skill was originally qualified.

Conventional monitoring is also typically scalar and global — a single accuracy or quality metric covering the model as a whole. The mechanism disclosed here is per-skill, per-tier, and per-task-signature, with thresholds derived from the reference window that qualified the skill rather than from arbitrary global benchmarks. This per-skill structural binding is what makes downgrades governable: an auditor can reproduce the event because the comparator, the reference window, the threshold curve, and the policy revision are all retrievable from the ledger.

Audit and Reproducibility

Every regression event, every downgrade, every probationary state transition, and every re-qualification is a credentialed observation pinned to the policy revision in force at the time. An auditor reconstructing the history of a skill walks the lineage and finds, in order, the original qualifying distribution, the threshold curve under which it was admitted, every rolling-window evaluation that consulted the comparator, every event that fired, every tier transition that resulted, and the policy revisions that altered any parameter along the way. The reconstruction is deterministic because the comparator is deterministic and the inputs are persisted; replaying the lineage on the same data produces the same decisions.
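A replay sketch, assuming each persisted evaluation records its windows and policy revision and that per-revision parameters are retrievable; the field names are illustrative.

    import statistics

    def replay(evaluations: list[dict], params_by_revision: dict) -> list[tuple]:
        """Auditor replay: re-run the deterministic comparator over persisted
        evaluations with the parameters recorded for each policy revision;
        the emitted event sequence must match the original lineage."""
        events = []
        for ev in evaluations:
            threshold = params_by_revision[ev["policy_revision"]]["regression_threshold"]
            drift = statistics.median(ev["reference"]) - statistics.median(ev["rolling"])
            if drift > threshold:
                events.append(("regression_event", ev["policy_revision"]))
        return events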

This level of audit is what makes the mechanism suitable for regulated domains. A clinical decision-support skill that downgraded mid-deployment can be defended on the merits — the auditor sees the rolling distribution, the threshold, the trigger condition, the policy revision, and the recovery path — rather than on assertion. A safety-of-life autonomous skill that survived a probationary period can be shown to have done so on the basis of recorded evidence rather than on the operator's confidence. The structural property that delivers this is not the comparator itself but the discipline of writing every decision as a credentialed observation against canonical stores.

Failure Modes Addressed

Several classes of silent skill failure motivate the mechanism. The first is performance drift driven by distributional shift in the input population: a skill admitted against one task distribution continues to be invoked as the population shifts, with quality degrading slowly enough that no single episode triggers an outright rejection. Without a rolling comparator anchored to the qualifying distribution, the degradation is invisible to per-episode gates. The second is upstream-model regression — an updated language model behind the same skill produces structurally similar but materially worse proposals. The third is substrate-induced regression, in which a hardware change, a sensor degradation, or a network condition affects the skill's ability to gather or act on inputs even though the skill logic is unchanged.

The disclosed mechanism addresses all three by attributing regressions through the lineage rather than by speculating about causes. Because each ledger entry records the substrate, the policy revision, and the model identifier alongside the outcome, the regression event carries enough provenance for downstream diagnostics to localise the cause. The downgrade itself does not depend on cause attribution — the tier is reduced regardless — but the recorded provenance enables operators to choose remediation: re-qualify against the new distribution, roll back the model, repair the substrate, or accept the lower tier as the new baseline.

Disclosure Scope

The disclosure covers any embodiment in which a credentialed comparator over a rolling and reference distribution of demonstrated-skill outcomes emits a credentialed regression event that rewrites a canonical capability tier consulted by downstream admission gates, with tier-bound thresholds and a re-qualification asymmetry. The specific test statistic, evaluation cadence, probationary handling, and multi-resolution downgrade structure are illustrative embodiments rather than required features. The disclosure also covers application across domains — autonomous vehicles, companion AI, therapeutic agents, enterprise automation — where the same structural mechanism is parameterised differently per the operating policy. It further covers any combination of the variants described in this disclosure — sequential-test comparators, synthetic reference distributions, multi-resolution downgrades, external-evaluator credentials, probationary states, and projection-aware evaluation — in any composition consistent with the canonical-store discipline that anchors the architecture.
