Semantic Rollback and Checkpoint Recovery

by Nick Clark | Published March 27, 2026

When post-execution evidence shows that an inference call should not have proceeded, the rollback procedure reverts the call to a prior checkpoint, bounds the rollback so that work outside the contaminated region is preserved, and notifies downstream consumers whose state was derived from the rolled-back call. Rollback is a structural primitive of the inference-control surface, not a remediation overlaid on top of it.

Mechanism

The rollback mechanism couples three components: a checkpointing layer that captures inference state at well-defined boundaries, a violation detector that emits rollback triggers when post-execution evidence contradicts an in-flight or recently-completed call, and a propagation procedure that carries the rollback to downstream consumers in a bounded, ordered fashion.

Checkpoints are written at every commit boundary of the inference loop. A commit boundary is the point at which a candidate transition has passed the semantic admissibility gate and been incorporated into the inference state. The checkpoint records the state, the lineage entry that produced the state, and the policy bundle identifier active at the commit. Checkpoints are immutable and content-addressed; they accumulate as a log rather than overwriting prior state.
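A minimal sketch of an immutable, content-addressed checkpoint log, assuming the inference state can be serialized to a string. The names `Checkpoint` and `CheckpointLog`, and the use of SHA-256 over a JSON encoding, are illustrative assumptions and are not taken from the specification.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)  # frozen: a checkpoint cannot be mutated once written
class Checkpoint:
    state: str               # serialized inference state (hypothetical form)
    lineage_entry: str       # lineage entry that produced the state
    policy_bundle_id: str    # policy bundle active at the commit
    parent: Optional[str]    # address of the preceding checkpoint, if any

    def address(self) -> str:
        # Content address: SHA-256 over a deterministic JSON encoding.
        payload = json.dumps(
            [self.state, self.lineage_entry, self.policy_bundle_id, self.parent]
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class CheckpointLog:
    """Append-only log: entries accumulate and are never overwritten."""

    def __init__(self) -> None:
        self._order: list[str] = []
        self._by_address: dict[str, Checkpoint] = {}

    def append(self, cp: Checkpoint) -> str:
        addr = cp.address()
        if addr not in self._by_address:  # identical content, identical address
            self._by_address[addr] = cp
            self._order.append(addr)
        return addr

    def get(self, addr: str) -> Checkpoint:
        return self._by_address[addr]

    def __len__(self) -> int:
        return len(self._order)
```

Because the address is derived from the content, writing the same checkpoint twice returns the same address and does not grow the log.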

The violation detector consumes evidence from sources that become available after the commit: tool-call returns, downstream verifier outputs, contradicting observations from the agent's perception channels, or post-hoc policy re-evaluation under a successor bundle. When that evidence shows that the admissibility gate would have rejected a prior commit had the evidence been available at commit time, the detector emits a rollback trigger. The trigger names the offending checkpoint and the evidence that produced it; both are recorded in lineage.
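The detection step can be sketched as a re-run of the admissibility gate with the late evidence added. This is an illustration under assumptions: the gate signature, the `Commit` and `RollbackTrigger` types, and the `not:<claim>` encoding used by `toy_gate` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Commit:
    checkpoint_id: str
    claim: str


@dataclass(frozen=True)
class RollbackTrigger:
    checkpoint_id: str  # the offending checkpoint
    evidence: str       # the evidence that produced the trigger


# Hypothetical gate signature: (claim, evidence set) -> True if admissible.
Gate = Callable[[str, frozenset], bool]


def detect(commit: Commit, evidence_at_commit: frozenset,
           new_evidence: str, gate: Gate) -> Optional[RollbackTrigger]:
    """Re-run the gate with the late evidence; emit a trigger on a reject."""
    combined = evidence_at_commit | {new_evidence}
    if gate(commit.claim, combined):
        return None  # verdict unchanged, no rollback
    return RollbackTrigger(commit.checkpoint_id, new_evidence)


def toy_gate(claim: str, evidence: frozenset) -> bool:
    # Toy rule for the sketch: reject a claim that the evidence contradicts.
    return f"not:{claim}" not in evidence
```

A trigger is produced only when the verdict flips to reject; evidence that leaves the verdict unchanged yields `None`.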

The rollback procedure walks the checkpoint log from the offending commit forward, identifying the transitive closure of commits whose lineage descends from the offending one. Commits inside the closure are marked rolled-back; commits outside the closure are preserved untouched. The inference state is reconstructed from the checkpoint immediately preceding the offending commit, and the inference loop resumes from that state with the new evidence available to its admissibility gate.
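The closure walk can be sketched as a breadth-first traversal over lineage edges. The representation is an assumption made for illustration: `children` maps each commit to the commits derived directly from it, and `log_order` lists commits in the order they were written.

```python
from collections import deque
from typing import Optional


def rollback_closure(offending: str, children: dict[str, list[str]]) -> set[str]:
    """Return the offending commit plus every commit descending from it."""
    closure = {offending}
    queue = deque([offending])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in closure:
                closure.add(child)
                queue.append(child)
    return closure


def resume_point(log_order: list[str], offending: str) -> Optional[str]:
    """Checkpoint immediately preceding the offending commit in the log."""
    idx = log_order.index(offending)
    return log_order[idx - 1] if idx > 0 else None
```

Commits that are not reachable from the offending one never enter the closure, which is what leaves work outside the contaminated region untouched.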

Operating Parameters

Rollback bounding is parametric along several axes. The temporal bound limits how far back in the checkpoint log a rollback may reach; calls whose checkpoints precede the bound are not rolled back even if they descend from an offending commit, and are instead flagged for out-of-band review. The temporal bound prevents a late-arriving evidence event from invalidating an unbounded prefix of past inference. The bound in force is recorded in lineage with each rollback.
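As a sketch, the temporal bound partitions a rollback closure into commits that are rolled back and commits that are flagged for review. Expressing the bound as a maximum age in seconds relative to the trigger time is an assumption of this example, as are the function and parameter names.

```python
def apply_temporal_bound(closure_times: dict[str, float],
                         trigger_time: float,
                         max_age: float) -> tuple[list[str], list[str]]:
    """Split a closure by checkpoint time.

    closure_times maps commit id -> checkpoint timestamp (seconds).
    Commits older than trigger_time - max_age are flagged, not rolled back.
    """
    cutoff = trigger_time - max_age
    rolled_back = sorted(c for c, t in closure_times.items() if t >= cutoff)
    flagged = sorted(c for c, t in closure_times.items() if t < cutoff)
    return rolled_back, flagged
```

For example, with a trigger at time 100 and a maximum age of 60, a commit checkpointed at time 10 is flagged while commits at times 50 and 90 are rolled back.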

The lineage bound limits which descendants are rolled back. A descendant whose own admissibility verdict was independent of the offending commit's contribution may be preserved; the procedure tests independence by re-running the descendant's gate against the post-rollback state. Descendants that pass the test are reattached to the post-rollback timeline; descendants that fail it are rolled back with the offending commit.
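The independence test can be sketched as re-running each descendant's gate against the post-rollback state. The `Descendant` type, its `requires` field, and the subset-based `toy_gate` are hypothetical stand-ins for whatever the real admissibility gate evaluates.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Descendant:
    commit_id: str
    requires: frozenset  # facts the original verdict relied on (hypothetical)


def toy_gate(d: Descendant, state: frozenset) -> bool:
    # Toy rule: admissible if everything the commit relied on is still present.
    return d.requires <= state


def partition_descendants(
    descendants: list[Descendant],
    gate: Callable[[Descendant, frozenset], bool],
    post_rollback_state: frozenset,
) -> tuple[list[str], list[str]]:
    """Return (preserved, rolled_back) commit ids, in input order."""
    preserved, rolled_back = [], []
    for d in descendants:
        if gate(d, post_rollback_state):
            preserved.append(d.commit_id)    # reattach to the new timeline
        else:
            rolled_back.append(d.commit_id)  # depends on the offending commit
    return preserved, rolled_back
```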

The notification scope governs which downstream consumers receive rollback events. Consumers register at subscription time with a scope filter expressed against the lineage namespace; when a rollback trigger fires, the propagation procedure delivers the event to every consumer whose scope filter intersects the rolled-back closure. Delivery is ordered: a consumer never observes a rollback for a commit it has not yet observed.
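One way to sketch scope filtering and ordered delivery is shown below. It assumes, for illustration only, that a scope filter is a prefix on lineage identifiers and that a rollback for a not-yet-observed commit is held until the consumer observes that commit. The `Consumer` class and `propagate` function are hypothetical.

```python
class Consumer:
    def __init__(self, name: str, scope_prefix: str) -> None:
        self.name = name
        self.scope_prefix = scope_prefix   # filter over the lineage namespace
        self.observed: set[str] = set()    # commits this consumer has seen
        self.inbox: list[str] = []         # rollback events actually delivered
        self._held: list[str] = []         # rollbacks waiting on their commit

    def observe_commit(self, commit_id: str) -> None:
        self.observed.add(commit_id)
        if commit_id in self._held:        # release a rollback that was waiting
            self._held.remove(commit_id)
            self.inbox.append(commit_id)

    def deliver_rollback(self, commit_id: str) -> None:
        if commit_id in self.observed:
            self.inbox.append(commit_id)
        else:
            self._held.append(commit_id)   # never show a rollback before its commit


def propagate(closure: set[str], consumers: list[Consumer]) -> None:
    """Deliver each rolled-back commit to every consumer whose scope matches."""
    for consumer in consumers:
        for commit_id in sorted(closure):
            if commit_id.startswith(consumer.scope_prefix):
                consumer.deliver_rollback(commit_id)
```

Holding the event on the consumer side is only one possible way to satisfy the ordering property; a transport that preserves commit-then-rollback order would achieve the same result.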

Rollback idempotency is enforced. A rollback of an already-rolled-back commit is a no-op that records the redundant trigger in lineage but does not perturb state. This protects against double-firing when multiple evidence sources independently produce the same trigger.
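A minimal sketch of the idempotency rule, assuming rolled-back commits are tracked in a set and lineage is an append-only list of tuples. The class name and record shapes are illustrative.

```python
class RollbackState:
    def __init__(self) -> None:
        self.rolled_back: set[str] = set()
        self.lineage: list[tuple[str, str, str]] = []

    def rollback(self, commit_id: str, source: str) -> bool:
        """Apply a rollback once; later triggers for the same commit are no-ops.

        Returns True if state changed, False for a redundant trigger.
        """
        if commit_id in self.rolled_back:
            # Record the redundant trigger but leave state untouched.
            self.lineage.append(("redundant_trigger", commit_id, source))
            return False
        self.rolled_back.add(commit_id)
        self.lineage.append(("rollback", commit_id, source))
        return True
```

Two evidence sources firing on the same commit therefore produce two lineage records but only one state change.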

Alternative Embodiments

The checkpointing layer may be embodied as a copy-on-write snapshot of the inference state, as a delta log against a baseline state, or as a content-addressed Merkle structure where each checkpoint is a commitment to the path from the root. The choice trades storage cost for reconstruction cost; the mechanism is indifferent to the choice provided that any prior state is recoverable in bounded time.
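For the delta-log embodiment, reconstruction amounts to replaying deltas over the baseline. The sketch below assumes the state is a flat dictionary and that each delta only sets keys; key deletion and nested state are deliberately not modelled.

```python
def reconstruct(baseline: dict, deltas: list[dict], upto: int) -> dict:
    """Rebuild the state after the first `upto` deltas have been applied."""
    state = dict(baseline)        # copy so the baseline is never mutated
    for delta in deltas[:upto]:
        state.update(delta)       # each delta overwrites or adds keys
    return state
```

Reconstruction cost here grows with the number of deltas replayed, which is the storage-for-reconstruction trade the paragraph above describes; periodic full snapshots would be one way to keep that cost bounded.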

The violation detector may be embodied as an in-process component of the inference runtime, as a sidecar that consumes the runtime's lineage stream, or as an external service that subscribes to lineage events. The sidecar embodiment is preferred where the detector must operate independently of the runtime's compute budget; the in-process embodiment is preferred where detection latency dominates.

The propagation procedure may be embodied as a synchronous call to each downstream consumer, as an asynchronous event delivered through a message bus, or as a poll-based mechanism where consumers periodically reconcile against the current rolled-back set. The mechanism's correctness requires only that the ordering and scope properties hold; the delivery substrate is a deployment-time choice.

Rollback may be embodied as a hard rollback that discards the rolled-back commits or as a soft rollback that retains them in lineage with a rolled-back marker. Soft rollback preserves auditability of the original trajectory; hard rollback reduces storage. The two embodiments are interoperable: a system may write soft rollbacks and apply a retention policy that converts them to hard rollbacks after a configurable interval.
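The conversion from soft to hard rollback can be sketched as a retention pass over lineage records. The record shape (a dictionary with `id` and `rolled_back_at`, where `None` means the commit is live) and the parameter names are assumptions made for this example.

```python
def apply_retention(records: list[dict], now: float, retain_for: float) -> list[dict]:
    """Drop soft-rolled-back records whose marker is older than retain_for.

    Live records (rolled_back_at is None) are always kept.
    """
    kept = []
    for rec in records:
        marked_at = rec["rolled_back_at"]
        expired = marked_at is not None and now - marked_at > retain_for
        if not expired:
            kept.append(rec)
    return kept
```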

Composition

Rollback recovery composes with the policy-governed admission gate. When a successor policy bundle is admitted whose predicates would have rejected a prior commit, the bundle's admission produces a rollback trigger against the affected commits. The rollback procedure then operates as it would for any other evidence-driven trigger; the difference is only in the source of the evidence.

Rollback recovery composes with the lineage substrate. Every checkpoint, every trigger, every propagation event, every consumer notification is a lineage record. The lineage record of a rollback is sufficient to reconstruct the pre-rollback and post-rollback inference states without consulting any other artifact, which makes the procedure auditable end-to-end.

Rollback recovery composes with the trust-slope mechanism. If subsequent investigation finds a trigger spurious, the rollback event lowers the trust-slope contribution of the source whose evidence produced it; conversely, a confirmed trigger raises that source's contribution. The bookkeeping is performed by the trust-slope mechanism, not by the rollback procedure itself, but the rollback procedure is the event source.

Prior-Art Distinction

Database-style transactional rollback is the closest prior-art analogue but operates over a different substrate. Database rollback reverts a transaction whose constituent operations have not yet committed, on the basis of an explicit abort signal raised before commit. The mechanism here reverts transitions that have already committed, on the basis of evidence that becomes available after commit. The post-commit-evidence trigger is the structural distinction.

Conventional AI safety overlays apply post-generation filters that suppress problematic outputs before they reach a user. Suppression is not rollback; the suppressed output remains in the inference state, contaminates subsequent inference, and consumes downstream compute. The mechanism here removes the problematic commit from the state itself and propagates the removal to downstream consumers, so the contamination does not persist.

Checkpoint-restart systems in distributed computing restore a process to a prior snapshot after a crash. The trigger is process failure rather than semantic violation, and the granularity is the entire process rather than individual commits within an inference loop. The mechanism here operates at the per-commit granularity and treats semantic violation as a first-class trigger event.

Implementation Considerations

The checkpoint storage budget is the first practical concern. Per-commit checkpointing in a deep inference loop produces a long log; the log's storage cost grows linearly with the number of commits. Production deployments manage the cost through retention policies that keep full checkpoints within a recent window and reduced-fidelity checkpoints outside the window. The reduced-fidelity form retains the lineage record and the policy bundle identifier but compresses or discards the inference state itself; rollback into the reduced-fidelity region is correspondingly limited. The mechanism is compatible with retention policies provided that the temporal bound on rollback is not configured to exceed the full-fidelity retention window.
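The compatibility condition between the temporal bound and the retention window can be checked at configuration time. A minimal sketch, with hypothetical parameter names and both values expressed in seconds:

```python
def validate_bounds(temporal_bound_s: float, full_fidelity_window_s: float) -> None:
    """Reject a configuration where rollback could reach reduced-fidelity state."""
    if temporal_bound_s > full_fidelity_window_s:
        raise ValueError(
            f"temporal bound ({temporal_bound_s}s) exceeds the full-fidelity "
            f"retention window ({full_fidelity_window_s}s)"
        )
```

Failing fast at startup is a design choice of this sketch; a deployment could instead clamp the bound to the window and log a warning.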

The propagation latency budget is the second. A rollback that fails to reach a downstream consumer before the consumer commits derived state to its own users has not actually contained the contamination. Production deployments meet the budget by registering downstream consumers with synchronous propagation if their commit-to-user latency is short, and with asynchronous propagation if their latency is long enough that a propagation event will reliably reach them before they emit. Mismatched configuration is the customary failure mode and is detected by reviewing rollback events against downstream commit timestamps.
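The review of rollback events against downstream commit timestamps can be sketched as a comparison of two timestamp maps. Both maps, keyed by the rolled-back commit id, are assumed inputs for illustration; how a deployment collects them is outside the sketch.

```python
def late_rollbacks(rollback_delivered_at: dict[str, float],
                   consumer_emitted_at: dict[str, float]) -> list[str]:
    """Commit ids whose rollback arrived at or after the consumer had emitted.

    rollback_delivered_at: commit id -> time the rollback reached the consumer.
    consumer_emitted_at:   commit id -> time the consumer emitted derived state.
    """
    return sorted(
        cid
        for cid, delivered in rollback_delivered_at.items()
        if cid in consumer_emitted_at and consumer_emitted_at[cid] <= delivered
    )
```

A non-empty result suggests that the affected consumer should be moved to synchronous propagation or given a longer hold before it emits.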

Trigger-source trust is the third. A violation detector that fires on poorly-correlated evidence will produce frequent false rollbacks, each of which discards work that should have been preserved. A detector that fires only on high-confidence evidence will miss real violations. The trust-slope mechanism partially compensates by adjusting the weight of trigger sources over time, but the initial calibration of each trigger source is a per-deployment concern. Conservative initial calibration combined with trust-slope-driven refinement is the customary discipline.

Cross-runtime coordination is the fourth practical concern. In a deployment where multiple inference runtimes share a lineage substrate but operate independently on overlapping work, a rollback in one runtime may invalidate state that another runtime has already consumed. The propagation procedure handles the consumer notification, but the runtimes themselves must be configured to honor incoming rollback events as authoritative against their own derived state. A runtime that ignores cross-runtime rollback events will retain contaminated state even after the originating runtime has reverted; the contamination will surface later as a divergence between the two runtimes' lineage records. Production deployments verify cross-runtime rollback honoring through periodic reconciliation of lineage records across the participating runtimes.

Operator visibility is the fifth. A rollback is a costly event whose occurrence carries information about the inference system, the policy bundle, or the trigger source. Each rollback should produce a record in an operator-facing surface so that patterns can be detected; a rollback storm in a narrow time window is a different signal from a steady background rate of independent rollbacks, and the two require different operator responses. The mechanism produces lineage records that are sufficient for the operator-facing surface to be constructed; constructing it is a per-deployment concern.
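Distinguishing a storm from a background rate can be sketched with a sliding-window count over rollback timestamps. The window length and threshold are deployment-specific assumptions, and `storm_times` is a hypothetical helper name.

```python
def storm_times(timestamps: list[float], window: float, threshold: int) -> list[float]:
    """Return the timestamps at which the trailing window holds >= threshold rollbacks."""
    ts = sorted(timestamps)
    storms = []
    start = 0
    for end, t in enumerate(ts):
        # Advance the window start until it lies within `window` of t.
        while t - ts[start] > window:
            start += 1
        if end - start + 1 >= threshold:
            storms.append(t)
    return storms
```

With rollbacks at times 0, 100, 200, 201, 202, 203 and 500, a window of 5 and a threshold of 3, the function reports 202 and 203: the cluster around 200 is a storm, while the isolated events are background.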

Disclosure Scope

The disclosed mechanism covers any inference system that captures per-commit checkpoints with lineage and policy binding, accepts post-commit evidence as a rollback trigger, performs a bounded transitive rollback over the lineage closure of the triggered commit, and notifies registered downstream consumers in an order-preserving fashion. The mechanism is independent of the checkpoint substrate, the evidence source, the propagation transport, and the consumer subscription model.

The cognition patent specification describes one embodiment of the mechanism. The disclosure is not limited to that embodiment. The claims encompass any system whose checkpointing, evidence-triggered reversion, bounded propagation, and ordered notification together produce the structural properties recited above, regardless of the storage form of checkpoints, the surface form of evidence, the topology of consumers, or the implementation of the propagation procedure.