Multimodal Evaluation Pipeline

by Nick Clark | Published March 27, 2026

Skill admission is gated by multimodal evidence. The evaluator simultaneously consumes the textual reasoning trace, the proposed action plan, and the sensor or environment inputs that motivated the skill. The evaluation function is constructed so that evidence drawn from any single modality is structurally insufficient to admit a skill into the agent's executable capability set.


Mechanism

The multimodal evaluation pipeline is the gating mechanism by which a candidate skill, proposed by a language model or generative subsystem, is either admitted into the agent's executable capability set or rejected and returned to the proposer. The pipeline is defined in Chapter 7 of the cognition patent as a deterministic evaluation function whose inputs are drawn from at least three structurally distinct modalities and whose output is a binary admission decision accompanied by a per-mode evidence vector recorded in the agent's lineage.

The first modality is the textual reasoning trace. When a language model proposes a new skill, it must accompany the proposal with a natural-language justification describing the skill's purpose, its expected effect on agent state, and the conditions under which the skill is intended to fire. The reasoning trace is parsed into structured assertions, and each assertion is checked for consistency against the agent's existing belief base and policy reference. A reasoning trace that asserts a precondition contradicting an established invariant is flagged as a per-mode failure.
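
As a minimal sketch of the consistency check (all names here, such as Assertion and BeliefBase, are illustrative and not drawn from the reference implementation), each structured assertion can be compared against the belief base's established invariants:

```python
# Illustrative sketch: checking parsed reasoning-trace assertions against
# an agent's belief base. Names are hypothetical.
from dataclasses import dataclass

@dataclass(frozen=True)
class Assertion:
    subject: str
    predicate: str
    value: bool  # the truth value the trace asserts

class BeliefBase:
    def __init__(self, invariants):
        # invariants maps (subject, predicate) -> established truth value
        self.invariants = invariants

    def check(self, assertions):
        """Return the assertions that contradict an established invariant."""
        failures = []
        for a in assertions:
            key = (a.subject, a.predicate)
            if key in self.invariants and self.invariants[key] != a.value:
                failures.append(a)
        return failures

beliefs = BeliefBase({("gripper", "holds_object"): False})
trace = [Assertion("gripper", "holds_object", True),   # contradicts invariant
         Assertion("battery", "charged", True)]        # no established belief
contradictions = beliefs.check(trace)
```

Any non-empty contradiction list would be recorded as a per-mode failure for the textual modality.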

The second modality is the action plan. The proposer must emit a concrete, executable plan expressed in the agent's canonical action vocabulary, specifying the operators that the skill would invoke, the order in which they would fire, the resources they would consume, and the post-conditions they would establish. The action plan is symbolically simulated against a model of the agent's environment, and the simulation produces a trace of intermediate states that is itself recorded as evidence.
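
A minimal way to picture the symbolic simulation, assuming operators are modeled as precondition/effect pairs over a flat state dictionary (the operator names and state keys below are invented for illustration):

```python
# Illustrative symbolic execution of an action plan. Each operator is a
# (preconditions, effects) pair; the trace of intermediate states is the
# evidence recorded by the action-plan modality.
OPERATORS = {
    "pick":  ({"holding": None}, {"holding": "item"}),
    "place": ({"holding": "item"}, {"holding": None, "placed": True}),
}

def simulate(plan, state):
    """Symbolically execute plan; return (ok, trace of intermediate states)."""
    trace = [dict(state)]
    for op in plan:
        pre, eff = OPERATORS[op]
        if any(state.get(k) != v for k, v in pre.items()):
            return False, trace          # precondition violated: abort
        state = {**state, **eff}
        trace.append(dict(state))
    return True, trace

ok, trace = simulate(["pick", "place"], {"holding": None})
```

A plan whose precondition fails mid-simulation is a per-mode failure; the partial trace is still retained as evidence.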

The third modality is the sensor or environment input. The skill must be motivated by observable evidence drawn from the agent's perceptual channels, whether those channels are physical sensors, API responses, document corpora, or user utterances. The evaluator extracts the salient features from the input and verifies that the proposed skill responds to a real, currently present condition rather than to a hallucinated stimulus.

The pipeline's structural requirement is that admission requires concordant evidence across all three modalities. A skill that is well justified in text but unsupported by sensor evidence is rejected. A skill that is sensor-motivated but has no coherent action plan is rejected. A skill that has a plausible action plan but a textual justification that contradicts the agent's policy is rejected. Concordance is computed by a deterministic cross-modal aggregator that weights each modality according to the policy reference and produces a single admission score, which must exceed a configurable threshold for the skill to be admitted.

Operating Parameters

The pipeline is parameterized by the per-mode threshold vector, the cross-modal aggregation weights, the simulation depth used during action-plan evaluation, and the maximum admissible disagreement among modalities. Each parameter is specified declaratively in the policy reference and is auditable without inspection of the underlying model weights or generative process.

Per-mode thresholds are calibrated such that no single modality can dominate the admission decision. The aggregator is constructed so that a high score in one modality cannot compensate for a sub-threshold score in another; the admission function is monotonic in each input but bounded above by the minimum of the per-mode scores when that minimum falls below a structural floor. This structural property is what distinguishes multimodal evaluation from a simple weighted average and is the property on which the patent's gating guarantee rests.
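
The bounded-aggregate property can be expressed in a few lines (a sketch, assuming equal weights and a single structural floor; the real parameterization is per policy reference):

```python
def admission_score(scores, weights, floor):
    """Weighted aggregate, but bounded above by the weakest modality
    whenever that modality falls below the structural floor.
    Monotonic in each input."""
    weighted = sum(w * s for w, s in zip(weights, scores)) / sum(weights)
    weakest = min(scores)
    if weakest < floor:
        return min(weighted, weakest)   # weak modality caps the aggregate
    return weighted

# A high text score cannot rescue a sub-floor sensor score:
capped = admission_score([0.95, 0.90, 0.20], [1, 1, 1], floor=0.5)
```

Here the aggregate is capped at 0.20 despite two strong modalities, which is the property a plain weighted average lacks.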

Simulation depth governs how far the action plan is symbolically executed before its post-conditions are evaluated. Shallow simulation is fast but may miss long-horizon contradictions; deep simulation is expensive but catches subtle policy violations. The depth parameter is selected per skill class according to the policy reference, with safety-critical classes requiring deeper simulation than informational ones.

Maximum admissible disagreement specifies the largest variance permitted across the per-mode evidence vector. When variance exceeds the bound, the skill is rejected even if the aggregate score would otherwise pass. This is the mechanism by which the pipeline detects modality-specific hallucination, in which a generator produces internally consistent but cross-modally inconsistent justifications.
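
A disagreement bound of this kind might be checked as follows (the variance bound and the example vectors are illustrative, not values from the patent):

```python
from statistics import pvariance

def passes_disagreement_bound(evidence, max_var):
    """Reject when cross-modal variance exceeds the configured bound,
    even if the aggregate score would otherwise pass."""
    return pvariance(evidence) <= max_var

concordant = [0.80, 0.75, 0.82]
discordant = [0.99, 0.98, 0.10]   # modality-specific hallucination signature
```

The discordant vector has a respectable mean (about 0.69) yet fails the bound, which is exactly the case the mechanism is meant to catch.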

Alternative Embodiments

The pipeline admits embodiments in which the modality count is greater than three. Embodiments incorporating proprioceptive evidence, social-context evidence, and historical-trace evidence as additional modalities are explicitly contemplated. The structural property that admission requires concordant evidence across all configured modalities is preserved regardless of count.

Embodiments differ in the choice of cross-modal aggregator. A min-aggregator embodiment rejects any skill whose weakest modality falls below threshold. A geometric-mean embodiment penalizes disagreement more aggressively than a linear average. A learned-aggregator embodiment trains the weights against a held-out audit set, subject to the constraint that the learned weights remain bounded by the policy reference's structural floors.
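
The first two aggregator embodiments are simple enough to state directly (a sketch; the learned-aggregator variant is omitted because its weights are deployment-specific):

```python
import math

def min_aggregate(scores, threshold):
    """Min-aggregator embodiment: reject if the weakest modality is sub-threshold."""
    return min(scores) >= threshold

def geometric_mean(scores):
    """Geometric-mean embodiment: disagreement is penalized multiplicatively."""
    return math.prod(scores) ** (1 / len(scores))

def linear_mean(scores):
    """Plain average, shown only for contrast with the embodiments above."""
    return sum(scores) / len(scores)

scores = [0.9, 0.9, 0.4]
```

For the example vector, the geometric mean (about 0.69) sits below the linear average (about 0.73), illustrating its harsher treatment of the weak modality.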

Embodiments also differ in the locus of evaluation. In a centralized embodiment, a single evaluation engine receives all three modalities and produces the admission decision. In a federated embodiment, per-modality evaluators operate independently and emit signed evidence tokens, and a separate aggregation engine combines the tokens. The federated embodiment supports deployments in which the perceptual subsystems are physically or organizationally distinct from the cognitive core.

A streaming embodiment evaluates skills incrementally as evidence arrives, admitting a skill provisionally on partial evidence and revoking the admission if subsequent evidence falls below threshold. A batched embodiment accumulates evidence until all modalities have reported and emits a single admission decision. The streaming and batched embodiments are structurally equivalent under the patent's claims; the distinction is operational.
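
A toy version of the streaming embodiment's admit-then-revoke behavior (class and field names are invented; the real revocation path runs through the capability registry):

```python
class StreamingGate:
    """Provisional admission that is revoked if later evidence
    falls below threshold. Illustrative, not the reference design."""
    def __init__(self, threshold, n_modalities):
        self.threshold = threshold
        self.n = n_modalities
        self.scores = []
        self.admitted = False

    def report(self, score):
        self.scores.append(score)
        if score < self.threshold:
            self.admitted = False          # revoke on sub-threshold evidence
        elif all(s >= self.threshold for s in self.scores):
            self.admitted = True           # provisional until all modes report

    def final(self):
        """Batched-equivalent decision once every modality has reported."""
        return self.admitted and len(self.scores) == self.n
```

Once all modalities have reported, final() agrees with the batched decision, which is the sense in which the two embodiments are structurally equivalent.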

Composition With Adjacent Mechanisms

The pipeline composes with the broader skill-gating architecture by emitting admission decisions that are consumed by the agent's capability registry. Admitted skills are added to the registry along with the per-mode evidence vector that justified their admission; rejected skills are returned to the proposer with a structured rejection report indicating which modalities failed and why.

Composition with the confidence-governance subsystem is direct. The aggregate admission score is exposed as a confidence input to downstream gating, so a skill admitted with a marginal score propagates lower confidence into actions that depend on it. Composition with the capability-genealogy subsystem is also direct: admitted skills inherit their parent capabilities' bounds, and the multimodal evidence vector is recorded as the genealogical justification for the derivation.

Distinction From Prior Art

Prior art in language-model evaluation typically uses single-modality benchmarks, in which a model's outputs are scored against a textual reference. Such benchmarks cannot detect skills that are textually plausible but environmentally unmotivated. Prior art in multimodal learning combines modalities at the representation level, producing a fused embedding that is then evaluated; this fusion destroys the per-mode signal on which the structural floor depends.

The pipeline's distinguishing structural property is that modalities are evaluated independently and aggregated under a constraint that prevents any single modality from carrying the decision. This property is the basis for the patent's claim of structural sufficiency and is not present in prior multimodal evaluation systems.

Implementation Considerations

A faithful implementation must preserve the per-mode independence of the evaluators. Evaluators that share state across modalities, whether through shared embeddings, shared attention layers, or shared decoding heads, undermine the structural floor and erase the cross-modal disagreement signal. The reference implementation runs each evaluator in a separate process whose only output is a signed evidence token, and the aggregator consumes the tokens without access to the evaluator internals.

The reference implementation also requires that the simulation engine used during action-plan evaluation be deterministic and seedable. Non-deterministic simulation produces non-reproducible admission decisions and breaks the audit guarantee. Where the underlying environment is itself stochastic, the simulation engine substitutes a deterministic surrogate parameterized by the policy reference, and the surrogate's outputs are recorded as part of the evidence vector.
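
The seedable-surrogate requirement can be illustrated with a private, seeded random generator (the plan steps and recorded values are placeholders):

```python
import random

def surrogate_rollout(plan, seed):
    """Deterministic, seedable surrogate for a stochastic environment.
    Same seed and same plan yield an identical trace, so the admission
    decision is reproducible from the lineage. Illustrative only."""
    rng = random.Random(seed)            # private RNG; no global state
    trace = []
    for step in plan:
        noise = rng.random()             # stands in for environment stochasticity
        trace.append((step, round(noise, 6)))
    return trace

a = surrogate_rollout(["move", "grasp"], seed=42)
b = surrogate_rollout(["move", "grasp"], seed=42)
```

Recording the seed alongside the trace is what lets an auditor re-derive the evidence vector exactly.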

Operationally, the pipeline interacts with rate-limiting and resource-governance subsystems. Each evaluator is bounded by a per-call resource budget, and exhaustion of the budget is treated as a per-mode failure rather than as a default-admit or default-reject. This treatment ensures that resource pressure cannot be exploited to coerce admission of an under-justified skill, and it ensures that an attacker cannot cause selective rejection of valid skills by inducing budget exhaustion in a single modality.
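
The fail-closed budget treatment might be sketched like this (the cost model and return shape are assumptions for illustration):

```python
def evaluate_mode(evaluator, evidence, budget):
    """Run a per-modality evaluator under a resource budget. Exhaustion or
    an evaluator error maps to a per-mode failure, never a default admit
    or default reject."""
    try:
        cost, score = evaluator(evidence)   # evaluator reports its own cost
        if cost > budget:
            return ("fail", "budget_exhausted")
        return ("ok", score)
    except Exception:
        return ("fail", "evaluator_error")

def cheap(e):  return (1, 0.8)      # fits within budget
def pricey(e): return (100, 0.9)    # blows the budget despite a high score
```

Because exhaustion yields a per-mode failure rather than a default decision, starving one evaluator cannot push an under-justified skill through the gate.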

The pipeline also exposes a structured rejection report whose contents are themselves part of the audit trail. A rejected skill is returned to the proposer with the per-mode evidence vector, the failed-mode identifiers, and the policy-reference clauses that motivated the rejection. The report enables the proposer to refine future proposals without exposing the internal weights of the evaluators or the contents of the policy reference beyond what the report explicitly discloses.
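
A rejection report of this shape is easy to model; the field names and clause identifier below are hypothetical stand-ins:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RejectionReport:
    """Structured rejection report returned to the proposer.
    Field names are illustrative."""
    skill_id: str
    evidence_vector: tuple      # per-mode scores, in modality order
    failed_modes: tuple         # e.g. ("sensor",)
    policy_clauses: tuple       # clause IDs that motivated the rejection

report = RejectionReport(
    skill_id="skill-0042",
    evidence_vector=(0.91, 0.88, 0.12),
    failed_modes=("sensor",),
    policy_clauses=("7.3.2",),
)
```

The report exposes only the evidence vector and clause identifiers, not evaluator internals, matching the disclosure boundary described above.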

Failure Modes And Mitigations

The pipeline is designed to fail safely under each of three structural failure modes. Modality dropout, in which one of the configured evaluators fails to report within its budget, is treated as a per-mode failure rather than as a default-admit, and the skill is rejected with a dropout-class rejection report. Modality coercion, in which an attacker induces concordant but fabricated evidence across all modalities, is mitigated by the structural requirement that each evaluator operate over independent inputs and emit signed evidence tokens whose signatures are validated by the aggregator. Aggregator compromise, in which the aggregator itself is induced to produce an admit decision absent supporting evidence, is mitigated by requiring that the aggregator's decision be reproducible from the lineage by an independent auditor.
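
The signed-evidence-token mitigation can be sketched with a symmetric HMAC (a simplification: a real deployment would more plausibly use per-evaluator asymmetric keys, and the payload schema here is invented):

```python
import hmac, hashlib, json

def sign_token(key, mode, score):
    """Per-modality evaluator emits a signed evidence token."""
    payload = json.dumps({"mode": mode, "score": score}, sort_keys=True)
    sig = hmac.new(key, payload.encode(), hashlib.sha256).hexdigest()
    return {"payload": payload, "sig": sig}

def verify_token(key, token):
    """Aggregator validates the signature before consuming the evidence."""
    expected = hmac.new(key, token["payload"].encode(),
                        hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, token["sig"])

key = b"per-evaluator-secret"
tok = sign_token(key, "sensor", 0.83)
```

Tampering with the payload invalidates the signature, so fabricated concordance requires compromising every evaluator's key, not merely the channel to the aggregator.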

Hallucination-class failures, in which a generative proposer emits a textually fluent but environmentally unmotivated skill, are detected by the cross-modal disagreement check rather than by any single evaluator. The structural property of the pipeline is that detection of such failures does not depend on the textual evaluator's recognizing the hallucination; it depends only on the absence of concordant evidence in the sensor and action-plan modalities. This property is what makes the pipeline robust against improvements in generative fluency that would otherwise degrade single-modality evaluators.

Disclosure Scope

This article discloses the structural mechanism, operating parameters, and alternative embodiments of the multimodal evaluation pipeline as defined in Chapter 7 of the cognition patent. The disclosure is sufficient to enable a person of ordinary skill in the art to construct an embodiment without reference to the patent's underlying implementation. The patent's claims govern all embodiments that incorporate the structural sufficiency property, regardless of the specific modalities, aggregators, or simulation strategies adopted.
