Mechanism
The multimodal evaluation pipeline is the subsystem responsible for acquiring, processing, scoring, and classifying evidence from multiple sensory modalities simultaneously. It is the evidential foundation on which the capability gate, the curriculum engine, and the anti-gaming measures rest. The pipeline is a multi-stream architecture: each modality produces an independent evaluation signal, and the composite evaluation is derived from the convergence or divergence of those independent signals rather than from any single stream.
The pipeline does not assess a language model's proposal. It assesses a learner or operator: a human (or composite system) whose accumulated performance evidence determines whether a capability gate opens. Each input stream is processed by a modality-specific evaluation module that produces a structured score vector, a set of per-dimension assessments relevant to that modality. The composite is then computed across those score vectors. This separation, independent per-modality scoring followed by cross-modality fusion, is what lets the pipeline detect disagreement between what a person reports and what their other channels reveal.
Input Streams
The pipeline supports five categories of input stream. Text-based input includes typed responses, structured form submissions, and natural language interaction transcripts. Audio-based input includes spoken responses, vocal prosody analysis, and environmental audio capture. Video-based input includes facial expression analysis, body posture and gesture recognition, manual task execution observation, and gaze tracking. Sensor-telemetry-based input includes force-torque measurements from equipment interaction, position and velocity data from motion capture, vehicle dynamics data from onboard sensors, and environmental condition measurements. Biometric-based input includes heart rate and heart rate variability, galvanic skin response, electroencephalographic signals where available, and respiration rate.
Each stream is processed by its own modality-specific evaluation module. The module produces a structured score vector whose dimensions are those relevant to that modality, not a single scalar. The pipeline therefore preserves, rather than collapses, the distinct information each channel carries, so the fusion stage can reason about the relationship between channels.
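A per-modality score vector can be pictured as a small typed record. The following is a minimal sketch, not the filing's data model: the class name `ScoreVector`, the helper `evaluate_text`, the dimension names, and the [0, 1] score range are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict

# The five input-stream categories named in the text.
MODALITIES = ("text", "audio", "video", "sensor_telemetry", "biometric")


@dataclass(frozen=True)
class ScoreVector:
    """Hypothetical per-modality result: one score per named dimension."""
    modality: str
    dimensions: Dict[str, float]

    def __post_init__(self):
        if self.modality not in MODALITIES:
            raise ValueError(f"unknown modality: {self.modality}")
        for name, value in self.dimensions.items():
            # Assumption: every dimension is normalized to [0, 1].
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} out of range: {value}")


def evaluate_text(accuracy: float, completeness: float) -> ScoreVector:
    """Stand-in for a text evaluation module; the dimensions are illustrative."""
    return ScoreVector("text", {"accuracy": accuracy, "completeness": completeness})
```

Keeping the dimensions in a named mapping, rather than returning one number, is what leaves the fusion stage something to compare across channels.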
The Fusion Engine
The composite evaluation is computed from the per-modality score vectors by a fusion engine that applies configurable weighting rules. The engine evaluates inter-modality consistency and produces a composite that accounts for both the individual modality signals and the degree to which those signals corroborate one another. The composite is therefore not a simple average of the streams.
The disclosure gives a concrete case. A learner who achieves high accuracy on text-based assessments, but whose biometric signals indicate elevated stress and cognitive overload, receives a composite evaluation that reflects the tension between the performance signal and the physiological signal, rather than a composite that averages the discrepancy away. The point of multi-stream fusion is to surface that tension, not to dissolve it into a single blended number.
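One way to keep that tension visible is to report the spread between modalities alongside the weighted figure. The sketch below is an illustration under stated assumptions, not the disclosed fusion rule: it assumes each modality has already been reduced to a single "support for competence" value in [0, 1], and the names `Composite`, `fuse`, and `tolerance` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Composite:
    support: float                  # weighted mean of per-modality support
    divergence: float               # gap between most and least supportive modality
    consistent: bool                # False when the gap exceeds the tolerance
    per_modality: Dict[str, float]  # original signals, kept for the gate


def fuse(support: Dict[str, float], weights: Dict[str, float],
         tolerance: float = 0.3) -> Composite:
    """Consistency-aware fusion sketch: the mean is reported together with
    the disagreement, so the disagreement is never averaged away."""
    if not support:
        raise ValueError("at least one modality is required")
    total = sum(weights[m] for m in support)
    mean = sum(weights[m] * s for m, s in support.items()) / total
    divergence = max(support.values()) - min(support.values())
    return Composite(mean, divergence, divergence <= tolerance, dict(support))


# The disclosure's case: strong text accuracy, physiology suggesting overload.
result = fuse({"text": 0.95, "biometric": 0.35}, {"text": 1.0, "biometric": 1.0})
```

Here `result.support` alone would look like a middling score, while `result.consistent` being false and `result.per_modality` record why it is middling.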
Continuous Identity Verification
The pipeline implements trust-validated identity checks at each evaluation point. Before any evaluation evidence is incorporated into a learner's progression record, the pipeline verifies that the individual producing the evidence is the individual whose progression record will be updated. This verification may use the biological identity system described elsewhere in the filing, device-bound authentication, or continuous behavioral biometric verification during the session.
Identity verification is not a one-time gate at the start of a session. It is a continuous process that re-verifies identity at configurable intervals throughout the evaluation, so that mid-session substitution or mid-session assistance can be detected rather than slipping through after an initial check has passed.
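The interval-based re-verification can be sketched as a schedule of checks plus an admission rule for evidence. This is a hedged illustration only: the function names `reverify_schedule` and `admit`, and the conservative policy that evidence counts only when the checks on both sides of it passed, are assumptions rather than details from the filing.

```python
import bisect
from typing import List, Tuple


def reverify_schedule(session_seconds: int, interval_seconds: int) -> List[int]:
    """Times (seconds from session start) at which identity is re-checked."""
    if interval_seconds <= 0:
        raise ValueError("interval must be positive")
    count = session_seconds // interval_seconds
    return [i * interval_seconds for i in range(count + 1)]


def admit(evidence_times: List[int], checks: List[Tuple[int, bool]]) -> List[int]:
    """Admit evidence only when the identity checks immediately before and
    after it both passed. `checks` must be sorted by time. Evidence after
    the last check is held back until a later check brackets it."""
    check_times = [t for t, _ in checks]
    admitted = []
    for t in evidence_times:
        i = bisect.bisect_right(check_times, t)  # checks[:i] occurred at or before t
        before = checks[i - 1][1] if i > 0 else False
        after = checks[i][1] if i < len(checks) else False
        if before and after:
            admitted.append(t)
    return admitted
```

Under this policy a failed check at minute four also discards evidence produced since the check at minute two, because a substitution could have happened anywhere in that gap.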
Multimodal Evidence as Anti-Gaming Substrate
The multimodal evidence captured by the pipeline serves a second function beyond enriching assessment: it is the structural medium through which the system detects and invalidates attempts to manipulate capability gating decisions. The anti-gaming function operates through four mechanisms.
The first is cross-modality consistency enforcement. When a learner's text-based responses indicate mastery but the learner's physiological signals indicate confusion, distraction, or reliance on external assistance, the inconsistency is detected and flagged. Expert-level textual analysis produced alongside physiological markers of cognitive overload, such as elevated heart rate, increased galvanic skin response, or prolonged gaze fixation indicative of reading from an external source, is flagged for review. The flag does not automatically invalidate the assessment; it triggers additional verification and is recorded in the progression record as a data point the capability gate considers when weighing the credibility of the mastery evidence.
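The flag-without-invalidation behavior can be sketched as follows. The record layout, the function name `check_consistency`, and both thresholds are hypothetical; the filing is described here only as flagging the inconsistency, triggering additional verification, and recording it for the gate.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ProgressionRecord:
    learner_id: str
    flags: List[dict] = field(default_factory=list)


def check_consistency(record: ProgressionRecord, text_mastery: float,
                      overload_markers: Dict[str, bool],
                      mastery_threshold: float = 0.8,
                      marker_threshold: int = 2) -> bool:
    """Flag high textual mastery that co-occurs with several overload markers.
    Both thresholds are illustrative assumptions."""
    active = sorted(name for name, on in overload_markers.items() if on)
    if text_mastery >= mastery_threshold and len(active) >= marker_threshold:
        record.flags.append({
            "kind": "cross_modality_inconsistency",
            "markers": active,
            "needs_verification": True,  # triggers additional verification
            "invalidated": False,        # the assessment itself still stands
        })
        return True
    return False
```

The flag is appended to the record rather than overwriting the score, so the gate can weigh it later instead of having the decision made for it.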
The second is temporal pattern analysis. The system analyzes the temporal dynamics of responses across modalities to detect coaching, remote assistance, or automated response generation. A bimodal response-latency distribution, with slow responses correlated with higher difficulty, may indicate intermittent external assistance during the slow intervals. Keystroke timing that is too uniform to be consistent with natural human typing may indicate an automated response tool. When such deviations from expected multimodal temporal dynamics are detected, the mastery evidence is down-weighted.
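The keystroke-uniformity signal can be illustrated with a coefficient of variation over inter-key intervals. This is one plausible heuristic, not the method the filing specifies; the function names and the `min_cv` cutoff of 0.1 are assumptions chosen for the example.

```python
import statistics
from typing import List


def interkey_cv(timestamps_ms: List[float]) -> float:
    """Coefficient of variation (std / mean) of gaps between keystrokes."""
    intervals = [b - a for a, b in zip(timestamps_ms, timestamps_ms[1:])]
    if len(intervals) < 2:
        raise ValueError("need at least three keystrokes")
    mean = statistics.mean(intervals)
    if mean <= 0:
        raise ValueError("timestamps must be strictly increasing on average")
    return statistics.pstdev(intervals) / mean


def looks_automated(timestamps_ms: List[float], min_cv: float = 0.1) -> bool:
    """Near-constant spacing is treated as a possible automation signal.
    The cutoff is an illustrative assumption, not a calibrated value."""
    return interkey_cv(timestamps_ms) < min_cv


machine_like = [0, 50, 100, 150, 200, 250]   # perfectly even spacing
human_like = [0, 110, 180, 400, 470, 720]    # irregular spacing
```

With these inputs `looks_automated(machine_like)` is true and `looks_automated(human_like)` is false; a real deployment would need the cutoff calibrated against each learner's own typing.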
The third is spoofing detection. The pipeline detects attempts to substitute a different individual's performance for the registered learner's, leveraging the continuous identity verification above and augmenting it with behavioral biometric continuity analysis. If typing dynamics, vocal characteristics, or movement patterns diverge from the registered learner's established behavioral profile, a potential substitution event is flagged. This operates on behavioral signals and is complementary to the biological identity verification described elsewhere in the filing.
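Behavioral continuity analysis can be sketched as a distance between the session's observed features and the learner's stored baseline. The sketch below uses a mean absolute z-score; the feature names, the profile layout, and the threshold of 3.0 are hypothetical choices for illustration, not values from the filing.

```python
from typing import Dict, Tuple

# Hypothetical profile layout: feature name -> (baseline mean, baseline std).
Profile = Dict[str, Tuple[float, float]]


def profile_divergence(profile: Profile, observed: Dict[str, float]) -> float:
    """Mean absolute z-score over the features present in both inputs."""
    shared = [f for f in profile if f in observed]
    if not shared:
        raise ValueError("no shared features to compare")
    z_scores = []
    for f in shared:
        mean, std = profile[f]
        if std <= 0:
            raise ValueError(f"non-positive std for {f}")
        z_scores.append(abs(observed[f] - mean) / std)
    return sum(z_scores) / len(z_scores)


def possible_substitution(profile: Profile, observed: Dict[str, float],
                          threshold: float = 3.0) -> bool:
    """Flag (not prove) a substitution when behavior drifts far from baseline."""
    return profile_divergence(profile, observed) > threshold


baseline = {"typing_interval_ms": (140.0, 20.0), "pitch_hz": (180.0, 10.0)}
same_person = {"typing_interval_ms": 145.0, "pitch_hz": 182.0}
someone_else = {"typing_interval_ms": 60.0, "pitch_hz": 120.0}
```

For these values `possible_substitution(baseline, someone_else)` is true and the same call with `same_person` is false. The result is a flag for follow-up, consistent with the text's wording of a "potential" substitution event.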
The fourth is language model proposal down-weighting. When the pipeline detects evidence of gaming, the trust weight assigned to language model proposals that reference the compromised evidence is reduced. If a language model proposes a capability unlock based on mastery evidence that the anti-gaming substrate has flagged, the reduced trust weight causes the arbitration engine to prefer alternative proposals or to reject the unlock proposal entirely.
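The effect on arbitration can be sketched with a trust weight that shrinks for each piece of flagged evidence a proposal cites. Everything concrete here is an assumption for illustration: the `Proposal` fields, the multiplicative `penalty`, the `min_trust` floor, and the rule that the highest effective trust wins.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Proposal:
    action: str
    base_trust: float
    evidence_ids: FrozenSet[str]


def effective_trust(p: Proposal, flagged: FrozenSet[str], penalty: float = 0.5) -> float:
    """Multiply trust by `penalty` once per flagged evidence item cited."""
    compromised = len(p.evidence_ids & flagged)
    return p.base_trust * (penalty ** compromised)


def arbitrate(proposals: List[Proposal], flagged: FrozenSet[str],
              min_trust: float = 0.5) -> Optional[Proposal]:
    """Pick the proposal with the highest effective trust, or reject all
    (return None) when none clears the floor."""
    if not proposals:
        return None
    best = max(proposals, key=lambda p: effective_trust(p, flagged))
    return best if effective_trust(best, flagged) >= min_trust else None


unlock = Proposal("unlock_capability", 0.9, frozenset({"e1", "e2"}))
remediate = Proposal("assign_remediation", 0.6, frozenset({"e3"}))
flagged_evidence = frozenset({"e2"})
```

With `e2` flagged, the unlock proposal falls to an effective trust of about 0.45, so `arbitrate([unlock, remediate], flagged_evidence)` prefers the remediation proposal, and `arbitrate([unlock], flagged_evidence)` rejects the unlock outright.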
Composition With Adjacent Subsystems
The pipeline is the evidential layer beneath the rest of the skill-gating architecture. Its composite evaluations and its flagged inconsistencies feed the capability gate, which evaluates accumulated evidence against defined competency thresholds and produces a binary determination to open or remain closed. Because the gate operates continuously, the pipeline's ongoing output can also drive revocation: if continued evidence indicates competence has degraded, a previously opened gate can close.
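A continuously evaluated gate that can both open and revoke might look like the sketch below. The rolling-window rule, the class name `CapabilityGate`, and the parameter values are assumptions; the text states only that the gate compares accumulated evidence to thresholds, yields a binary result, and can close again.

```python
from collections import deque


class CapabilityGate:
    """Binary gate re-evaluated on every new composite score (sketch)."""

    def __init__(self, threshold: float, window: int):
        self.threshold = threshold
        self.recent = deque(maxlen=window)  # keeps only the newest `window` scores
        self.is_open = False

    def observe(self, composite_score: float) -> bool:
        """Record a score and recompute the open/closed state."""
        self.recent.append(composite_score)
        window_full = len(self.recent) == self.recent.maxlen
        mean = sum(self.recent) / len(self.recent)
        # Open only with a full window of evidence at or above the threshold;
        # the same rule closes the gate again when recent evidence degrades.
        self.is_open = window_full and mean >= self.threshold
        return self.is_open


gate = CapabilityGate(threshold=0.8, window=3)
states = [gate.observe(s) for s in [0.9, 0.85, 0.9, 0.5, 0.4]]
```

In this run the gate opens on the third observation and closes on the fourth, which is the revocation path: no separate mechanism is needed beyond re-running the same rule on fresh evidence.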
The pipeline also supplies the evidence corpus that backs certification token issuance and feeds the security architecture's drift detection, in which aging evidence and evidence produced under conditions that no longer obtain are progressively down-weighted. In embodied domains such as vehicle, robotics, and industrial operation, the same pipeline ingests domain-specific telemetry and video to drive continuous, real-time competence evaluation rather than a static profile.
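Progressive down-weighting of aging evidence can be illustrated with an exponential decay. The half-life form, the function name `evidence_weight`, and the extra factor for changed conditions are assumptions made for this sketch; the filing is described here only as down-weighting such evidence progressively.

```python
def evidence_weight(age_days: float, half_life_days: float,
                    conditions_still_hold: bool = True,
                    stale_condition_factor: float = 0.5) -> float:
    """Weight in (0, 1] that halves every `half_life_days`, with a further
    reduction when the evidence was produced under conditions that no
    longer obtain. Both parameters are illustrative."""
    if age_days < 0 or half_life_days <= 0:
        raise ValueError("age must be >= 0 and half-life must be > 0")
    weight = 0.5 ** (age_days / half_life_days)
    if not conditions_still_hold:
        weight *= stale_condition_factor
    return weight


fresh = evidence_weight(0, 90)                                  # brand-new evidence
aged = evidence_weight(90, 90)                                  # one half-life old
aged_and_stale = evidence_weight(90, 90, conditions_still_hold=False)
```

A smooth decay like this avoids a hard expiry date: old evidence keeps contributing, but less and less, until fresh evidence is needed to hold the gate open.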
Distinction From Conventional Assessment
Conventional competency assessment scores a single channel, most often a text response, against a reference. A single-channel score cannot detect a learner who produces a correct textual answer while their physiological and behavioral channels reveal that the answer did not originate from their own present competence. By keeping each modality's score vector independent and fusing them under consistency-aware rules, the pipeline makes that discrepancy a first-class signal.
Independent scoring is also what distinguishes the pipeline from any fusion that collapses the channels before scoring. Collapsing the channels into a single blended figure discards the per-modality signal on which cross-modality consistency enforcement depends. Here the modalities are scored separately and only then combined, so a disagreement between performance and physiology, or between performance and behavioral continuity, remains visible to the gate.
Disclosure Scope
The cognition filing (U.S. Application No. 19/647,395 and its international counterpart) discloses the multimodal evaluation pipeline. The disclosed pipeline comprises the five input-stream categories (text, audio, video, sensor-telemetry, and biometric); the modality-specific evaluation modules that each produce a structured score vector; the fusion engine that computes a consistency-aware composite from those vectors; the continuous trust-validated identity verification performed at each evaluation point; and the four-mechanism anti-gaming substrate (cross-modality consistency enforcement, temporal pattern analysis, spoofing detection, and language model proposal down-weighting). This article describes that disclosed mechanism. The scope extends to embodiments in which the modality set, the fusion weighting rules, and the identity-verification basis differ, provided the per-modality signals are scored independently and combined under consistency-aware fusion that feeds the capability gate.