Mechanism
Multi-modal content identity in this disclosure is not a separate pipeline per media type and is not a cross-modal agreement test. It is a single extraction pipeline applied to every modality. A modality classifier determines the content type of an input artifact and routes it to a modality-specific normalization path. Each path produces the same kind of output: a bounded, two-dimensional scalar field of normalized values. The image path produces a grayscale scalar field; the audio path computes a short-time Fourier transform, applies a mel filterbank, and produces a mel-spectrogram scalar field; the text path applies TF-IDF weighting and positional grid mapping to produce a token frequency scalar field; the video path extracts per-frame scalar fields and computes a temporal delta vector. All paths converge on the shared multi-axis variance vector extraction stage, which produces a variance vector regardless of the source modality.
The variance vector is a nine-dimensional vector organized into three structural axes, designated X, Y, and Z, encoding cross-scale energy distribution, cross-scale frequency compaction, and structural phase persistence based on gradient orientation distribution. Because every modality reduces to a normalized scalar field before extraction, the multi-scale variance flow, gradient histogram, and edge density computations operate identically across modalities. The consequence is that audio, text, video, binary objects, vector graphics, and tabular data become directly comparable to images and to each other through the same cosine similarity operator over their variance vectors. The identity of a content object is its position in this continuous variance space, not a payload embedded in the object and not a manifest attached to it.
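The comparison operator named above can be sketched in a few lines. This is a minimal illustration, not the disclosed implementation: the two example vectors are invented values, and the layout of three components per axis is an assumption, since the text states only that the nine dimensions are organized into three axes.

```python
import math
from typing import Sequence

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch when a vector is all zeros
    return dot / (norm_a * norm_b)

# Hypothetical nine-dimensional variance vectors (assumed layout: X, Y, Z triples).
image_vec = [0.42, 0.31, 0.18, 0.55, 0.47, 0.29, 0.12, 0.33, 0.21]
audio_vec = [0.40, 0.33, 0.17, 0.52, 0.49, 0.30, 0.10, 0.35, 0.20]
score = cosine_similarity(image_vec, audio_vec)
```

Because the operator depends only on the vectors, nothing in it needs to know which modality produced either input.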
Audio Normalization
For audio waveform artifacts, the normalization procedure computes a time-frequency representation of the waveform using a short-time Fourier transform with a Hann window of approximately 2048 samples and a hop length of approximately 512 samples, producing a two-dimensional magnitude spectrogram. The spectrogram is mapped to the mel frequency scale using a filterbank of 128 mel bins spanning the frequency range that the artifact's sample rate can represent, that is, up to half the sample rate. The mapped spectrogram is then normalized to a canonical resolution of 256 time frames by 128 frequency bins, with log-magnitude scaling applied to compress dynamic range. This normalized mel-spectrogram serves as the scalar field input to the extraction pipeline.
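The audio path can be sketched with NumPy as follows. Several details are assumptions made for this sketch and are not given in the text: the mel conversion formula (a common 2595·log10 variant), triangular filters, `log1p` as the log-magnitude scaling, linear interpolation to reach 256 time frames, and min-max scaling to bound the field to [0, 1]. The function names are hypothetical.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    """Triangular filters spaced evenly on the mel scale from 0 Hz to sr / 2."""
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2.0), n_mels + 2))
    fb = np.zeros((n_mels, fft_freqs.size))
    for i in range(n_mels):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def audio_scalar_field(wave, sr, n_fft=2048, hop=512, n_mels=128, n_frames=256):
    """Return an (n_frames, n_mels) log-mel field scaled to [0, 1]."""
    wave = np.asarray(wave, dtype=float)
    if wave.size < n_fft:                      # pad very short inputs to one window
        wave = np.pad(wave, (0, n_fft - wave.size))
    window = np.hanning(n_fft)                 # symmetric Hann window
    starts = range(0, wave.size - n_fft + 1, hop)
    frames = np.stack([wave[s:s + n_fft] * window for s in starts])
    magnitude = np.abs(np.fft.rfft(frames, axis=1))          # (frames, n_fft // 2 + 1)
    mel = magnitude @ mel_filterbank(sr, n_fft, n_mels).T    # (frames, n_mels)
    log_mel = np.log1p(mel)                                  # compress dynamic range
    # Resample the time axis to the canonical frame count (assumed: linear interpolation).
    src = np.linspace(0.0, 1.0, log_mel.shape[0])
    dst = np.linspace(0.0, 1.0, n_frames)
    resized = np.stack(
        [np.interp(dst, src, log_mel[:, k]) for k in range(n_mels)], axis=1
    )
    span = resized.max() - resized.min()
    return (resized - resized.min()) / span if span > 0 else np.zeros_like(resized)
```

Calling `audio_scalar_field(samples, sample_rate)` on a mono waveform yields a 256 by 128 array that plays the same role as a normalized grayscale image in the later stages.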
The variance-based proxy over this field captures frequency energy distribution across time-frequency cells; the gradient histogram captures transitions between frequency regions and temporal onset patterns; the edge density metric captures the structural complexity of the spectral profile. The resulting nine-dimensional variance vector encodes audio texture, onset density, harmonic richness, and spectral centroid behavior in a form directly comparable to image-domain variance vectors through the same cosine similarity operator. A temporal delta vector for audio encodes the cross-frame cosine similarity between consecutive short-time spectral windows, providing a compact representation of temporal dynamics suitable for clip-level UID derivation.
Text Normalization
For textual document artifacts, the normalization procedure maps the document to a two-dimensional token frequency scalar field. Each distinct token in the document vocabulary is assigned a positional index along one axis and a frequency-weighted salience value along the other. The salience value is computed as the product of the token's term frequency within the document and the inverse of its document frequency within a reference corpus. The resulting field is normalized to a canonical resolution of 256 by 256 cells, with tokens assigned to cells by positional index and cell values representing aggregated salience scores.
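One possible reading of this mapping is sketched below. The text leaves several choices open, so the following are assumptions: token occurrences are laid out in row-major order by their position in the document, the inverse document frequency uses a smoothed logarithmic form, and the field is bounded by dividing by its peak value. The function name and the `doc_freq` and `n_docs` inputs are hypothetical.

```python
import math
from collections import Counter

def token_frequency_field(tokens, doc_freq, n_docs, size=256):
    """Map a token sequence to a size x size grid of aggregated TF-IDF salience.

    doc_freq: token -> number of reference-corpus documents containing it.
    n_docs:   number of documents in the reference corpus.
    """
    grid = [[0.0] * size for _ in range(size)]
    if not tokens:
        return grid
    tf = Counter(tokens)
    total = len(tokens)
    cells = size * size
    for pos, tok in enumerate(tokens):
        # Smoothed log IDF (an assumed convention, not taken from the text).
        idf = math.log((1 + n_docs) / (1 + doc_freq.get(tok, 0))) + 1.0
        salience = (tf[tok] / total) * idf
        cell = pos * cells // total              # positional index -> cell, row-major
        grid[cell // size][cell % size] += salience
    peak = max(max(row) for row in grid)
    if peak > 0:
        grid = [[v / peak for v in row] for row in grid]  # bound values to [0, 1]
    return grid
```

Documents longer than the cell count aggregate several tokens per cell, while short documents leave cells empty; either way the output has the fixed shape the extraction stage expects.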
Byte-level variance is computed as a supplementary signal across sliding windows of 64 bytes of the document's UTF-8 encoded representation, producing a secondary scalar profile blended with the token frequency field variance at a configurable weight. The gradient histogram over the token frequency field captures the distributional sharpness of vocabulary usage, the concentration of semantic content, and the regularity of positional token patterns, providing a structural fingerprint sensitive to document genre, authorship style, and compositional density. The resulting variance vector is directly operable by the slope-band assignment, anchor registration, and lineage graph construction methods used for every other modality.
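The supplementary byte-level signal could look like the sketch below. The 64-byte window comes from the text; the stride of 32 bytes, the scaling of byte values to [0, 1], and the linear blend are assumptions, and both function names are hypothetical.

```python
from statistics import pvariance

def byte_window_variance(text, window=64, step=32):
    """Population variance of UTF-8 byte values over sliding windows.

    Byte values are scaled to [0, 1] so the profile is on a scale
    comparable to a normalized scalar field (an assumption of this sketch).
    """
    data = [b / 255.0 for b in text.encode("utf-8")]
    if not data:
        return []
    if len(data) <= window:
        return [pvariance(data)]
    return [
        pvariance(data[i:i + window])
        for i in range(0, len(data) - window + 1, step)
    ]

def blend(field_variance, byte_variance, weight=0.25):
    """Linear blend of the two variance signals; weight is the configurable share
    given to the byte-level signal (default value is illustrative only)."""
    return (1.0 - weight) * field_variance + weight * byte_variance
```

A highly repetitive document yields a flat, near-zero profile, while mixed scripts or embedded binary-like content raise the per-window variance.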
Video: Two-Level Derivation
For video artifacts, the UID derivation system operates at two levels. At the frame level, each frame is treated as a raster image artifact and processed through the full image extraction pipeline, including canonical normalization, orientation canonicalization, global variance vector extraction, quadrant decomposition, and optionally structure and constellation signature computation. At the clip level, a temporal delta vector is derived by computing the cosine similarity between consecutive frame variance vectors and recording the resulting similarity scores as a one-dimensional temporal profile.
The temporal delta vector encodes the rate and magnitude of variance change across the clip, capturing scene transitions, motion intensity, and compositional rhythm. A clip-level UID is derived by applying the multi-axis extraction pipeline to the temporal delta vector treated as a one-dimensional signal, producing a variance vector that encodes the clip's dynamic structure and can be registered with anchor nodes for clip-level identity, lineage tracing, and similarity matching. This two-level architecture supports both frame-level deduplication and clip-level provenance tracking, enabling detection of partial reuse, remix, and cross-format reformatting across video content ecosystems.
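The clip-level temporal delta vector described above reduces to a loop over consecutive frame vectors. A minimal sketch, assuming the per-frame variance vectors have already been extracted; the function names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def temporal_delta_vector(frame_vectors):
    """Cosine similarity between each pair of consecutive frame variance vectors.

    Returns a one-dimensional profile with len(frame_vectors) - 1 entries.
    """
    return [
        cosine_similarity(prev, cur)
        for prev, cur in zip(frame_vectors, frame_vectors[1:])
    ]
```

Values near 1 indicate structurally stable stretches of the clip, and sharp drops mark candidate scene transitions.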
Binary, Vector, and Tabular Modalities
For binary object artifacts, including executable files, compiled archives, container payloads, and source code, the normalization procedure maps the byte sequence to a two-dimensional scalar field by reshaping the byte stream into a square or near-square matrix at a canonical resolution. Per-cell statistics are computed from three sources: sliding-window byte variance; byte-frequency entropy, approximated as the variance of byte-frequency counts within the window; and structural-section profile values where the artifact carries a recognized container structure such as a Portable Executable (PE) file, an Executable and Linkable Format (ELF) file, or an archive index. The resulting field is processed by the same extraction pipeline, producing a UID that supports cosine-similarity comparison across recompiled, repacked, or partially patched variants of the same underlying payload.
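A reduced sketch of the binary path follows. It covers only the per-cell byte variance; the entropy approximation and the container-section profile are omitted. The 64 by 64 resolution, the contiguous chunking of the byte stream into cells, and peak scaling are assumptions of this sketch, and the function name is hypothetical.

```python
from statistics import pvariance

def binary_scalar_field(data: bytes, size: int = 64):
    """Reshape a byte stream into a size x size grid of per-cell byte variance.

    The stream is split into size * size contiguous chunks, one per cell,
    and each cell holds the population variance of its chunk, scaled so
    the largest cell value is 1.0.
    """
    cells = size * size
    n = len(data)
    values = []
    for c in range(cells):
        chunk = list(data[c * n // cells:(c + 1) * n // cells])
        values.append(float(pvariance(chunk)) if chunk else 0.0)
    peak = max(values)
    if peak > 0:
        values = [v / peak for v in values]
    return [values[r * size:(r + 1) * size] for r in range(size)]
```

Padding regions and zero-filled sections show up as flat cells, while compressed or encrypted sections approach the peak value.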
For vector graphics artifacts, the normalization procedure rasterizes the vector content at a canonical resolution using a deterministic rasterization profile and processes the resulting raster field through the image extraction pipeline, optionally augmented by a path-density scalar field derived from cumulative path length and control-point density within each cell. For structured tabular artifacts, the procedure projects column-wise statistical descriptors, including per-column variance, cardinality, and inter-column correlation, into a two-dimensional cell grid indexed by column position and statistical descriptor type. These normalization recipes are illustrative: any normalization that produces a bounded, two-dimensional scalar field of normalized values may serve as input to the extraction pipeline without modification to the extraction, hashing, slope-band assignment, anchor governance, or lineage construction components.
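The tabular projection can be illustrated with a small grid whose rows are descriptor types and whose columns are column positions. This is a sketch under assumptions: numeric columns only, the mean absolute Pearson correlation with the other columns as the inter-column descriptor, and per-row peak scaling to bound the values. The function names are hypothetical.

```python
import math
from statistics import fmean, pvariance

def pearson(x, y):
    mx, my = fmean(x), fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0

def tabular_scalar_field(columns):
    """columns: list of equal-length numeric lists.

    Returns three rows (variance, cardinality, mean absolute correlation),
    each indexed by column position and scaled to [0, 1].
    """
    n = len(columns)
    variance = [float(pvariance(c)) for c in columns]
    cardinality = [float(len(set(c))) for c in columns]
    correlation = [
        fmean([abs(pearson(c, o)) for j, o in enumerate(columns) if j != i])
        if n > 1 else 0.0
        for i, c in enumerate(columns)
    ]

    def scale(row):
        peak = max(row)
        return [v / peak if peak > 0 else 0.0 for v in row]

    return [scale(variance), scale(cardinality), scale(correlation)]
```

A real table would need this small grid resampled to the canonical resolution before extraction; that step is left out here.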
Streaming Content
For real-time streaming content, including live video broadcasts, audio streams, and continuous sensor data, the system operates over a sliding window of the stream rather than over a discrete artifact. A sliding window of configurable duration, for example 10 seconds for audio or 30 frames for video, is extracted, normalized, and processed through the extraction pipeline to produce a window-level UID. Consecutive window UIDs are compared by cosine similarity to detect structural continuity or discontinuity in the stream, and each window-level UID is registered with the anchor network for real-time provenance tracking.
When the cosine similarity between consecutive window UIDs falls below a configured continuity threshold, the system records a scene transition event in the anchor's event log. When the cosine similarity between a window-level UID and a registered reference UID exceeds the policy-declared similarity threshold, the system generates a real-time match event that may trigger policy enforcement actions including blocking of unauthorized retransmission, generation of a consultation event record, or invocation of the pre-release admissibility engine. This architecture enables live broadcast monitoring and continuous provenance tracking across streaming content platforms without requiring offline batch processing.
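The two threshold rules can be sketched as a single pass over window-level UIDs. The threshold defaults, event names, and tuple layout are illustrative assumptions, and the policy actions that a match may trigger are not modeled.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def stream_events(window_uids, references,
                  continuity_threshold=0.9, match_threshold=0.95):
    """Return (window_index, event_type, detail) tuples for a UID sequence.

    references: reference_id -> registered reference UID vector.
    """
    events = []
    previous = None
    for i, uid in enumerate(window_uids):
        # Continuity rule: similarity to the previous window falls below threshold.
        if previous is not None and \
                cosine_similarity(previous, uid) < continuity_threshold:
            events.append((i, "scene_transition", None))
        # Reference-match rule: similarity to a registered UID exceeds threshold.
        for ref_id, ref_uid in references.items():
            if cosine_similarity(uid, ref_uid) > match_threshold:
                events.append((i, "match", ref_id))
        previous = uid
    return events
```

In a live setting the same logic would run incrementally, holding only the previous window UID in memory, which is what removes the need for offline batch processing.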
Synthetic-Content and Recapture Signals
The same variance representation supports adversarial robustness through structural lineage rather than through a cross-modal agreement test. A generatively synthesized artifact has no structural lineage to any prior registered artifact in the anchor network: its variance vector position in slope space reflects the statistical properties of the generative model's output distribution rather than the variance profile of any specific prior artifact. The lineage query module queries the anchor network for registered parent UIDs within a configured slope continuity radius; if none falls within the radius, the orphan detector classifies the artifact as structurally unanchored. Such artifacts are not necessarily fraudulent, but they cannot be admitted under a policy object that requires verifiable provenance, and they trigger heightened scrutiny under policy objects governing synthetic content.
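The orphan check amounts to a radius query against registered UIDs. In the sketch below the slope continuity radius is interpreted as a cosine distance (one minus cosine similarity), which is an assumption; the text does not fix the metric. The registry is a plain dictionary standing in for the anchor network, and the function names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def find_parents(candidate, registered, radius=0.05):
    """Return ids of registered UIDs within `radius` cosine distance of the candidate."""
    return [
        uid_id for uid_id, uid in registered.items()
        if 1.0 - cosine_similarity(candidate, uid) <= radius
    ]

def is_structurally_unanchored(candidate, registered, radius=0.05):
    """True when no registered parent UID falls within the radius."""
    return not find_parents(candidate, registered, radius)
```

An unanchored result is a classification input for policy evaluation, not a verdict on the artifact.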
The screenshot recapture classifier exploits a characteristic variance signature introduced when a digital display renders an image and a camera or screen-capture device re-captures the rendered output. Rendering and recapture together introduce a periodic spatial frequency structure attributable to the display's sub-pixel geometry, compression and dithering artifacts, and the optical point-spread function of the capturing lens. These manifest in the Z-axis gradient histogram component as elevated energy in the horizontal and vertical orientation bins relative to the diagonal bins, producing a horizontal-vertical bias score evaluated against a policy-calibrated threshold. The synthetic content detector separately compares the candidate's variance vector against a slope-band-indexed statistical model of the variance vector profiles of known synthetic content. Each signal operates over variance-derived UIDs alone, without reference to the original artifact and without model introspection.
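The horizontal-vertical bias score can be illustrated on a raw scalar field. This is a simplified sketch: it builds a four-bin, magnitude-weighted gradient orientation histogram from forward differences and takes the ratio of axis-aligned to diagonal energy. The bin count, the difference scheme, and the ratio form are assumptions, and the policy-calibrated threshold is not modeled.

```python
import math

def hv_bias_score(field, eps=1e-12):
    """Ratio of horizontal+vertical to diagonal gradient-orientation energy.

    field: 2D list of floats. Orientations are unsigned (folded into [0, pi))
    and binned to the nearest of 0, 45, 90, and 135 degrees.
    """
    bins = [0.0, 0.0, 0.0, 0.0]
    for y in range(len(field) - 1):
        for x in range(len(field[0]) - 1):
            gx = field[y][x + 1] - field[y][x]      # forward difference along x
            gy = field[y + 1][x] - field[y][x]      # forward difference along y
            mag = math.hypot(gx, gy)
            if mag == 0.0:
                continue
            angle = math.atan2(gy, gx) % math.pi
            bins[round(angle / (math.pi / 4)) % 4] += mag
    return (bins[0] + bins[2]) / (bins[1] + bins[3] + eps)

# Synthetic test patterns: axis-aligned stripes versus a diagonal ramp pattern.
stripes = [[float(x % 2) for x in range(16)] for _ in range(16)]
diagonal = [[float((x + y) % 4) for x in range(16)] for y in range(16)]
```

On these patterns the striped field scores far above 1 and the diagonal field scores near 0, which is the direction of separation the classifier relies on.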
Disclosure Scope
The multi-modal content identity mechanism described here is disclosed in PCT International Application No. PCT/US26/28630. It comprises the modality classifier that routes an artifact to a modality-specific normalization path; the normalization procedures for audio mel-spectrograms, text token-frequency grids, video frame-level and clip-level derivation, and binary, vector graphics, and tabular artifacts; the shared multi-axis variance vector extraction pipeline producing a nine-dimensional variance vector comparable across modalities by cosine similarity; the sliding-window streaming extension with continuity and reference-match thresholds; and the orphan-detection, screenshot-recapture, and synthetic-content signals derived from the same variance representation. This article describes that disclosed mechanism. The disclosure states that any normalization producing a bounded, two-dimensional scalar field of normalized values may serve as pipeline input, so the scope extends to modalities and scalar-field projections not enumerated, provided they preserve structural variation across cells and feed the same extraction, hashing, slope-band assignment, anchor governance, and lineage construction components without modification.