Multi-Root Composite Lineage Graphs: Provenance Through Variance Vector Similarity

Nick Clark

Multi-Root Composite Lineage Graphs: Provenance Through Variance Vector Similarity

by Nick Clark | Published March 27, 2026 | PDF

Composite content — a montage that combines several photographs, a document that quotes from multiple sources, an audio mix assembled from independent stems — is created by drawing material from several upstream works. The provenance system disclosed in Provisional Application 63/808,372 records each such combination as a multi-root lineage graph in which the composite carries weighted edges to every component it draws from. The weights are computed from cosine similarity of structural variance vectors derived from the composite and its candidate predecessors, so the lineage of each component is anchored to the composite by a measurable structural relationship rather than by an external label. A composite cannot launder the provenance of its sources: if a component is missing from the lineage graph, the absence is detectable from the composite's own variance vector, and the composite is flagged as carrying an unattributed source.

Mechanism

The composite lineage mechanism operates on the structural variance vector of a candidate composite work. The vector is computed by the same anchoring procedure that assigns identity to atomic works: the content is partitioned into structural units appropriate to its modality, an variance measure is taken over each unit, and the resulting measures are concatenated and normalized to produce a fixed-dimension vector. For composite content, the vector reflects the contributions of all components: regions drawn from one source contribute their structural signature in proportion to their extent in the composite, regions drawn from another contribute theirs, and the result is a superposition that can be decomposed against a candidate set of predecessors.

Decomposition proceeds by computing the cosine similarity between the composite vector and the variance vectors of each candidate predecessor. Predecessors whose similarity exceeds a calibrated threshold are admitted as roots of the composite's lineage graph; an edge is written from the composite's unique identifier to each admitted predecessor's identifier, weighted by the similarity score. The result is a directed acyclic graph in which the composite has multiple parents, each parent edge carries a quantitative weight, and the sum of the weights characterizes the proportion of the composite's structure that has been attributed.

A composite is required to attribute its full structure. The mechanism computes a residual vector — the component of the composite's variance vector that is not explained by the admitted predecessors — and tests its norm against an admissibility threshold. A residual norm below the threshold indicates that the lineage graph accounts for the composite; a residual above the threshold indicates that one or more components are missing from the graph. In the latter case, the composite is marked as carrying an unattributed contribution and is flagged for further investigation. The flag travels with the composite's identity record and is itself part of the public lineage data.

The lineage graph is append-only and content-addressed. The composite's identifier is a hash of its variance vector together with the set of admitted parent edges and their weights; modifying the parent set or the weights produces a new identifier, so a composite cannot be silently re-attributed after publication. Verifiers receiving a composite recompute its variance vector, recompute the cosine similarities against the claimed parents, and verify that the residual is below the threshold; the verification is independent of any registry and can be performed offline.

Operating Parameters

Reference implementations partition image and video content into spatial-frequency bands and temporal segments respectively, partition audio into spectral bands and time windows, and partition text into syntactic and semantic units. Per-partition statistics are then composed into the lineage vector: structural variance of the partition's normalized scalar field (per the multi-axis variance pipeline described in the variance-vector article) provides the primary signal, optionally augmented by partition-level information density estimated as the Shannon entropy of the partition's symbol distribution where the partitioning is defined over a discrete alphabet. Vectors are normalized to unit length so that cosine similarity ranges over the closed interval from negative one to positive one and is invariant to overall scale. Vector dimensions in the reference implementation range from several hundred to several thousand, chosen to balance discrimination against the cost of similarity computation across large candidate sets.

The cosine-similarity admission threshold is calibrated empirically per modality and per intended use. For high-confidence attribution suitable for legal record, the threshold is set high enough that the false-positive rate against an unrelated corpus is negligible; for exploratory attribution suitable for content discovery, the threshold is relaxed. The mechanism records the threshold in force at the time of admission alongside the edge weight, so verifiers can reproduce the admission decision exactly.

The residual admissibility threshold is similarly calibrated and similarly recorded. It is chosen so that a composite drawn entirely from declared predecessors yields a residual below the threshold under expected modality-specific noise, while a composite that incorporates an undeclared component yields a residual above the threshold with high probability. The thresholds are not policy parameters set by the publisher of the composite; they are properties of the modality calibration and are signed by the calibration authority, so a publisher cannot relax them to admit a composite that would otherwise be flagged.

The candidate set against which a composite is decomposed is bounded by the deployment. In a closed corpus the candidate set is the corpus itself; in an open setting the candidate set is constructed by approximate nearest-neighbor search over an indexed set of variance vectors. The mechanism records the candidate set or its index version, so the lineage graph is reproducible.

Alternative Embodiments

The composite lineage mechanism admits several embodiments. A modality-uniform embodiment computes a single variance vector that spans all modalities present in the composite; a modality-partitioned embodiment computes separate vectors per modality and decomposes each independently, yielding a lineage graph with edges typed by modality. The partitioned embodiment is appropriate where components contribute in different modalities — for example, an image overlaid with text — because it isolates the contribution of each modality and prevents a strong contribution in one from masking a missing contribution in another.

An embodiment in which weights are normalized so that the admitted edges sum to a value approaching unity exposes residual contribution as the deficit from unity, providing an intuitive readout of unattributed content. An alternative embodiment retains raw cosine values and exposes residual contribution through the residual vector norm directly. Both embodiments support the same flagging behavior; the difference is presentational.

Hierarchical embodiments admit composites whose components are themselves composites. The mechanism recurses: the parent edge from a composite to a component composite is computed from cosine similarity of their respective variance vectors, and the verifier may follow the edge into the component composite's own lineage graph. The recursion terminates at atomic works whose lineage graph has no parent edges. The hierarchical embodiment is appropriate for content pipelines in which intermediate composites are themselves published.

Embodiments differ in how candidate predecessors are sourced. A registry embodiment maintains a published index of variance vectors and identifiers; a peer-discovery embodiment exchanges vectors among cooperating publishers; a federated embodiment partitions the index by jurisdiction or domain. The mechanism is indifferent to the source of candidates provided that each candidate carries a verifiable identity; the lineage graph records which index was consulted, so the attribution remains auditable.

Embodiments may further differ in the per-partition statistic. The primary signal is structural variance of the partition's normalized scalar field, computed using the multi-axis variance pipeline. Where additional discrimination is warranted the variance signal may be augmented by an information-density signal — Shannon entropy of the partition's symbol distribution in reference implementations, with Rényi or Tsallis entropy admissible where the modality calibration warrants. The choice of statistic and any auxiliary entropy estimator is recorded in the modality calibration profile, and vectors computed under different statistics are not interchangeable; the mechanism enforces this by including the calibration identifier in the vector's signature.

Composition with Other Mechanisms

Composite lineage composes with the atomic content-anchoring mechanism that assigns variance-derived identity to non-composite works. The same variance vectors that anchor atomic identity serve as the candidate predecessors against which composites are decomposed; no parallel data structure is required. The lineage graph extends the identity record from a singleton to a set of weighted parent edges, but the underlying anchoring procedure is unchanged.

Composite lineage composes with output-provenance fingerprinting in cases where one or more components of a composite are themselves the outputs of generative models. The provenance fingerprint of the generated component is recorded in its identity record, and the composite's lineage graph carries an edge to that record; verifiers traversing the lineage learn that the component is generative output and can apply policy to that fact. The composite cannot mask the generative origin of a component because the component's identity record is the same record that the lineage edge points to.

Composite lineage composes with downstream rights-management and licensing systems through the weighted edges. A licensing system that allocates payment in proportion to contribution can read the edge weights directly; a rights-management system that requires unanimous consent of all roots can enumerate the parent set. Because the weights are derived from a measurable structural relationship rather than from publisher assertion, the composition is resistant to manipulation.

Prior-Art Distinctions

Conventional provenance systems record component-of relationships through publisher assertion: the publisher of a composite declares which works it incorporates, and the system stores the declaration. The trust model is that the publisher is honest, and the failure mode is that an undeclared component is undetectable from the composite alone. The disclosed mechanism replaces publisher assertion with a measurable structural relationship between the composite and its candidate predecessors; an undeclared component is detectable as a residual in the composite's own variance vector.

Watermarking and steganographic provenance systems embed a marker in the content that identifies its source. Watermarks survive certain transformations but are removable by adversarial processing, and they presume that every relevant source applies a watermark. The disclosed mechanism does not depend on a marker: the variance vector is computed from the content's own structure, and the relationship to predecessors is computed from the vector. A source that did not voluntarily mark its content is still detectable as a contributor to a composite.

Perceptual hashing systems compute a fingerprint from content and use it for near-duplicate detection. The disclosed mechanism extends beyond near-duplicate detection in two respects: the variance vector is structured for decomposition rather than for direct comparison, so a composite that is not a near-duplicate of any single predecessor is nonetheless decomposed into its predecessors; and the multi-root graph records the result of the decomposition, so the provenance is not merely a list of similar items but a weighted attribution.

Graph-based provenance standards permit multi-parent records but rely on external assertion to populate the parents. The disclosed mechanism populates the parents from a measurement, and validates the population by residual analysis, so the lineage graph is not merely permitted to be multi-rooted but is required to be complete with respect to the composite's measurable structure.

Disclosure Scope

This article describes the multi-root composite lineage mechanism as disclosed in Provisional Application 63/808,372 covering content anchoring through structural variance analysis. The disclosure encompasses the computation of variance vectors for composite content, the decomposition of those vectors against candidate predecessor vectors via cosine similarity, the construction of weighted multi-parent lineage edges, the residual analysis used to detect unattributed components, and the content-addressed identity of the resulting lineage graph.

The scope extends to embodiments that vary the modality partitioning, the variance estimator, the similarity measure, the weight normalization, the recursion behavior over composite predecessors, and the source of candidate predecessors, provided that the resulting lineage graph is constructed from measured structural relationships and that residual analysis is used to detect missing components. Specific publishing, licensing, and discovery applications built atop the lineage graph are out of scope except as exemplars.

Claims arising from the disclosure cover the structural arrangement and its enforcement consequences, including the inability of a composite to launder the provenance of its sources, the auditability of per-component attribution, and the verifiability of the lineage graph independent of any central registry. Implementations practicing one or more of these features in combination fall within the claim scope regardless of the modality of the content or the application domain in which the composites are produced.