320-Bit UID Construction: Multi-Segment Hashing for Negligible Collision Probability

by Nick Clark | Published March 27, 2026

A content unique identifier (UID) is constructed from canonical content and is bound, through separate commitment records, to the governance class associated with the content's origination and to the lineage path connecting the content to its anchoring root. The construction yields a 320-bit identifier composed of five 64-bit segments: a global hash over the canonical content and four sorted quadrant hashes that bind the content to its structural decomposition. The resulting UID is portable across substrates, tamper-evident under any modification to the underlying content, and computable identically by any party in possession of the content and the construction parameters; changes to the declared governance class are made evident by the commitment records rather than by the identifier bytes. This article expands the UID construction primitive to white-paper depth, treating it as the structural identity layer of the content anchoring architecture. The discussion proceeds through the construction mechanism, its operating parameters, its alternative embodiments across content classes and deployment topologies, its compositional surface against the broader anchoring architecture, the prior-art distinctions that define its novelty, and the formal scope of disclosure intended for enablement and claim support.


Mechanism

The UID construction mechanism takes as input the content, a governance class identifier, and a lineage segment that anchors the content to its origination. The output is a fixed-size 320-bit identifier whose internal structure preserves enough information about the input to permit several distinct verifications without disclosing the input itself. The construction proceeds in three stages: canonicalization, segmentation, and segment hashing. The segmentation and hashing stages together yield the five segments of the final identifier, while the governance class identifier and the lineage segment are bound to it afterward by commitment.

Canonicalization reduces the content to a deterministic byte sequence that is invariant under presentation differences. For text, canonicalization specifies normalization form, line ending convention, and whitespace handling. For media, canonicalization specifies the perceptual feature set extracted from the content and the encoding under which those features are serialized. For structured records, canonicalization specifies field order, encoding, and elision rules for absent fields. The canonical representation is the only artifact that participates in the subsequent hash steps; two byte sequences that canonicalize to the same form yield the same UID, and two that canonicalize differently yield different UIDs with overwhelming probability.

Segmentation partitions the canonical representation into four ordered quadrants. The segmentation rule is content-type-specific: textual content is partitioned by token offset into four roughly equal regions, image content is partitioned spatially into four quadrants of equal area, and structured records are partitioned by a deterministic ordering of fields into four groups. The segmentation rule is part of the construction parameters and is committed alongside the resulting UID, so any subsequent verifier can reconstruct the partition without ambiguity.
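
The token-offset rule for textual content can be sketched as follows. This is a minimal illustration that assumes whitespace-delimited tokens and integer-floor boundaries; the function name `split_quadrants` and the boundary rule are hypothetical and stand in for whatever segmentation rule is actually committed with the UID.

```python
def split_quadrants(tokens: list[str]) -> list[list[str]]:
    """Partition tokens into four ordered, contiguous, roughly equal regions."""
    n = len(tokens)
    # Boundary i sits at floor(i * n / 4), so region sizes differ by at most one.
    bounds = [(i * n) // 4 for i in range(5)]
    return [tokens[bounds[i]:bounds[i + 1]] for i in range(4)]


tokens = "the quick brown fox jumps over the lazy dog again".split()
for index, quadrant in enumerate(split_quadrants(tokens)):
    print(index, quadrant)
```

Because the boundaries depend only on the token count, any verifier applying the same rule to the same canonical tokens reconstructs the same partition.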

Segment hashing produces five independent 64-bit hash values: one global hash over the entire canonical representation, and one quadrant hash over each of the four quadrants. The global hash binds the UID to the content as a whole; the quadrant hashes bind the UID to the structural composition of the content and permit localized tamper detection. The four quadrant hashes are sorted in ascending numeric order before concatenation, so that the UID is invariant under quadrant relabeling. This sorting is what permits a UID to identify a content composition independently of an arbitrary quadrant naming convention while still being sensitive to the actual quadrant contents.

The five segments are concatenated to form a 320-bit identifier. The governance class identifier and the lineage segment are not directly embedded in the UID bytes; instead, they are bound to the UID through a separate commitment record that references the UID and is itself committed to the anchoring lineage structure. This separation permits the UID to be used as a content identity independent of any particular governance regime, while still allowing the governance binding to be recovered when needed by following the commitment record from the UID.
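
The hashing, sorting, and concatenation steps described above can be sketched in a few lines. The sketch assumes SHA-256 truncated to 64 bits as the segment hash and a simple byte-offset segmentation rule; the names `h64`, `quadrants`, and `build_uid` are illustrative and are not taken from the disclosure.

```python
import hashlib


def h64(data: bytes) -> bytes:
    """One 64-bit segment: SHA-256 truncated to its first 8 bytes (an assumed choice)."""
    return hashlib.sha256(data).digest()[:8]


def quadrants(canonical: bytes) -> list[bytes]:
    """Assumed segmentation rule: four contiguous byte ranges of near-equal size."""
    n = len(canonical)
    bounds = [(i * n) // 4 for i in range(5)]
    return [canonical[bounds[i]:bounds[i + 1]] for i in range(4)]


def build_uid(canonical: bytes) -> bytes:
    """Return the 40-byte UID: global segment, then the four sorted quadrant segments."""
    global_segment = h64(canonical)
    # Equal-length big-endian byte strings sort in the same order as the integers they encode.
    quadrant_segments = sorted(h64(q) for q in quadrants(canonical))
    return global_segment + b"".join(quadrant_segments)


uid = build_uid(b"example canonical content for the sketch")
print(len(uid) * 8, uid.hex())
```

Note that the governance class and lineage segment do not appear anywhere in `build_uid`, which mirrors the separation described in the text: they would be attached by commitment records that reference the resulting bytes.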

Tamper-evidence follows directly from the construction. Any modification to the canonical content changes the global hash segment and at least one quadrant hash segment with overwhelming probability, and among the quadrant segments the change is confined to those whose underlying quadrants were modified. A verifier presented with the original UID and a modified content sample can therefore not only detect that a modification occurred but also identify which quadrant was affected, by recomputing the sample's quadrant hashes and noting which of them no longer appears among the UID's quadrant segments, without needing access to the original content beyond the UID itself. This locality property is a deliberate design choice: it permits a verifier to bound the scope of a tampering event without performing a full content comparison, and it permits a content owner to point to the unchanged quadrant hashes as evidence that the unaffected regions remain intact.
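
A hedged sketch of the localization step: the verifier recomputes the sample's four quadrant hashes and reports the positions whose hash is absent from the UID's quadrant segments. The helper names and the byte-offset segmentation rule are assumptions made for illustration.

```python
import hashlib


def h64(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()[:8]


def quadrants(canonical: bytes) -> list[bytes]:
    n = len(canonical)
    bounds = [(i * n) // 4 for i in range(5)]
    return [canonical[bounds[i]:bounds[i + 1]] for i in range(4)]


def build_uid(canonical: bytes) -> bytes:
    return h64(canonical) + b"".join(sorted(h64(q) for q in quadrants(canonical)))


def affected_quadrants(original_uid: bytes, sample: bytes) -> list[int]:
    """Positions of sample quadrants whose hash is absent from the UID's quadrant slots."""
    slots = {original_uid[i:i + 8] for i in range(8, 40, 8)}
    return [i for i, q in enumerate(quadrants(sample)) if h64(q) not in slots]


original = bytes(range(64))
tampered = bytearray(original)
tampered[40] ^= 0xFF  # flip one byte inside the third quadrant (index 2)
print(affected_quadrants(build_uid(original), bytes(tampered)))
```

The sketch localizes in-place modifications. Under a byte-offset rule, insertions or deletions shift the quadrant boundaries and may cause several quadrants to be reported, which is one reason the segmentation rule is content-type-specific.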

The construction is deterministic and parallelizable. Each of the four quadrant hashes is computed independently of the others, so a multi-core implementation may compute all four in parallel with linear speedup up to the number of quadrants. The global hash, which proceeds over the entire canonical representation, may overlap with quadrant hashing on independent execution units. The sorting step that establishes quadrant-relabeling invariance is a fixed-cost operation over four 64-bit values and does not contribute meaningfully to total construction cost. Determinism requires only that the canonicalization, segmentation, and hash function configurations be identical across implementations; given matching configurations, two implementations on different hardware architectures will produce byte-identical UIDs over the same input.

The 320-bit width is chosen as a structural balance between collision resistance, transport cost, and the four-quadrant decomposition. A 64-bit global hash taken alone approaches its birthday bound at roughly 2^32 contents, so global uniqueness rests on the five segments jointly rather than on the global hash by itself; the four quadrant hashes additionally supply the localized tamper detection. A wider identifier would not improve the practical tamper-detection or uniqueness guarantees materially; a narrower identifier would compromise the locality property by forcing quadrant hashes into truncations too short to discriminate between content variations.

Operating Parameters

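The collision probability of the 320-bit UID is determined by the underlying hash function and the segment structure. Under the standard assumption that the hash function is indistinguishable from a random oracle, the probability that a given content collides on the 64-bit global hash with some member of a population of 2^32 distinct contents is approximately 2^(-32). The birthday bound is less forgiving: the probability that some pair within such a population collides on the global hash alone is roughly 39 percent, so the global hash is not sufficient by itself. The four sorted quadrant hashes supply the remaining collision resistance: a collision in the full 320-bit UID requires simultaneous collision in the global hash and in each of the four quadrant hashes. Even for two contents that differ in a single quadrant, and therefore share three quadrant hashes, the global hash and the differing quadrant hash must both collide, an event of probability about 2^(-128) per pair. This is far below any threshold of practical concern at any population size achievable in the foreseeable future of digital content.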
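
The standard birthday approximation, p ≈ 1 - exp(-n(n-1) / 2^(b+1)) for n items hashed to b bits under a random-oracle assumption, can be evaluated directly to compare a single 64-bit segment with wider combinations. The function name is illustrative.

```python
import math


def birthday_collision_probability(population: int, bits: int) -> float:
    """Approximate probability that any pair among `population` items collides on `bits` bits."""
    pairs = population * (population - 1) / 2
    # -expm1(-x) equals 1 - exp(-x) and stays accurate when x is tiny.
    return -math.expm1(-pairs / 2.0 ** bits)


for bits in (64, 128, 320):
    print(bits, birthday_collision_probability(2 ** 32, bits))
```

At a population of 2^32, a lone 64-bit segment collides with probability near 0.39, whereas 128 effective bits (the case of two contents differing in one quadrant) and the full 320 bits are both vanishingly small.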

The hash function used in each segment is a parameter of the construction rather than a fixed primitive. Constructions are disclosed using SHA-256 truncated to 64 bits per segment, BLAKE3 truncated similarly, and dedicated 64-bit hash functions tuned for the canonical representation type. The choice does not affect the structural properties of the UID; it affects only the computational cost of construction and the strength of the collision resistance in absolute terms.
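
Treating the hash as a parameter can be sketched by passing the 64-bit segment function into the construction. The sketch below assumes a byte-offset segmentation rule and uses two functions available in Python's standard library: SHA-256 truncated to 8 bytes, and BLAKE2b configured for an 8-byte digest. BLAKE2b is only a stand-in here, since BLAKE3 is not part of `hashlib`; all names are illustrative.

```python
import hashlib
from typing import Callable

Hash64 = Callable[[bytes], bytes]


def sha256_64(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()[:8]


def blake2b_64(data: bytes) -> bytes:
    # BLAKE2b with a native 8-byte digest; shown only because BLAKE3 is not in hashlib.
    return hashlib.blake2b(data, digest_size=8).digest()


def build_uid(canonical: bytes, hash64: Hash64) -> bytes:
    n = len(canonical)
    bounds = [(i * n) // 4 for i in range(5)]
    quads = [canonical[bounds[i]:bounds[i + 1]] for i in range(4)]
    return hash64(canonical) + b"".join(sorted(hash64(q) for q in quads))


content = b"the same canonical bytes under two hash configurations"
for name, fn in (("sha256/64", sha256_64), ("blake2b/64", blake2b_64)):
    print(name, build_uid(content, fn).hex())
```

Both configurations yield a 40-byte identifier with the same layout; only the segment values differ, which is why the hash choice must be recorded with the construction parameters.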

Construction cost scales linearly with the size of the canonical representation. The global hash makes one pass over the canonical bytes; each quadrant hash makes one pass over its quadrant. The total work is therefore approximately twice the work of a single hash over the content. For typical content sizes, construction completes in microseconds on commodity hardware, and the cost is dominated by canonicalization rather than by hashing.

The UID is byte-portable. Its 320 bits are encoded for transport and display in a fixed format, typically a base32 or hex string of fixed length. The format is independent of any substrate's storage layout, so a UID computed on one substrate can be recognized and verified on any other substrate that implements the same construction parameters. This portability is a deliberate architectural choice that makes the UID a content-layer identity rather than a substrate-layer identity.
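
As a small illustration of the fixed-length transport formats: 320 bits are 40 bytes, which encode to exactly 80 hexadecimal characters or 64 base32 characters with no padding. The helper names below are assumptions for the sketch.

```python
import base64

UID_BYTES = 40  # 320 bits


def to_hex(uid: bytes) -> str:
    if len(uid) != UID_BYTES:
        raise ValueError("a UID is exactly 40 bytes")
    return uid.hex()


def to_base32(uid: bytes) -> str:
    if len(uid) != UID_BYTES:
        raise ValueError("a UID is exactly 40 bytes")
    # 40 bytes is a multiple of 5, so the base32 form needs no '=' padding.
    return base64.b32encode(uid).decode("ascii")


def from_text(text: str) -> bytes:
    """Decode either fixed-length form, distinguishing them by length."""
    uid = bytes.fromhex(text) if len(text) == 80 else base64.b32decode(text)
    if len(uid) != UID_BYTES:
        raise ValueError("decoded value is not a 40-byte UID")
    return uid


sample = bytes(range(40))  # placeholder bytes, not a real UID
print(len(to_hex(sample)), len(to_base32(sample)))
```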

The lineage binding is recorded in a separate commitment that references the UID and the lineage segment. The commitment is itself committed to the anchoring lineage structure, so the binding is tamper-evident under the same cryptographic guarantees as the UID. Multiple commitments may reference the same UID under different lineage segments, supporting the case in which the same content participates in multiple lineages without requiring the content to have multiple identities.

The governance class binding is similarly recorded as a separate commitment. A UID may be bound to one governance class in one commitment and to a different governance class in another, supporting the case in which the same content is governed differently in different contexts. The commitments are mutually exclusive at the substrate level only when explicit policy demands it; in the default case, multiple bindings coexist, and any reader may select the binding relevant to its context.
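
One way to picture the lineage and governance bindings is as small records that reference the UID and are themselves hashed for inclusion in an append-only structure. The record layout, field names, and example values below are hypothetical; the disclosure does not fix a format.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Commitment:
    uid: bytes   # the 40-byte UID being bound
    kind: str    # "lineage" or "governance"
    value: str   # lineage segment or governance class identifier

    def digest(self) -> bytes:
        """Hash of the record, suitable for committing to an append-only structure."""
        hasher = hashlib.sha256()
        for part in (self.uid, self.kind.encode("utf-8"), self.value.encode("utf-8")):
            # Length-prefix each field so that field boundaries are unambiguous.
            hasher.update(len(part).to_bytes(4, "big"))
            hasher.update(part)
        return hasher.digest()


def bindings_for(uid: bytes, log: list[Commitment], kind: str) -> list[str]:
    """All values of the given kind bound to this UID, in log order."""
    return [c.value for c in log if c.uid == uid and c.kind == kind]


uid = bytes(40)  # placeholder UID for the sketch
log = [
    Commitment(uid, "lineage", "root-A/branch-1"),
    Commitment(uid, "governance", "class-public"),
    Commitment(uid, "governance", "class-restricted"),
]
print(bindings_for(uid, log, "governance"))
```

The two governance bindings coexist for the same UID, and a reader selects whichever applies to its context; the UID bytes are untouched by either.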

Construction parameters themselves are committed to a versioned configuration record referenced by the UID's surrounding metadata. A reader who possesses the UID and the configuration reference can recover the exact canonicalization, segmentation, and hash configuration used at construction time, and can recompute the UID over a presented content sample to verify identity. Configuration evolution is supported through versioning: a new configuration version is registered alongside the prior version, and content originated under the new version carries a configuration reference distinct from content originated under the prior version. This permits the architecture to adapt to advances in cryptographic primitives or to changes in canonical-form standards without invalidating prior identifiers.

Verification cost is symmetric with construction cost: a verifier reproduces the canonicalization, segmentation, and hashing steps, then compares the resulting UID to the presented one. Comparison is constant-time and does not depend on content size beyond the reproduction phase. For partial verification, a verifier in possession of the UID and a single quadrant of content may recompute only that quadrant's hash and compare it to the corresponding segment of the UID, after first identifying the matching segment by exhaustive comparison against the four sorted quadrant slots. This partial-verification pathway is what enables the locality property described in the Mechanism section to be exercised in practice.
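
The partial-verification pathway can be sketched as a membership test of one recomputed quadrant hash against the four sorted slots. The segment hash (SHA-256 truncated to 8 bytes) and the helper names are assumptions; the quadrants are supplied directly here rather than derived from a segmentation rule.

```python
import hashlib


def h64(data: bytes) -> bytes:
    """Assumed segment hash: SHA-256 truncated to 8 bytes."""
    return hashlib.sha256(data).digest()[:8]


def quadrant_slots(uid: bytes) -> list[bytes]:
    """The four sorted quadrant segments that follow the 8-byte global segment."""
    return [uid[i:i + 8] for i in range(8, 40, 8)]


def verify_quadrant(uid: bytes, quadrant: bytes) -> bool:
    """Partial verification: does this quadrant's hash occupy any of the four slots?"""
    return h64(quadrant) in quadrant_slots(uid)


# Build a UID by hand from four known quadrants, then check one of them.
parts = [b"alpha ", b"beta ", b"gamma ", b"delta"]
uid = h64(b"".join(parts)) + b"".join(sorted(h64(p) for p in parts))
print(verify_quadrant(uid, b"gamma "), verify_quadrant(uid, b"gamma!"))
```

Because the slots are sorted, a match establishes that the quadrant belongs to the content but not which position it occupied; positional information comes from the segmentation rule applied to the full sample.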

Alternative Embodiments

In a textual content embodiment, canonicalization applies Unicode normalization form NFC, normalizes line endings to LF, collapses whitespace runs to single spaces, and lowercases according to a published locale rule. Segmentation partitions by token offset into four roughly equal regions, with a deterministic rule for boundary tokens that span quadrant edges. The resulting UID identifies the content up to presentation differences while remaining sensitive to substantive textual change.
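
A minimal sketch of the textual canonicalization steps, under stated assumptions: whitespace collapsing is applied to runs of spaces and tabs only (so LF line breaks survive), and Python's locale-independent `str.lower()` stands in for the published locale rule. The function name is hypothetical.

```python
import re
import unicodedata


def canonicalize_text(text: str) -> bytes:
    """Reduce text to a deterministic byte sequence (assumed rules, see lead-in)."""
    text = unicodedata.normalize("NFC", text)
    text = text.replace("\r\n", "\n").replace("\r", "\n")  # line endings to LF
    text = re.sub(r"[ \t]+", " ", text)                    # collapse space/tab runs
    text = text.lower()                                    # stand-in for the locale rule
    # Defensive second pass, since case mapping is not guaranteed to preserve NFC.
    text = unicodedata.normalize("NFC", text)
    return text.encode("utf-8")


a = canonicalize_text("Hello  World\r\n")
b = canonicalize_text("hello world\n")
print(a == b)
```

Two inputs that differ only in case, line-ending convention, or spacing reduce to the same bytes and therefore to the same UID, while any change to the words themselves survives canonicalization.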

In a media content embodiment, canonicalization extracts a perceptual feature vector from the source media, serializes the vector under a published encoding, and applies the construction to the serialization rather than to the raw media bytes. Segmentation partitions the feature vector by perceptual region: spatially for image content, temporally for audio and video. This embodiment produces UIDs that survive presentation-layer transformations such as recompression and resampling, while remaining sensitive to perceptual change in the underlying content.

In a structured record embodiment, canonicalization specifies a field ordering, an encoding, and an elision rule for absent fields. Segmentation partitions the canonical representation into four field groups by a deterministic field-to-group mapping. This embodiment is suited to records whose identity is defined by content rather than by storage layout, and it permits the same record to be hashed identically across systems with different internal representations.
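
For structured records, one plausible canonical form is compact JSON with keys in sorted order and absent fields dropped. This is an assumed encoding chosen for illustration, as is the rule that assigns sorted field names to four contiguous groups; neither is prescribed by the text.

```python
import json


def canonicalize_record(record: dict) -> bytes:
    """Sorted keys, compact separators, and fields whose value is None elided."""
    present = {k: v for k, v in record.items() if v is not None}
    text = json.dumps(present, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def field_groups(record: dict) -> list[list[str]]:
    """Assumed field-to-group mapping: sorted field names split into four contiguous groups."""
    names = sorted(k for k, v in record.items() if v is not None)
    n = len(names)
    bounds = [(i * n) // 4 for i in range(5)]
    return [names[bounds[i]:bounds[i + 1]] for i in range(4)]


row_a = {"b": 1, "a": "x", "c": None}
row_b = {"a": "x", "b": 1}
print(canonicalize_record(row_a) == canonicalize_record(row_b))
```

The two rows differ in insertion order and in whether the absent field is represented at all, yet they canonicalize identically, which is the property that lets systems with different internal layouts agree on a UID.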

In a composite content embodiment, the construction is applied recursively: each sub-content has its own UID, and the parent content's canonical representation includes the sub-content UIDs in lieu of the sub-content bytes themselves. The resulting parent UID binds the parent to its sub-contents by their identities, supporting Merkle-style structural verification across content hierarchies.
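
The recursive case can be sketched by building the parent's canonical bytes from its own content followed by the UIDs of its children. The layout used here (an 8-byte length prefix, the parent's bytes, then child UIDs in order) is an assumption for illustration, as are the helper names and the byte-offset segmentation rule.

```python
import hashlib


def h64(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()[:8]


def build_uid(canonical: bytes) -> bytes:
    n = len(canonical)
    bounds = [(i * n) // 4 for i in range(5)]
    quads = [canonical[bounds[i]:bounds[i + 1]] for i in range(4)]
    return h64(canonical) + b"".join(sorted(h64(q) for q in quads))


def composite_uid(own_canonical: bytes, child_uids: list[bytes]) -> bytes:
    """UID of a parent whose canonical form embeds child UIDs instead of child bytes."""
    # Assumed layout: 8-byte length of the parent's own bytes, the bytes, then child UIDs in order.
    canonical = len(own_canonical).to_bytes(8, "big") + own_canonical + b"".join(child_uids)
    return build_uid(canonical)


chapters = [b"chapter one text", b"chapter two text"]
book = composite_uid(b"title page", [build_uid(c) for c in chapters])
print(book.hex())
```

A change to any chapter changes that chapter's UID, which in turn changes the parent's canonical bytes and therefore the parent UID, without the parent ever hashing the chapter bytes directly.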

In a streaming embodiment, the canonical representation is constructed incrementally, and the four quadrant hashes are computed from running state rather than from a complete canonical representation. The global hash is finalized at stream end, and the four quadrant hashes are finalized at the boundaries identified by the segmentation rule. This embodiment supports content that is too large to materialize fully in memory, with no change to the resulting UID compared to a non-streaming construction over the same canonical bytes.
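
A hedged sketch of incremental construction follows. It assumes a byte-offset segmentation rule and that the total canonical length is known before streaming begins, so that quadrant boundaries can be fixed up front; a rule that discovers boundaries during the stream would need different bookkeeping. The function name is illustrative.

```python
import hashlib
from typing import Iterable


def streaming_uid(chunks: Iterable[bytes], total_length: int) -> bytes:
    """Compute the UID from chunks without materializing the canonical bytes."""
    bounds = [(i * total_length) // 4 for i in range(5)]
    global_hash = hashlib.sha256()
    quad_hashes = [hashlib.sha256() for _ in range(4)]
    offset = 0
    for chunk in chunks:
        global_hash.update(chunk)
        for i in range(4):
            # Feed quadrant i the part of this chunk that overlaps its byte range.
            lo = max(bounds[i], offset)
            hi = min(bounds[i + 1], offset + len(chunk))
            if lo < hi:
                quad_hashes[i].update(chunk[lo - offset:hi - offset])
        offset += len(chunk)
    if offset != total_length:
        raise ValueError("stream length does not match the declared total")
    segments = sorted(h.digest()[:8] for h in quad_hashes)
    return global_hash.digest()[:8] + b"".join(segments)


data = bytes(range(256)) * 10
pieces = [data[i:i + 97] for i in range(0, len(data), 97)]
print(streaming_uid(pieces, len(data)).hex())
```

Chunk boundaries need not align with quadrant boundaries; each running hash simply receives the overlapping slice, so the result matches a one-shot construction over the same bytes.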

In a privacy-preserving embodiment, the canonical representation is salted with a content-owner-controlled secret before hashing. The resulting UID is computable only by parties in possession of the salt, and the same content under different salts yields different UIDs. This embodiment supports the case in which content identity should be verifiable by authorized parties but not enumerable by unauthorized observers.
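
One possible realization of salting is to key each segment hash with the owner's secret. The sketch below uses HMAC-SHA-256 truncated to 8 bytes, which is a swapped-in technique chosen for illustration rather than the salting method specified by the text; the byte-offset segmentation and all names are likewise assumptions.

```python
import hashlib
import hmac


def salted_h64(salt: bytes, data: bytes) -> bytes:
    """Keyed 64-bit segment hash: HMAC-SHA-256 under the salt, truncated to 8 bytes."""
    return hmac.new(salt, data, hashlib.sha256).digest()[:8]


def build_salted_uid(canonical: bytes, salt: bytes) -> bytes:
    n = len(canonical)
    bounds = [(i * n) // 4 for i in range(5)]
    quads = [canonical[bounds[i]:bounds[i + 1]] for i in range(4)]
    return salted_h64(salt, canonical) + b"".join(sorted(salted_h64(salt, q) for q in quads))


content = b"content whose identity should not be enumerable"
uid_a = build_salted_uid(content, b"owner-secret-A")
uid_b = build_salted_uid(content, b"owner-secret-B")
print(uid_a != uid_b)
```

A party holding the salt can recompute and verify the UID, while an observer without it cannot test candidate contents against a published identifier.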

In a multi-party-witness embodiment, construction is performed at multiple substrates that observe the same content, with each substrate independently computing a UID from its local canonicalization. The resulting UIDs are committed to a shared lineage record that demonstrates byte-equality across witnesses. Disagreement among witnesses is itself an event that the architecture surfaces, supporting forensic identification of content drift, canonicalization-rule divergence, or implementation defect. This embodiment is appropriate where content identity must be cross-validated as a precondition to consequential action, such as legal evidence handling or regulatory archival.
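
The cross-witness check reduces to grouping reported UIDs by value and surfacing any split. The witness names and the report structure in this sketch are hypothetical.

```python
def witness_agreement(reports: dict[str, bytes]) -> tuple[bool, dict[str, list[str]]]:
    """Group witnesses by the UID they reported; agreement means exactly one group."""
    groups: dict[str, list[str]] = {}
    for witness, uid in sorted(reports.items()):
        groups.setdefault(uid.hex(), []).append(witness)
    return len(groups) == 1, groups


reports = {
    "substrate-east": bytes(40),
    "substrate-west": bytes(40),
    "substrate-edge": bytes(39) + b"\x01",  # one witness disagrees
}
agreed, groups = witness_agreement(reports)
print(agreed)
for uid_hex, witnesses in groups.items():
    print(uid_hex[:16], witnesses)
```

When the groups split, the grouping itself is the forensic starting point: it shows which witnesses share a result and which diverge, before any investigation of whether the cause is content drift, a canonicalization mismatch, or a defect.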

Composition

The UID composes with the broader content anchoring architecture along three axes. Along the lineage axis, the UID is referenced by lineage commitment records that bind the content to its anchoring root. The lineage record is itself committed to an append-only structure, so the chain from anchor to UID to content is end-to-end tamper-evident. A reader presented with a lineage record can verify the UID against the content and the anchor against the lineage structure in two independent steps.

Along the governance axis, the UID is referenced by governance commitment records that bind the content to a governance class. The governance binding may be revised over time, with each revision recorded as a new commitment referencing the same UID. The UID itself is invariant under governance revision, preserving content identity across governance lifecycle events. This separation is what permits the same content to circulate under different governance regimes in different contexts without requiring the content to have multiple identities.

Along the substrate axis, the UID is independent of any particular substrate's storage or transport layer. The same UID is recognizable on a centralized cloud substrate, a federated multi-party substrate, a fully decentralized substrate, and an edge substrate. This substrate-independence is what makes the UID a portable identity capable of surviving substrate migrations and cross-substrate references. A content identified on one substrate can be referenced from another substrate without any translation or re-identification step.

Composition with the structural variance analysis layer of the content anchoring architecture is particularly tight. The four quadrant hashes that participate in the UID are derived from the same structural decomposition that the variance analysis operates on, so the UID and the variance signature are aligned by construction. A reader who has access to both can correlate the variance signature with the quadrant hashes to localize structural features within the content, supporting downstream analysis without re-decomposition.

Composition with the alias system permits human-readable names to be bound to UIDs through alias publication. An alias is a name in a namespace bound to a UID; the binding is itself a commitment record. Multiple aliases may reference the same UID, and aliases may be revised over time without affecting the UID itself. This decoupling is what permits content identity to be stable while presentation names evolve.

Prior-Art Distinction

Conventional content identification systems address the identity problem through metadata tags, watermarks, and centralized registries. Metadata tags are external annotations that travel alongside the content but can be stripped, modified, or replaced without affecting the content itself; the identity therefore detaches from the content under common operations such as recompression or format conversion. Watermarks are embedded in the content but degrade under transformation and can be removed by adversaries with access to the content. Centralized registries assign identifiers and maintain mappings to content, but the identifier-to-content binding depends on the registry's continued availability and integrity.

The UID construction described here removes these dependencies by deriving the identifier from the canonical content itself. There is no metadata that can be stripped, no watermark that can be removed, and no registry whose loss invalidates the binding. Any party in possession of the content and the construction parameters can recompute the UID and verify identity, and any modification to the content is detectable by direct comparison of UIDs.

Cryptographic content hashes such as SHA-256 over raw content bytes provide a partial analog but lack the segmented structure that permits localized tamper detection and the canonicalization step that ensures invariance under presentation differences. Two byte-different but semantically identical contents will yield different SHA-256 hashes, while the UID construction described here yields the same UID. This canonicalization invariance is essential for content identity to survive normal lifecycle operations.

Perceptual hashing systems provide invariance under perceptual transformations but typically yield short identifiers with high collision rates and are not suitable as global identities. The UID construction described here combines perceptual canonicalization with cryptographic-strength hashing to produce identifiers with both invariance and global uniqueness, neither of which the prior systems achieve simultaneously.

Merkle tree constructions provide a structural verification mechanism over hierarchical content but do not specify a canonicalization step or a sorted-quadrant aggregation that produces a fixed-size identifier independent of the content's hierarchical structure. The UID construction here operates at a single level over a four-quadrant decomposition and produces a 320-bit identifier of fixed size, which is a different and more constrained primitive suited to the content identity role.

The combination of canonicalization, four-quadrant sorted segmentation, 320-bit identifier size, governance and lineage binding through separate commitments, and substrate-independent portability distinguishes this construction from each of the cited references.

Disclosure Scope

This article describes the UID construction at a level sufficient to enable a person of ordinary skill in cryptographic content identification to implement it. The disclosure encompasses the canonicalization step, the four-quadrant segmentation rule, the five-segment hashing, the sorted-quadrant aggregation, the 320-bit identifier size, the separate governance and lineage commitments, and the substrate-independent portability. Embodiments are disclosed for textual, media, structured, and composite content, and for streaming, privacy-preserving, and multi-party-witness deployments.

The disclosure is non-limiting with respect to the choice of hash function, the specific canonicalization rules for content classes not enumerated, the encoding format for transport, and the storage backend for the lineage and governance commitments. Variations in these implementation details are within the scope of the disclosure provided that the structural properties of canonicalization-based identity, four-quadrant segmentation, sorted aggregation, and separate commitment binding are retained.

This article is part of a series describing the content anchoring architecture through structural variance analysis. Related disclosures cover the structural variance decomposition, the lineage commitment structure, and the alias publication system; together they constitute the full architectural specification for content identity and provenance.

Invented by Nick Clark. Founding investors: Anonymous, Devin Wilkie.