Stable Sketching and Helper Data for Biological Features

Nick Clark

by Nick Clark | Published March 27, 2026

A stable biological sketch is a compact, low-entropy projection of a noisy biometric signal that survives sensor swaps, illumination changes, posture variation, and ordinary physiological drift. Disclosed herein is a sketching mechanism in which a feature vector derived from a biological source is bound to publicly storable helper data, such that subsequent observations of the same source reproduce an identical binary sketch despite measurement noise, and such that the helper data discloses no exploitable information about the underlying signal. The sketch suffices for identity continuity across modalities and time; the helper data carries the burden of noise tolerance.

Mechanism

The sketching mechanism comprises three coupled stages: feature extraction, banded quantization, and helper-data generation. In the first stage, a raw biological observation, which may be a fingerprint ridge map, an iris texture, a vascular pattern, a cardiac waveform, a gait time series, or any combination thereof, is reduced to a feature vector of fixed dimensionality. Feature extraction is performed by a deterministic transform whose output distribution is characterized in advance, so that the variance of each feature dimension is known and bounded under the expected acquisition envelope.

In the second stage, each feature dimension is quantized by assignment to a band. A band is a contiguous interval of the feature axis whose width is selected as a function of the measured noise standard deviation along that axis. Each band carries a binary label, and the concatenation of band labels across all dimensions constitutes the sketch. The banding is aligned so that the median feature value observed during enrollment falls at the center of a band; this places the enrollment value half a band width from either boundary, which is the largest margin available, and therefore maximizes the probability that subsequent observations land in the same band.
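By way of non-limiting illustration, the banded quantization described above can be sketched as follows. The function names, the alternating 0/1 labelling of adjacent bands, and the numeric values are assumptions made for the example; the disclosure does not fix a particular labelling.

```python
import math

def band_width_for(noise_std, multiple=3.0):
    """Band width as a multiple of the measured noise standard deviation."""
    return multiple * noise_std

def band_index(value, band_width):
    """Index k of the half-open band [k*w, (k+1)*w) that contains value."""
    return math.floor(value / band_width)

def band_label(index):
    """Assumed labelling: adjacent bands alternate 0, 1, 0, 1 along the axis."""
    return index % 2

def margin_to_boundary(value, band_width):
    """Distance from value to the nearest boundary of its band."""
    position = value % band_width
    return min(position, band_width - position)

def sketch_bits(features, band_widths):
    """Concatenate one band label per feature dimension."""
    return [band_label(band_index(x, w)) for x, w in zip(features, band_widths)]

# Hypothetical three-dimensional feature vector with equal noise on each axis.
widths = [band_width_for(0.1)] * 3
print(sketch_bits([0.74, 0.10, -0.10], widths))  # prints: [0, 0, 1]
```

A value sitting at the band center has a margin of half the band width, which is the case the alignment step is intended to produce.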

The choice of feature transform is constrained by three structural requirements. First, the transform must be deterministic, so that identical raw inputs produce identical feature vectors; stochastic transforms violate the reproducibility on which the helper-data construction depends. Second, the transform must be approximately equivariant under the nuisance variables that practical deployment introduces, so that a sensor swap or illumination change shifts the feature vector by an amount small relative to the band width rather than by an amount that crosses bands. Third, the transform must concentrate informational content in a fixed set of dimensions, so that the per-dimension noise envelope can be characterized once and reused across deployments rather than re-estimated per subject. Transforms satisfying these requirements include band-pass projections of cardiac waveforms, Gabor-filtered iris codes, minutia-graph embeddings of fingerprint ridges, and learned encoders whose training objective explicitly penalizes nuisance sensitivity.

In the third stage, helper data is computed, per dimension, as the offset, taken modulo band width, that carries the enrollment feature value to the center of a band. The helper data is published or stored alongside the identity record. At verification time, the offset is added to the observed feature value before quantization; this restores the band alignment used at enrollment, so that the shifted observation departs from the band center only by the measurement noise, even when the unshifted value lies close to a boundary. Error-correcting codes operating across band labels provide a second layer of robustness for dimensions whose drift exceeds half the band width.

The sketching mechanism is invariant under the substitutions that practical deployment requires: the same sketch is recovered when the sensor is replaced by a different sensor of equivalent tier, when ambient illumination changes within the calibrated envelope, when posture or contact pressure varies within the calibrated envelope, and when the subject's physiological state varies within the bounds for which the feature transform was characterized.
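By way of non-limiting illustration, helper-data generation and sketch reproduction for a single feature dimension can be sketched as follows. The function names `enroll` and `reproduce` and the numeric values are assumptions made for the example, and the offset is taken here as the non-negative shift that moves the enrollment value to a band center.

```python
import math

def enroll(value, band_width):
    """Return (helper_offset, band_index) for one feature dimension."""
    residue = value % band_width                       # position inside the band, in [0, w)
    offset = (band_width / 2 - residue) % band_width   # shift that lands on a band center
    index = math.floor((value + offset) / band_width)  # band of the centered value
    return offset, index

def reproduce(observed, offset, band_width):
    """Apply the stored offset to a later observation, then quantize."""
    return math.floor((observed + offset) / band_width)

# Enrollment value 0.58 lies only 0.02 below the boundary at 0.60 (band width 0.3).
offset, enrolled = enroll(0.58, 0.3)
within_margin = reproduce(0.63, offset, 0.3)   # noise +0.05, less than half a band
beyond_margin = reproduce(0.74, offset, 0.3)   # noise +0.16, more than half a band
print(enrolled, within_margin, beyond_margin)  # prints: 2 2 3
```

Without the offset, the observation 0.63 would already fall in a different band than 0.58. With it, only excursions larger than half the band width change the band, and those are the cases left to the error-correcting layer.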

The privacy guarantee carried by the helper data is structural rather than computational. Because the offset is taken modulo band width, the helper data discloses only the residue of the feature value within a single band; it does not disclose which band, and therefore determines the feature value only up to an unknown whole number of band widths. An adversary observing the helper data alone learns the position of a subject's feature value within its band but not the band itself, and learns nothing about how feature values are distributed across bands in the population. The sketch, taken together with the helper data, identifies the subject; either alone does not.
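By way of non-limiting illustration, the residue-only property can be checked numerically: two feature values that differ by a whole number of band widths yield the same helper offset. The function name and the values are assumptions made for the example, and the check illustrates the modular construction only; it is not a proof of the privacy guarantee.

```python
def helper_offset(value, band_width):
    """Shift, modulo band width, that moves value to the center of a band."""
    return (band_width / 2 - value % band_width) % band_width

band_width = 0.25
low = 0.5625                   # lies in band 2
high = low + 3 * band_width    # lies in band 5, same position within the band

# Identical helper data for both values, so the offset does not reveal the band.
print(helper_offset(low, band_width), helper_offset(high, band_width))  # prints: 0.0625 0.0625
```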

Operating Parameters

Band width is the primary tunable parameter. It is set, per feature dimension, to a multiple of the measured noise standard deviation; multiples in the range of two to six standard deviations are contemplated, with three being a default that balances false-reject rate against discriminative power. Sketch length, expressed in bits, is determined by the dimensionality of the feature vector and the bit allocation per dimension. Sketches in the range of one hundred twenty-eight to one thousand twenty-four bits are contemplated, with shorter sketches preferred for transport and longer sketches preferred for high-assurance domains.
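By way of non-limiting illustration, the trade-off governed by the band-width multiple can be made concrete under two assumptions that are not stated in the disclosure: the measurement noise is zero-mean Gaussian, and the helper data has centered the enrollment value in its band. An observation then stays in its band when the noise magnitude is below half the band width, which for a multiple k has probability erf(k / (2 * sqrt(2))).

```python
import math

def same_band_probability(multiple):
    """P(|noise| < multiple * sigma / 2) for zero-mean Gaussian noise."""
    return math.erf(multiple / (2 * math.sqrt(2)))

for k in (2, 3, 4, 6):
    print(k, f"{same_band_probability(k):.4f}")
# prints:
# 2 0.6827
# 3 0.8664
# 4 0.9545
# 6 0.9973

# Expected number of out-of-band dimensions in a hypothetical 128-dimension
# sketch at the default multiple of three, assuming independent dimensions.
expected_excursions = 128 * (1 - same_band_probability(3))
print(f"{expected_excursions:.1f}")  # prints: 17.1
```

Under these assumptions a three-sigma band leaves roughly one dimension in eight outside its band on any single observation, which indicates how much of the residual error the error-correcting layer would need to absorb.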

Helper-data length is comparable to sketch length and is chosen so that the offset can be represented at the resolution required to restore band alignment. Error-correcting code parameters, including code rate and minimum distance, are selected to absorb the residual band-boundary excursions that the helper-data offset does not eliminate. Acquisition tier is a discrete parameter that selects among pre-characterized noise envelopes; a higher tier corresponds to a sensor and capture protocol with smaller noise standard deviation, permitting narrower bands and therefore higher discriminative power at fixed sketch length. Re-enrollment cadence is a temporal parameter that bounds the period over which a single sketch is treated as valid, accommodating slow physiological drift without compromising stability within the cadence window.
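By way of non-limiting illustration, the selection of error-correcting code parameters can be related to the false-reject rate as follows. The sketch assumes a binary code decoded up to t = floor((d - 1) / 2) errors for minimum distance d, one sketch bit per dimension, and independent bit errors; the function names and the parameter values are hypothetical.

```python
import math

def correctable_errors(min_distance):
    """Errors correctable by bounded-distance decoding of a code with this minimum distance."""
    return (min_distance - 1) // 2

def false_reject_probability(n_bits, t, bit_error_rate):
    """P(more than t of n_bits flip), assuming independent flips."""
    accept = sum(
        math.comb(n_bits, i)
        * bit_error_rate ** i
        * (1 - bit_error_rate) ** (n_bits - i)
        for i in range(t + 1)
    )
    return 1.0 - accept

# Hypothetical operating point: 128-bit sketch, minimum distance 41, and a
# per-bit error rate of 5 percent remaining after the helper-data offset.
t = correctable_errors(41)
print(t, false_reject_probability(128, t, 0.05))
```

A lower code rate buys a larger minimum distance and hence a lower false-reject rate, at the cost of fewer information bits per sketch.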

Alternative Embodiments

In a first alternative embodiment, the banding is non-uniform: band widths vary along the feature axis to track the local density of the enrollment population, so that frequently occupied regions of the feature space are partitioned more finely than sparsely occupied regions. In a second alternative embodiment, the helper data is encrypted under a key derived from a secondary factor, so that sketch reproduction requires both the biological observation and the secondary factor; this embodiment is suited to deployments requiring two-factor binding at the sketch layer.

In a third alternative embodiment, the sketch is computed as a locality-sensitive hash of the feature vector rather than as a banded quantization, with helper data taking the form of a randomization seed and a small set of correction bits; this embodiment trades exact reproducibility for greater tolerance to feature-vector rotation and scaling. In a fourth alternative embodiment, sketches from multiple modalities, such as a fingerprint sketch, an iris sketch, and a cardiac-waveform sketch, are concatenated and then jointly quantized, producing a fused sketch whose stability exceeds that of any single modality. In a fifth alternative embodiment, the sketching mechanism operates on a streaming feature, such as a continuous electrocardiogram, by maintaining a rolling sketch over a sliding window and emitting a sketch only when the rolling estimate has converged within a configured tolerance.
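By way of non-limiting illustration, the streaming embodiment can be sketched for a single feature dimension as follows. The class name, the use of the population standard deviation of the window as the convergence test, and the numeric values are assumptions made for the example; the disclosure does not prescribe a particular convergence criterion.

```python
from collections import deque
import math
import statistics

class RollingSketcher:
    """Emit a band index only once the sliding window has settled."""

    def __init__(self, window, tolerance, band_width):
        self.samples = deque(maxlen=window)  # oldest sample is dropped automatically
        self.tolerance = tolerance
        self.band_width = band_width

    def push(self, value):
        """Add one sample; return a band index, or None while not converged."""
        self.samples.append(value)
        if len(self.samples) < self.samples.maxlen:
            return None                      # window not yet full
        if statistics.pstdev(self.samples) > self.tolerance:
            return None                      # rolling estimate still unstable
        return math.floor(statistics.mean(self.samples) / self.band_width)

sketcher = RollingSketcher(window=3, tolerance=0.05, band_width=0.25)
print([sketcher.push(v) for v in (0.90, 0.30, 0.31, 0.32)])
# prints: [None, None, None, 1]
```

The first full window still contains the outlying sample 0.90 and is suppressed; a band index is emitted only after that sample has left the window.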

Composition with Adjacent Mechanisms

The stable sketch is consumed by a domain-separated hash that produces a per-context identifier, by a fusion stage that combines sketches across modalities, and by a continuity check that verifies the present sketch against a stored anchor. Helper data is consumed only at the sketching stage; it is never forwarded to downstream consumers, and downstream consumers operate solely on the sketch. The mechanism composes with revocation: when an identity must be revoked, the helper data is invalidated, rendering future observations unable to reproduce the prior sketch even though the biological source is unchanged. The mechanism composes with renewal: a new helper-data record can be generated from a fresh enrollment, producing a new sketch that is unlinkable to the prior sketch under the privacy guarantee of the helper-data construction.
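By way of non-limiting illustration, a domain-separated hash that derives a per-context identifier from the sketch can be sketched as follows. The choice of SHA-256, the length-prefixed context encoding, and the context names are assumptions made for the example; a keyed construction such as HMAC may be preferred in a deployment.

```python
import hashlib

def context_identifier(sketch, context):
    """Hash the sketch under a context label so identifiers differ per context."""
    label = context.encode("utf-8")
    digest = hashlib.sha256()
    digest.update(len(label).to_bytes(2, "big"))  # length prefix keeps label and sketch unambiguous
    digest.update(label)
    digest.update(sketch)
    return digest.hexdigest()

sketch = bytes([0x5A, 0xC3, 0x0F, 0x99])  # hypothetical 32-bit sketch
payments_id = context_identifier(sketch, "payments")
access_id = context_identifier(sketch, "access-control")
print(payments_id != access_id)  # prints: True
```

Only the sketch enters the hash; the helper data is not an input, consistent with its being confined to the sketching stage.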

Distinction Over Prior Art

Prior fuzzy-extractor constructions establish that helper data can be published without disclosing the underlying secret, but those constructions presuppose a secret with high min-entropy and a tightly bounded error model. The present mechanism extends fuzzy extraction to biological features whose entropy is unevenly distributed across dimensions and whose error model varies with acquisition tier; the per-dimension band-width calibration and the tier-indexed noise envelope are the structural extensions that make the construction practical for deployed biometrics. Prior cancellable-biometric schemes apply a non-invertible transform to the feature vector but do not provide noise tolerance through helper data; they trade matching accuracy for revocability. The present mechanism preserves matching accuracy through the helper-data offset and obtains revocability through helper-data invalidation, separating the two concerns at the structural level.

Failure Modes and Mitigations

Three failure modes are identified and addressed at the structural level. The first is sensor drift, in which the noise envelope of a deployed sensor diverges over time from the envelope assumed at enrollment. Drift is detected by periodic measurement of the per-dimension noise standard deviation against a sentinel signal of known properties; when drift exceeds a configured fraction of the band width, re-enrollment is triggered before the false-reject rate has degraded perceptibly. The second failure mode is helper-data leakage, in which an adversary with access to many helper-data records attempts to recover the underlying feature distribution by aggregation. Leakage is bounded by the entropy budget of the helper-data construction: the per-dimension offset reveals only the position of a feature value within its band, never the band itself, and the band width is calibrated so that this is insufficient for distributional reconstruction across the population. The third failure mode is correlated noise, in which the noise across feature dimensions is not independent as the band-width calibration assumes. Correlation is detected by periodic estimation of the cross-dimensional covariance, and when significant correlation is observed the feature transform is rotated into a basis in which the residual covariance is diagonal, restoring the independence assumption that the banding relies upon.
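By way of non-limiting illustration, the drift check and the decorrelating rotation can be sketched as follows. Drift is taken here as the absolute change in noise standard deviation measured on the sentinel signal, and the rotation is shown for two dimensions only, using the closed-form angle that zeroes the off-diagonal covariance; the function names and values are assumptions made for the example.

```python
import math
import statistics

def drift_exceeded(sentinel_readings, enrolled_sigma, band_width, max_fraction):
    """True when the noise std has moved by more than max_fraction of the band width."""
    measured_sigma = statistics.stdev(sentinel_readings)
    return abs(measured_sigma - enrolled_sigma) > max_fraction * band_width

def decorrelating_angle(var_x, var_y, cov_xy):
    """Rotation angle that diagonalizes the 2x2 covariance [[var_x, cov_xy], [cov_xy, var_y]]."""
    return 0.5 * math.atan2(2 * cov_xy, var_x - var_y)

def rotate(x, y, angle):
    """Express the point (x, y) in the basis rotated by angle."""
    c, s = math.cos(angle), math.sin(angle)
    return c * x + s * y, -s * x + c * y

# Hypothetical sentinel readings; enrollment assumed sigma 0.05 and band width 0.15.
readings = [1.0, 1.1, 0.9, 1.0, 1.0]
print(drift_exceeded(readings, 0.05, 0.15, max_fraction=0.1))  # prints: True
print(drift_exceeded(readings, 0.05, 0.15, max_fraction=0.2))  # prints: False
```

For more than two dimensions the same effect would be obtained from an eigendecomposition of the full covariance matrix, which is outside the scope of this sketch.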

A fourth concern, presentation attacks, is addressed adjacent to but not within the sketching mechanism: liveness detection and capture-protocol enforcement operate before feature extraction, ensuring that the input to the sketching mechanism is a genuine biological signal rather than a replayed or synthesized one. The sketching mechanism's contribution is to ensure that, given a genuine signal, the resulting sketch is stable, private, and revocable; presentation-attack resistance is the responsibility of the acquisition stage and is composed with the sketching mechanism rather than implemented within it.

Disclosure Scope

The disclosure encompasses the sketching mechanism, the helper-data construction, the per-dimension band-width calibration, the tier-indexed noise envelope, and the compositions described above. The disclosure is not limited to any specific biological modality, sensor technology, feature transform, or error-correcting code family; the structural relationships among feature vector, banding, helper data, and sketch are the subject matter, and any combination of components that realizes those relationships falls within the scope of the disclosure. Implementations in fixed-function hardware, in general-purpose processors, in trusted execution environments, and in distributed systems where enrollment and verification occur on different devices are all contemplated.