Semantic Discovery for Scientific Research

Nick Clark

Regulatory Framework

The governance of scientific research now spans funder mandates, publisher policies, professional codes, and an increasingly explicit set of community norms around integrity and reproducibility. The 2022 OSTP "Nelson Memo" directed U.S. federal funding agencies to make peer-reviewed publications and supporting data immediately available without embargo, and the resulting NIH Public Access Policy update (effective July 2025) and NSF Public Access Plan 2.0 bind grantees to deposit accepted manuscripts in approved repositories with persistent identifiers and machine-readable metadata. The European counterpart, expressed through Plan S, the EOSC Federation Handbook, and Horizon Europe grant conditions, requires open access and FAIR-aligned data management plans for all funded outputs.

Authorship and integrity are governed by ICMJE recommendations, the COPE core practices, the Singapore Statement on Research Integrity, and discipline-specific codes such as the ACM Code of Ethics and Professional Conduct, the APA Ethics Code, and the ICMJE-aligned policies of major biomedical publishers. Retractions and corrections are surfaced through Retraction Watch and the Crossref Retraction Watch dataset, which since 2023 has been freely available as a structured feed. The legal status of mass-scraped literature collections, Sci-Hub being the obvious case, sits on top of all of this, complicating any system that purports to ingest "all of the literature" without rights traceability.

Threaded through the regulatory layer is FAIR: findable, accessible, interoperable, reusable. FAIR is not a statute but a normative reference that funders, publishers, and infrastructure providers increasingly cite as a contractual standard. ORCID identifiers make authorship findable; DataCite DOIs make datasets citable; ROR identifiers disambiguate institutions; Crossref Event Data exposes citation and link relationships. The framework is mature. The instruments that operate over it (search engines, databases, and AI assistants) have not caught up.

Architectural Requirement

A discovery system that honors the regulatory framework must do more than return documents. It must preserve identity (ORCID, ROR, DOI) through every step of the inquiry; it must distinguish FAIR-compliant outputs from unstructured artifacts; it must respect the trust gradient between peer-reviewed publications, preprints, registered reports, and gray literature; it must surface retraction status as a first-class signal; and it must produce a lineage that an auditor, a co-author, or a regulator can replay.

The architectural unit therefore cannot be the query string. It must be a discovery object: a typed, persistent record that carries the researcher's question, the accumulated context of prior results, the trust scope under which the inquiry is operating, and the lineage of every traversal step. Each step of discovery (finding a candidate, evaluating it, accepting or rejecting it, updating the question) is a structural transition on that object, not an ephemeral interaction with a search box. This is what semantic discovery means in practice: search, inference, and execution unified into a single governed traversal whose internal state is fully observable.
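A minimal sketch of such a discovery object follows, assuming a Python representation. The names DiscoveryObject, TraversalStep, TrustClass, and transition are hypothetical illustrations rather than a published schema, and the DOI shown in the usage note is a placeholder.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import FrozenSet, List


class TrustClass(Enum):
    # Hypothetical labels for the trust gradient described in the text.
    PEER_REVIEWED = "peer_reviewed"
    REGISTERED_REPORT = "registered_report"
    PREPRINT = "preprint"
    GRAY = "gray"


# Assumed vocabulary of structural transitions; not a fixed standard.
ACTIONS = {"propose", "evaluate", "accept", "reject", "update_question"}


@dataclass(frozen=True)
class TraversalStep:
    action: str      # one of ACTIONS
    subject: str     # an identifier such as a DOI, or the new question text
    timestamp: str   # ISO 8601, UTC


@dataclass
class DiscoveryObject:
    question: str
    trust_scope: FrozenSet[TrustClass]
    context: List[str] = field(default_factory=list)            # accepted identifiers
    lineage: List[TraversalStep] = field(default_factory=list)  # every step, in order

    def transition(self, action: str, subject: str) -> TraversalStep:
        """Apply one step to the object's state and record it in the lineage."""
        if action not in ACTIONS:
            raise ValueError(f"unknown action: {action}")
        if action == "accept":
            self.context.append(subject)
        elif action == "update_question":
            self.question = subject
        step = TraversalStep(action, subject, datetime.now(timezone.utc).isoformat())
        self.lineage.append(step)
        return step
```

In this sketch, calling obj.transition("accept", "10.5555/example.1") both updates the accumulated context and appends a lineage entry, so no state change happens outside the recorded history.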

The architectural requirement also implies provenance at the claim level. A literature review that synthesizes findings from twelve sources must be able to attribute every clause of the synthesis to the specific source and traversal step that produced it. This is the requirement that ICMJE authorship norms, COPE integrity practices, and the FAIR reusability principle jointly impose, and it is precisely the requirement that ungrounded LLM summarization cannot satisfy.
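Claim-level provenance can be pictured as a synthesis that refuses to render unless every clause names its source and traversal step. The sketch below assumes that shape; Clause, unattributed, and render are hypothetical names, and the bracketed citation format is illustrative only.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Clause:
    text: str
    source_id: Optional[str]    # e.g. a DOI; None means no source recorded
    step_index: Optional[int]   # position of the producing step in the lineage


def unattributed(clauses: List[Clause]) -> List[Clause]:
    """Return the clauses that lack a source or a traversal step."""
    return [c for c in clauses if c.source_id is None or c.step_index is None]


def render(clauses: List[Clause]) -> str:
    """Join clauses into a synthesis, failing if any clause is ungrounded."""
    missing = unattributed(clauses)
    if missing:
        raise ValueError(f"{len(missing)} clause(s) lack provenance")
    return " ".join(f"{c.text} [{c.source_id}; step {c.step_index}]" for c in clauses)
```

The design choice here is that an ungrounded clause is an error, not a warning: interpolated text cannot reach the write-up without an attribution attached.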

Why Procedural Compliance Fails

The procedural model of literature discovery (keyword search, manual screening, citation chasing, and a bibliographic manager) was designed for a publication landscape an order of magnitude smaller than today's. PubMed indexes more than 36 million records; arXiv exceeds 2.5 million preprints; bioRxiv and medRxiv together add thousands more each month; OpenAlex catalogs more than 250 million scholarly works. Manual procedure cannot scale to this volume, and the prevailing accommodation, Boolean expert searches followed by inclusion/exclusion at title-and-abstract level, is well documented to miss between 15 and 40 percent of relevant work in systematic reviews, particularly across disciplinary boundaries.

AI-assisted search, in its current form, recapitulates the same limits at higher speed. An LLM-powered research assistant reformulates the query, retrieves keyword-matched results from the underlying index, and summarizes them. The discovery surface is unchanged; only the navigation is more fluent. Worse, the synthesis produced by such assistants typically lacks claim-level provenance: the researcher receives a paragraph that blends summary with interpolation, and the only available trust signal is the model's stylistic confidence. This violates ICMJE attribution norms, defeats COPE integrity expectations, and produces synthesis that is not reusable in the FAIR sense because it is not reproducible.

Procedural compliance also fails the integrity-signal requirement. Retraction Watch records thousands of retractions per year, and the lag from publication to retraction can exceed five years. A keyword search that surfaces a retracted paper alongside current literature, with no structural retraction flag, cannot satisfy the publisher and funder norms now emerging around citation hygiene. Sci-Hub-derived corpora compound the problem by stripping rights and provenance metadata entirely.

What the AQ Primitive Provides

Semantic discovery, as a structural primitive, replaces retrieval with governed traversal across a typed knowledge graph. The primitive operates over a discovery object that persists across sessions and accumulates state. The object's fields include the active research question, the trust scope under which traversal proceeds, the corpus of accepted claims with their attached identifiers, the open hypotheses, and the full lineage of traversal steps with timestamps and provenance.

Each traversal step is a single governed operation. The system proposes a candidate (a paper, a dataset, a preprint, an institutional report) surfaced through semantic neighborhood rather than keyword match. The discovery object's trust policy evaluates the candidate: is it peer-reviewed, registered, retracted, preprinted, or gray? Does its identifier resolve through Crossref or DataCite? Does the author's ORCID match the institutional ROR? If the candidate passes the trust gate, its claims are extracted and reconciled against the existing claim corpus, with conflicts surfaced for the researcher rather than silently averaged.
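The trust gate and the reconciliation step might look like the following sketch. It assumes the identifier, retraction, and affiliation checks have already been resolved into booleans by upstream lookups (no Crossref, DataCite, ORCID, or ROR API is called here), and Candidate, trust_gate, and reconcile are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set, Tuple


@dataclass(frozen=True)
class Candidate:
    identifier: str
    trust_class: str            # "peer_reviewed", "registered_report", "preprint", "gray"
    retracted: bool             # assumed to come from a retraction feed
    identifier_resolves: bool   # assumed result of a DOI resolution check
    orcid_matches_ror: bool     # assumed result of an affiliation check


def trust_gate(candidate: Candidate, allowed: Set[str]) -> Tuple[bool, List[str]]:
    """Return (passed, reasons); every failed check is reported, not just the first."""
    reasons = []
    if candidate.retracted:
        reasons.append("retracted")
    if candidate.trust_class not in allowed:
        reasons.append(f"trust class '{candidate.trust_class}' outside scope")
    if not candidate.identifier_resolves:
        reasons.append("identifier does not resolve")
    if not candidate.orcid_matches_ror:
        reasons.append("ORCID/ROR affiliation mismatch")
    return (not reasons, reasons)


Claim = Tuple[str, str, str]  # (claim key, asserted value, source identifier)


def reconcile(corpus: Dict[str, Tuple[str, str]], new_claims: Iterable[Claim]) -> list:
    """Add non-conflicting claims to the corpus and return conflicts for review."""
    conflicts = []
    for key, value, source in new_claims:
        if key in corpus and corpus[key][0] != value:
            # Surface the disagreement; the existing claim is left untouched.
            conflicts.append((key, corpus[key], (value, source)))
        else:
            corpus.setdefault(key, (value, source))
    return conflicts
```

Returning the list of reasons, instead of a bare boolean, gives the lineage something concrete to record about why a candidate was rejected. Likewise, reconcile never overwrites an existing claim, so a conflict stays visible until the researcher resolves it.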

Trust scoping is explicit. A researcher conducting a clinical evidence review can constrain traversal to peer-reviewed and registered-report sources, with preprints surfaced separately as a candidate set requiring explicit promotion. A researcher exploring a frontier topic can widen the scope to include preprints and conference papers, with the trust class recorded against every accepted claim so that the eventual synthesis carries the gradient. Traversal across disciplinary boundaries, the protein-folding-meets-graph-theory case, is enabled because the neighborhood is semantic, but the trust policy applies uniformly across the boundary.
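One way to picture explicit scoping is a router that splits candidates into an in-scope set and a set awaiting promotion. This is a sketch under assumptions: the scope names, the string labels, and the choice to send every out-of-scope class (not only preprints) to the promotion set are illustrative.

```python
# Hypothetical scope definitions for the two scenarios described above.
CLINICAL_SCOPE = frozenset({"peer_reviewed", "registered_report"})
FRONTIER_SCOPE = CLINICAL_SCOPE | {"preprint", "conference_paper"}


def route(candidates, scope):
    """Split (identifier, trust_class) pairs into in-scope and needs-promotion lists."""
    in_scope, needs_promotion = [], []
    for identifier, trust_class in candidates:
        # The trust class travels with the record so the synthesis keeps the gradient.
        record = {"id": identifier, "trust_class": trust_class}
        if trust_class in scope:
            in_scope.append(record)
        else:
            needs_promotion.append(record)
    return in_scope, needs_promotion


def promote(record, in_scope, needs_promotion):
    """Move a record into scope by explicit researcher decision, marking it as promoted."""
    needs_promotion.remove(record)
    in_scope.append({**record, "promoted": True})
```

A promoted preprint keeps its original trust_class and gains a promoted marker, so the eventual synthesis can still distinguish it from sources that were in scope from the start.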

Lineage is the feature that satisfies the regulatory framework. Every traversal step is signed and recorded: which candidate was surfaced, why, which trust gate it passed, which claims were extracted, which were accepted, which were rejected, and what was updated in the discovery object's state. A systematic review produced by traversal can be exported with full PRISMA-compatible flow and an auditable lineage of every inclusion and exclusion. A funder reviewing a deliverable can replay the discovery object to verify that the synthesis it produced is supported by its lineage. A co-author can reopen the object weeks later and continue the inquiry from precisely the state in which it was suspended.
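A signed, replayable lineage can be approximated with a hash-chained log. The sketch below uses an HMAC from the Python standard library as a stand-in for whatever signature scheme a real deployment would choose (a shared-key HMAC is an assumption; an asymmetric signature would suit third-party audit better), and append_step, verify, and replay are hypothetical names.

```python
import hashlib
import hmac
import json


def _sign(key: bytes, prev: str, event: dict) -> str:
    # Each signature covers the previous signature, chaining the entries together.
    payload = json.dumps({"prev": prev, "event": event}, sort_keys=True).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def append_step(lineage: list, key: bytes, event: dict) -> None:
    """Append an event, linked to and signed over the preceding entry."""
    prev = lineage[-1]["signature"] if lineage else ""
    lineage.append({"prev": prev, "event": event, "signature": _sign(key, prev, event)})


def verify(lineage: list, key: bytes) -> bool:
    """Check that no entry was altered, removed, or reordered."""
    prev = ""
    for entry in lineage:
        expected = _sign(key, prev, entry["event"])
        if entry["prev"] != prev or not hmac.compare_digest(expected, entry["signature"]):
            return False
        prev = entry["signature"]
    return True


def replay(lineage: list) -> set:
    """Rebuild the set of included candidates from accept and reject events."""
    included = set()
    for entry in lineage:
        event = entry["event"]
        if event["action"] == "accept":
            included.add(event["candidate"])
        elif event["action"] == "reject":
            included.discard(event["candidate"])
    return included
```

An auditor holding the key would run verify first and replay second: the first establishes that the record is intact, the second reproduces the inclusion set that the synthesis claims to rest on.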

Compliance Mapping

The discovery object's structure maps onto the regulatory and normative stack with no significant gaps. NIH Public Access and NSF Public Access Plan deposit obligations are satisfied at the artifact level by the underlying repositories; the discovery primitive consumes those artifacts and preserves their identifiers in lineage. FAIR's findability is satisfied through DOI, ORCID, and ROR resolution at every traversal step; accessibility is satisfied because the discovery object records the access route used; interoperability is satisfied because the object itself is FAIR-compliant typed data; reusability is satisfied because the lineage is replayable.

ICMJE authorship and contribution norms are supported by the lineage's recording of which traversal steps each contributor performed, addressing the COPE expectation that contribution be traceable. COPE core practices around integrity, peer review, and post-publication correction map directly onto the trust-scope and retraction-signal mechanisms: a paper retracted after acceptance into the discovery object emits a lineage event that flags every downstream claim derived from it. The ACM Code of Ethics and analogous discipline-specific codes are honored because the discovery primitive does not produce ungrounded claims; every output is attributable.
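The retraction signal amounts to a reachability query over the claim dependency graph. A minimal sketch, assuming each claim records the sources or earlier claims it was derived from; flag_downstream and the identifiers in the usage note are hypothetical.

```python
from collections import deque
from typing import Dict, Set


def flag_downstream(derived_from: Dict[str, Set[str]], retracted: str) -> Set[str]:
    """Return every claim that depends, directly or transitively, on the retracted source.

    derived_from maps a claim id to the ids (sources or other claims) it was derived from.
    """
    # Invert the mapping so we can walk from a parent to the claims built on it.
    children: Dict[str, Set[str]] = {}
    for claim, parents in derived_from.items():
        for parent in parents:
            children.setdefault(parent, set()).add(claim)

    flagged: Set[str] = set()
    queue = deque([retracted])
    while queue:  # breadth-first walk from the retracted source
        node = queue.popleft()
        for child in children.get(node, ()):
            if child not in flagged:
                flagged.add(child)
                queue.append(child)
    return flagged
```

For example, with derived_from = {"c1": {"doi:A"}, "c2": {"c1"}, "c3": {"doi:B"}}, retracting "doi:A" flags c1 and c2 and leaves c3 untouched.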

The European Open Science Cloud's federation model is naturally supported because discovery objects are portable across infrastructures; an object initiated against an EOSC node can traverse into U.S. repositories and back, with the trust scope and rights metadata preserved. The Crossref and DataCite event-data feeds become traversal substrate. Retraction Watch becomes a trust-policy input. Sci-Hub-derived material is excluded by trust policy because it lacks the rights and identifier provenance the policy requires; a researcher who needs a particular paywalled paper is routed through their institution's licensed access path, with the access route recorded.

Adoption Pathway

Adoption begins at the individual researcher level. A scientist initiates a discovery object for an active inquiry (a literature review, a grant proposal background section, a manuscript revision) and conducts subsequent search through that object. The immediate benefit is continuity across sessions and claim-level provenance in the eventual write-up. Existing tools (Zotero, EndNote, BibTeX) accept exports from the discovery object's accepted-claim corpus, so no workflow disruption is required.

The second phase is laboratory or research-group adoption. A principal investigator standardizes on discovery objects for systematic reviews, scoping reviews, and grant-background work. The lineage produced by the objects becomes part of the lab's research record, available to co-authors and reviewers. PRISMA-compliant systematic reviews are produced as a byproduct of governed traversal rather than as a separate documentation exercise. Cross-disciplinary collaborations benefit immediately because trust scoping carries cleanly across institutional boundaries.

The third phase is institutional and funder adoption. A university research office, a journal publisher, or a federal funder accepts discovery-object lineage as part of submission packages. A funder reviewing a renewal can replay the lineage of the prior cycle's literature work; a journal handling editor can verify that a manuscript's literature synthesis is traceable; an institutional integrity office investigating a misconduct allegation has a structured artifact to examine. At that stage, semantic discovery is no longer a productivity tool overlaid on top of incompatible search systems; it is the discovery layer that ORCID, DataCite, FAIR, and the public-access policies have been waiting for, completing a stack whose lower layers are already in place.