Clinical AI That Pauses When It Should Not Act

by Nick Clark | Published March 27, 2026

Clinical AI systems produce recommendations regardless of their confidence level. A diagnostic AI with sixty percent confidence in a rare condition produces the same structured output as one with ninety-eight percent confidence in a common condition. The clinician receives both as recommendations, distinguished only by a probability score that may not reflect the system's true uncertainty. This pattern is structurally unsafe: it inverts the clinical reasoning principle that uncertainty must drive inquiry rather than action, and it places the burden of recognizing model limitations on the very clinician whose attention the model was deployed to spare. Confidence governance, as implemented in the Adaptive Query primitive, makes refusal-to-act a structural state of the agent. When computed confidence falls below the threshold appropriate to the decision's clinical consequence, the agent does not emit a low-confidence recommendation. It enters inquiry mode and produces a structured request for the additional information required to act confidently, in alignment with FDA's Predetermined Change Control Plan guidance, IEC 62304 Class C software lifecycle requirements, ISO 14971 risk management, and Joint Commission expectations for clinical decision support.


Regulatory Framework

Clinical AI sits at the intersection of medical device regulation, quality system regulation, and clinical practice oversight. The FDA's 2024 final guidance on Predetermined Change Control Plans (PCCPs) for AI/ML-enabled Software as a Medical Device (SaMD) establishes the agency's expectation that manufacturers anticipate model evolution and bound it within a pre-submitted modification protocol. The PCCP must specify not only what changes are permitted but what monitoring, performance, and abstention behaviors the device will exhibit when its inputs or outputs drift outside the validated envelope. Refusal-to-act is no longer an optional safety feature; it is an expected behavior of any AI/ML SaMD whose performance degrades on out-of-distribution inputs.

FDA 21 CFR Part 820 (the Quality System Regulation, transitioning in February 2026 to the Quality Management System Regulation, which incorporates ISO 13485 by reference) requires design controls, risk-based verification and validation, and corrective and preventive action processes for medical devices including software. IEC 62304 specifies the software lifecycle for medical device software and assigns a safety class (A, B, or C) based on potential harm. Clinical decision support software whose failure could contribute to death or serious injury is Class C, requiring full architectural design, detailed verification, and traceable risk control measures for every identified hazard.

ISO 14971 (Medical Device Risk Management) governs the analytical framework. Hazard analysis must enumerate foreseeable misuse, including automation bias and uncritical acceptance of model outputs by time-pressured clinicians. Risk control measures must be inherently safe by design where possible, protective by mechanism where not, and informational only as a last resort. A confidence score on a recommendation is, in 14971 terms, an informational risk control: it relies on the user reading and correctly interpreting the score. Refusal-to-act is a protective measure: the system structurally prevents the unsafe output from being produced.

The European Medical Device Regulation (EU MDR 2017/745) imposes parallel obligations: Article 10 sets out manufacturers' general obligations, and the Annex I general safety and performance requirements explicitly require manufacturers to minimize risks associated with use error, including those arising from reasonably foreseeable misuse. The Joint Commission's standards for clinical decision support (LD.04.03.09 and related elements of performance) require organizations to govern CDS deployment with attention to alert fatigue, override patterns, and the appropriate scope of automated recommendations. ANSI/AAMI HE75 (Human Factors Engineering) prescribes design principles for medical device user interfaces, including the principle that ambiguous information states must be presented as ambiguous rather than collapsed to a default value the user is likely to accept.

Architectural Requirement

These instruments converge on a structural requirement that goes beyond confidence display. The clinical AI must (1) compute an honest, calibrated assessment of its competence on the specific case at hand, not a generic model confidence; (2) compare that assessment against a threshold appropriate to the clinical consequence of the decision being made; (3) refuse to emit a recommendation when the threshold is not met, in a manner that the clinician cannot override silently; (4) substitute, in the refused case, a structured artifact that supports the clinician's reasoning rather than terminating it; and (5) record the refusal as a first-class clinical event for safety monitoring and post-market surveillance.
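
To make the requirement concrete, here is a minimal sketch of how the five steps might compose at inference time. All names (evaluate_case, the gate and audit_log objects, GovernanceState) are illustrative assumptions, not the AQ primitive's actual API.

    # Illustrative sketch of the five-step structural requirement.
    # All names are hypothetical; the AQ primitive's real API may differ.
    from dataclasses import dataclass
    from enum import Enum, auto

    class GovernanceState(Enum):
        READY_TO_ACT = auto()
        READY_WITH_QUALIFICATION = auto()
        NON_EXECUTING_WITH_INQUIRY = auto()

    @dataclass(frozen=True)
    class GateDecision:
        state: GovernanceState
        threshold_in_force: float
        calibrated_confidence: float

    def evaluate_case(case, model, gate, audit_log):
        # (1) Case-specific, calibrated competence assessment, not the
        #     model's raw self-reported probability.
        confidence = gate.calibrated_confidence(case, model)
        # (2) Threshold appropriate to the clinical consequence.
        threshold = gate.threshold_for(case.decision_class)
        if confidence >= threshold:
            decision = GateDecision(
                GovernanceState.READY_TO_ACT, threshold, confidence)
            output = model.recommend(case)
        else:
            # (3) Structural refusal: no recommendation is emitted.
            decision = GateDecision(
                GovernanceState.NON_EXECUTING_WITH_INQUIRY, threshold, confidence)
            # (4) An inquiry artifact replaces the recommendation.
            output = gate.inquiry_artifact(case, confidence, threshold)
        # (5) Every outcome, including refusal, is a recorded first-class event.
        audit_log.record(case.fingerprint, decision, output)
        return decision, output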

Calibration is the load-bearing requirement. A model that emits a confidence score of 0.8 must be correct on approximately 80 percent of the cases for which it emits that score. Modern deep learning models are notoriously miscalibrated, often producing high confidence on out-of-distribution inputs, precisely the cases where refusal is most important. The architecture must therefore include calibration diagnostics that operate at inference time, drawing on input-quality signals, distribution distance from the training set, and consistency with patient history rather than relying on the model's self-reported probability alone.
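
Calibration can be audited empirically. The sketch below computes expected calibration error (ECE), a standard diagnostic: predictions are binned by reported confidence, and each bin's mean confidence is compared with its observed accuracy. The variable names are illustrative; the computation itself is standard.

    import numpy as np

    def expected_calibration_error(confidences, correct, n_bins=10):
        """ECE: confidence-weighted gap between reported confidence and
        observed accuracy, binned over [0, 1]."""
        confidences = np.asarray(confidences, dtype=float)
        correct = np.asarray(correct, dtype=float)
        bins = np.linspace(0.0, 1.0, n_bins + 1)
        ece = 0.0
        for lo, hi in zip(bins[:-1], bins[1:]):
            mask = (confidences > lo) & (confidences <= hi)
            if not mask.any():
                continue
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += (mask.sum() / len(confidences)) * gap
        return ece

    # A well-calibrated model that says 0.8 should be right about 80% of
    # the time; an ECE near zero on held-out cases supports that claim.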

Threshold selection must reflect clinical consequence rather than statistical convenience. A screening recommendation that prompts further evaluation tolerates lower confidence than a treatment recommendation that would commit the patient to a high-risk intervention. The threshold structure must be governed: clinical leadership must set thresholds, change them through a controlled process, and have a record of the threshold in force at the moment any decision was made or refused.
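
A governed threshold structure might be as simple as a signed, versioned table keyed by decision class. The classes and values below are invented for illustration.

    # Hypothetical governed threshold table. Thresholds track clinical
    # consequence, not statistical convenience, and each version is signed
    # and dated so the threshold in force at any decision time can be
    # reconstructed.
    THRESHOLDS = {
        "screening_flag":        0.70,  # prompts further evaluation only
        "diagnostic_suggestion": 0.85,
        "treatment_commitment":  0.95,  # commits patient to high-risk intervention
    }
    THRESHOLDS_META = {
        "version": 3,
        "approved_by": "clinical governance board",
        "effective": "2026-03-01",
    }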

Why Procedural Compliance Fails

The dominant procedural response to clinical AI uncertainty is to display a confidence score alongside every recommendation and to train clinicians to weight the score appropriately. This response fails for three reasons. First, confidence scores are model outputs, not governance states. A model that reports eighty percent confidence may be poorly calibrated, producing eighty percent confidence scores for cases where its actual accuracy is fifty percent. The confidence score is a number generated by the model, not a structural assessment of whether the system should be acting. Procedural compliance treats this number as information when, under ISO 14971, it should be treated as an informational risk control of the lowest reliability tier.

Second, automation bias is documented and durable. A growing body of evidence in radiology, pathology, dermatology, and emergency triage demonstrates that clinicians anchor on AI recommendations even when shown low confidence scores, even when warned about model limitations, and even when the recommendations conflict with their initial impressions. A clinical AI that always produces a recommendation, even when it should be saying it does not know, trains clinicians to expect and accept recommendations. The system's willingness to guess becomes the clinician's willingness to follow the guess. This is not a training failure; it is a predictable consequence of the cognitive load under which clinicians operate.

Third, the standard mitigation, suppressing outputs below a confidence threshold, treats confidence as a display filter rather than a governance mechanism. The model still ran. It still consumed the input data. It simply did not display its output for this case. There is no structural mechanism ensuring the system recognized its own limitations, no record of the refusal, no inquiry artifact for the clinician, and no signal to post-market surveillance that the model encountered an input it could not handle. From the perspective of FDA's PCCP framework, suppressed outputs are invisible to the monitoring plan that justified the device's authorization.

A single scalar confidence threshold cannot capture the multi-dimensional nature of clinical uncertainty. The system may be confident in its diagnosis but uncertain about its treatment implications. It may be confident in the finding but uncertain about clinical significance for this patient given comorbidities. A single number collapses these dimensions into a scalar that loses the information clinicians need to reason about whether to follow, modify, or override the recommendation. Procedural compliance has no answer to this collapse beyond longer training modules and stronger warning labels, both of which fail under clinical workload.

What the AQ Primitive Provides

Confidence governance computes execution eligibility from a structured set of inputs: model uncertainty (with calibration diagnostics applied), input data quality, distribution distance from the training and validation cohorts, consistency with patient history and current clinical context, and the clinical significance of the decision being requested. The result is not a single score but a governance state: ready-to-act, ready-with-qualification, or non-executing-with-inquiry. The state is computed by a governed gate, not by the model itself, and the gate's logic is a signed artifact under change control with the same rigor as the model.
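
Under those assumptions, the gate's logic might reduce to something like the following sketch, reusing the hypothetical GovernanceState enum from the earlier example. The signals, weights, and cutoffs are placeholders; in a real device they would live in the signed, version-controlled gate policy.

    # Hypothetical gate logic: governance state computed from structured
    # signals, not from the model's self-reported probability alone.
    def governance_state(signals, policy):
        # signals: calibrated model uncertainty, input data quality,
        # distance from the training/validation cohorts, and consistency
        # with patient history, each normalized to [0, 1].
        penalty = (
            policy.w_uncertainty * signals.model_uncertainty
            + policy.w_quality * (1.0 - signals.input_quality)
            + policy.w_ood * signals.distribution_distance
            + policy.w_history * (1.0 - signals.history_consistency)
        )
        eligibility = 1.0 - penalty
        # Cutoffs scale with the clinical significance of the requested decision.
        act, qualify = policy.cutoffs_for(signals.clinical_significance)
        if eligibility >= act:
            return GovernanceState.READY_TO_ACT
        if eligibility >= qualify:
            return GovernanceState.READY_WITH_QUALIFICATION
        return GovernanceState.NON_EXECUTING_WITH_INQUIRY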

When the computed state is non-executing-with-inquiry, the agent does not emit a recommendation. Instead, it produces an inquiry artifact: a structured statement of what would be needed to raise confidence above the threshold for the intended action. The artifact identifies the specific tests, images, history elements, or specialist input that the model has determined would resolve its uncertainty. The clinician receives not a low-confidence guess but a clinically actionable request that mirrors the reasoning a competent colleague would offer: "I cannot reliably classify this finding without additional sequences" rather than "probable benign lesion, confidence 0.62."
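
A plausible shape for the inquiry artifact, with field names that are assumptions rather than the primitive's actual schema:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class InquiryArtifact:
        # Hypothetical structure; field names are illustrative.
        case_fingerprint: str
        intended_action: str          # e.g. "lesion classification"
        confidence: float             # calibrated confidence at refusal
        threshold: float              # threshold in force for this action
        requested_information: tuple  # tests, sequences, history, consults
        rationale: str                # clinician-readable explanation

    artifact = InquiryArtifact(
        case_fingerprint="sha256:<digest>",
        intended_action="classify incidental hepatic lesion",
        confidence=0.62,
        threshold=0.85,
        requested_information=("contrast-enhanced MRI sequences",
                               "prior imaging for comparison"),
        rationale="Cannot reliably classify this finding without "
                  "additional sequences.",
    )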

This mirrors clinical reasoning. A physician who is uncertain does not guess. They order more tests, request a specialist consult, or defer the decision until more information is available. Confidence governance gives AI the same structural capacity to defer action in the face of uncertainty, and the deferral is recorded as a first-class event that flows into post-market surveillance, allowing the manufacturer to identify input regions where the device is reaching the boundary of its competence and feeding that signal back into the PCCP.

The confidence computation integrates integrity feedback over time. If the agent's recent recommendations have been inconsistent with outcomes, its confidence baseline decreases globally or in the subgroups where inconsistency has been observed. An agent that has been wrong recently in a specific input region carries lower confidence in that region than an agent with a strong recent track record, independent of any individual model output. The feedback loop is itself a governed artifact, with the rules for incorporating outcome data signed and versioned alongside the model.
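
One plausible realization, assuming outcome labels arrive after some delay, is an exponentially weighted consistency score per input region that scales the confidence baseline. The class and constants below are illustrative.

    # Hypothetical integrity feedback: recent outcome consistency per input
    # region scales the confidence baseline in that region. The update rule
    # itself would be a signed, versioned artifact alongside the model.
    class IntegrityFeedback:
        def __init__(self, alpha=0.1):
            self.alpha = alpha     # weight on the newest outcome
            self.consistency = {}  # region -> EWMA of agreement with outcomes

        def record_outcome(self, region, agreed):
            prev = self.consistency.get(region, 1.0)
            self.consistency[region] = (
                (1 - self.alpha) * prev + self.alpha * float(agreed))

        def baseline_multiplier(self, region):
            # An agent recently wrong in a region carries lower confidence
            # there, independent of any individual model output.
            return self.consistency.get(region, 1.0)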

Refusal is a clinical event, not a UI state. Each non-executing decision produces a signed record containing the input fingerprint, the governance state, the threshold in force, the inquiry artifact emitted, and a cryptographic link to the model and policy versions active. The record satisfies IEC 62304 traceability for the safety-relevant behavior, supports ISO 14971 post-production information requirements, and provides the audit substrate that FDA inspectors and Joint Commission surveyors expect to review when assessing CDS governance.
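
A sketch of what such a record might contain. The layout and the hash-based signature below are simplifications; a production device would use proper key management and an HSM-backed signature.

    import hashlib
    import json
    from datetime import datetime, timezone

    def refusal_record(case_fingerprint, state, threshold, artifact,
                       model_version, policy_version, signing_key):
        # Hypothetical record layout for a non-executing decision.
        payload = {
            "event": "non_executing_with_inquiry",
            "case_fingerprint": case_fingerprint,
            "governance_state": state,
            "threshold_in_force": threshold,
            "inquiry_artifact": artifact,
            "model_version": model_version,    # cryptographic link to the
            "policy_version": policy_version,  # versions active at decision time
            "recorded_at": datetime.now(timezone.utc).isoformat(),
        }
        body = json.dumps(payload, sort_keys=True).encode()
        payload["signature"] = hashlib.sha256(signing_key + body).hexdigest()
        return payload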

Compliance Mapping

The primitive maps directly to the substantive requirements of the regulatory framework. FDA's PCCP guidance is satisfied by the explicit codification of refusal behavior within the modification protocol: the manufacturer specifies the threshold structure, the calibration diagnostics, and the inquiry-artifact template as part of the device description, and changes to any of these proceed under the PCCP. 21 CFR Part 820 design control requirements are satisfied by the artifact chain linking clinical risk analysis to threshold selection to gate implementation to verification evidence.

IEC 62304 Class C software lifecycle obligations are satisfied because the gate is itself a Class C software item with full architectural documentation, detailed design, unit and integration verification, and traceable risk control linkage. ISO 14971 risk control hierarchy is honored: the protective measure (structural refusal to act) replaces the informational measure (display a confidence score) for the high-consequence hazards identified in the risk file. ISO 13485 quality system documentation flows from the same artifacts.

EU MDR general safety and performance requirements (Annex I) regarding foreseeable misuse are addressed by removing the misuse vector: a clinician cannot uncritically accept a recommendation that was structurally not produced. Joint Commission CDS standards regarding alert fatigue and override governance are addressed because refusals are not interruptive alerts; they are substitutions of inquiry for recommendation, and overrides (clinician proceeding without the recommendation) are recorded events available for organizational review. ANSI/AAMI HE75 human factors principles regarding ambiguous information states are honored by presenting ambiguity as inquiry rather than as a low-confidence default.

Adoption Pathway

Adoption proceeds through a clinical-governance-led sequence aligned to existing CDS deployment practice. Phase one is a single contained domain, typically a radiology AI for a specific modality and body region or a triage CDS for a single emergency department. The clinical governance board sets confidence thresholds per decision class, drawing on the device's hazard analysis and on local clinical risk tolerance. The agent is deployed in shadow mode, producing both a recommendation and a governance state for each case while clinicians continue to make decisions on the existing pathway. The shadow period generates the calibration evidence needed to confirm threshold appropriateness and produces the inquiry-artifact templates that will be used in active deployment.
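
In shadow mode, each case might yield a paired record like the sketch below, accumulating the calibration evidence needed before the gate is activated. All names are assumptions.

    # Hypothetical shadow-mode record: the gate runs, but clinicians decide
    # on the existing pathway. Paired records accumulate the calibration
    # evidence used to confirm (or revise) thresholds before activation.
    def shadow_record(case, model, gate):
        confidence = gate.calibrated_confidence(case, model)
        return {
            "case_fingerprint": case.fingerprint,
            "model_recommendation": model.recommend(case),
            "governance_state": gate.state_for(case, confidence).name,
            "calibrated_confidence": confidence,
            "clinician_decision": None,  # filled in from the existing pathway
            "outcome": None,             # filled in when follow-up resolves
        }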

Phase two activates the governance gate. Cases below threshold no longer surface a recommendation; they surface an inquiry artifact. Clinicians retain full authority to act on their own judgment in the absence of a recommendation, and the rate, content, and outcomes of inquiry artifacts are reviewed in monthly clinical governance meetings. Calibration diagnostics and integrity-feedback signals flow into the manufacturer's PCCP monitoring plan, with model or threshold adjustments proceeding under the PCCP rather than as off-label drift.

Phase three extends the primitive across decision classes and clinical services. Treatment recommendations, screening recommendations, and triage decisions each acquire their own threshold structure under the same governance regime. The institution's CDS portfolio transitions from a collection of recommendation engines with confidence scores to a coherent set of governed agents whose refusal behavior is documented, monitored, and integrated into clinical workflow. Post-market surveillance shifts from passive complaint collection to active consumption of governance state telemetry, dramatically improving the manufacturer's ability to identify and respond to performance drift before it produces patient harm.
