Samuel (Sandy) Aronson, ALM, MA, executive director of IT and AI Solutions for Mass General Brigham Personalized Medicine and senior director of IT and AI Solutions for the Accelerator for Clinical Transformation, is the corresponding author of a paper published in NEJM AI that examined whether generative AI holds promise for improving scientific literature review of variants in clinical genetic testing. The findings could have a wide impact beyond this use case.
We tested whether generative AI can be used to identify whether scientific articles contain information that can help geneticists determine whether genetic variants are harmful to patients. While testing this work, we identified inconsistencies in generative AI that could present a risk for patients if not adequately addressed. We suggest forms of testing and monitoring that could improve safety.
We investigated whether generative AI can be used to determine: 1) whether a scientific article contains evidence that could inform a geneticist’s assessment of a genetic variant, and 2) whether any evidence found about the variant supports a benign, pathogenic, intermediate, or inconclusive conclusion.
We tested a generative AI strategy based on GPT-4 using a labeled dataset of 72 articles and compared generative AI to assessments from expert geneticists.
Generative AI performed relatively well, but further improvement is needed for most use cases. However, as we ran our tests repeatedly, we observed a phenomenon we deemed important: running the same test dataset repeatedly produced different results. By rerunning the test set over time, we characterized this variability and found that both drift (changes in model performance over time) and nondeterminism (inconsistency between consecutive runs) were present. We developed visualizations that demonstrate the nature of these problems.
If a clinical tool developer is not aware that large language models can exhibit significant drift and nondeterminism, they may run their test set once and use the results to determine whether their tool can be introduced into practice. This could be unsafe.
Our results show that it could be important to run a test set multiple times to demonstrate the degree of variability (nondeterminism) present. Our results also show that it is important to monitor for changes in performance (drift) over time.
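The repeated-run testing described above can be sketched in code. The snippet below is a minimal illustration, not the authors' actual pipeline: the `classify` function is a hypothetical stand-in for an LLM call (here simulated with seeded randomness), and the agreement metric is one simple way to quantify run-to-run nondeterminism on a fixed test set.

```python
import random

LABELS = ["benign", "pathogenic", "intermediate", "inconclusive"]

def classify(article_id, rng):
    # Hypothetical stand-in for a GPT-4 call; real LLM output can vary
    # between runs, which is what this simulation mimics.
    return rng.choice(LABELS)

def run_test_set(article_ids, seed):
    # One full pass over the labeled test set.
    rng = random.Random(seed)
    return {a: classify(a, rng) for a in article_ids}

def pairwise_agreement(run_a, run_b):
    # Fraction of articles classified identically in two runs;
    # values below 1.0 indicate nondeterminism.
    same = sum(run_a[a] == run_b[a] for a in run_a)
    return same / len(run_a)

articles = list(range(72))  # the paper's test set contained 72 articles
runs = [run_test_set(articles, seed) for seed in range(5)]
agreements = [pairwise_agreement(runs[i], runs[i + 1]) for i in range(len(runs) - 1)]
print(min(agreements), max(agreements))
```

Tracking the same agreement statistic (and per-run accuracy against expert labels) at regular intervals over weeks or months would additionally expose drift, since a hosted model's behavior can change even when the test set does not.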
Authorship: In addition to Aronson, Mass General Brigham authors include Kalotina Machini, Jiyeon Shin, Pranav Sriraman, Emma R. Henricks, Charlotte J. Mailly, Angie J. Nottage, Sami S. Amr, Michael Oates, and Matthew S. Lebo. Additional authors include Sean Hamill.
Paper cited: Aronson SJ et al. “Integrating GPT-4 Models into a Genetic Variant Assessment Clinical Workflow: Assessing Performance, Nondeterminism, and Drift in Classifying Functional Evidence from Literature.” NEJM AI. DOI: 10.1056/AIcs2400245
Disclosures: Aronson, Shin, Mailly, and Oates report research grants and similar funding via Brigham and Women’s Hospital from Better Therapeutics, Boehringer Ingelheim, Eli Lilly, Milestone Pharmaceuticals, Novo Nordisk, and PCORI. Aronson, Oates, Machini, Henricks, and Lebo report NIH funding through Mass General Brigham. Aronson reports serving as a paid consultant for Nest Genomics.