Research Spotlight: Head-to-Head Comparisons of Generative Artificial Intelligence and Internal Medicine Physicians

Jul 1, 2024

Daniel Restrepo, MD, a Mass General Brigham hospital medicine specialist and physician in the Department of Internal Medicine at Massachusetts General Hospital, recently published two research papers comparing the clinical reasoning abilities of large language models (LLMs) to those of physicians. The papers appeared in JAMA Internal Medicine and the Journal of Hospital Medicine.

What led you to research this topic?

Our studies sought to learn whether AI could improve a physician’s ability to diagnose patients. AI might offer some benefits in helping clinicians avoid some of the mental errors in clinical reasoning that can lead to a misdiagnosis. Clinical reasoning refers to the thought processes that allow doctors to reach a diagnosis and is perhaps the most important procedure that physicians perform on a daily basis.

LLMs are a form of AI that can process large amounts of information from sources like the internet and generate answers to questions that read like a conversation. How useful these models are for specific tasks is a growing area of study across the healthcare field.

How did you conduct your study?

We conducted two studies: a live comparison of how a human doctor and an LLM approached a diagnostic mystery, and a separate study comparing the reasoning skills of human doctors to those of an LLM known as GPT-4.

In the live comparison study, we wanted to compare and contrast the strengths and strategies taken by both a human internal medicine physician and an LLM, and provide insights into how this technology can help diagnosticians in the future. We provided a case of a 35-year-old man who was referred to the emergency department with low blood pressure and a fast heart rate (tachycardia), and summarized symptoms he had been experiencing – such as recent hearing loss – and tests that were ordered. We then asked the physician to explain their reasoning, and compared each step to the output of the LLM.

Separately, we conducted an investigation that compared the reasoning abilities of 21 resident and 18 attending physicians to those of GPT-4. The doctors assessed clinical cases divided into segments of information and were asked to verbalize their reasoning and differential diagnosis as they progressed through sets of clinical information. A differential diagnosis is the list of suspected diseases a physician comes up with when making a diagnosis. The answers were graded by experts in clinical reasoning who were blinded to whether a human doctor or the LLM had responded to the case segment.

What did you find?

Our live head-to-head demonstration with an internal medicine physician yielded interesting observations. Both the physician and LLM came up with the correct diagnosis of granulomatosis with polyangiitis, a rare inflammatory disease, but the two went about it in very different ways. The physician relied on clinical reasoning and diagnostic schemas for inflammatory disease categories, whereas the LLM was more focused on matching the patient’s pattern of symptoms to a diagnosis. That in turn led the LLM to be slower to incorporate new test data points introduced during the demonstration. Our comparison suggested AI may have shortcomings in reasoning capabilities.

Our investigation with residents and attending physicians found that GPT-4 performed comparably to both groups in certain measures of clinical reasoning. Specifically, GPT-4 suggested the correct diagnosis about 40% of the time, and the correct diagnosis was included in its initial list of differential diagnoses 67% of the time. GPT-4, however, had more frequent instances of incorrect clinical reasoning (~14%) compared to residents (~3%) and attendings (12.5%). Additionally, one of the central limitations of our study was that the metric used to grade reasoning rewarded verbosity, and we found that GPT-4 simply wrote more. This is important given that studies show that experts in clinical reasoning tend to include fewer, yet more salient, features when distilling what is important in making a diagnosis.

Do your findings suggest AI could one day be used for diagnosing patients?

Misdiagnosis is unfortunately quite common and affects patients worldwide. Many cases of misdiagnosis can be attributed to cognitive errors on the part of the diagnostician, which can occur if a clinician is fatigued, for example. LLMs have been shown by our research and other studies to be comparable to physicians in many aspects of knowledge and reasoning. If used correctly, they could augment our abilities as diagnosticians and help keep patients safe.

When and how generative AI is implemented will require significant future study, and it would be premature to offer these tools in patient care. There are numerous considerations to address, including accounting for biases, “hallucinations” (false information generated by the chatbots), and data safety and privacy concerns.

However, our research ultimately suggests that we in healthcare need to change the narrative of diagnosticians versus AI. Instead, we believe that the future of diagnosis is as diagnosticians alongside AI, with the technology augmenting but not replacing the clinical reasoning process.

Authorship: In addition to Restrepo, Raja-Elie Abdulnour, MD, was a Mass General Brigham co-author on both studies.

Stephanie Cabral, MD; Zahir Kanjee, MD, MPH; Philip Wilson, MD; Byron Crowe, MD; and Adam Rodman, MD, MPH, of Beth Israel Deaconess Medical Center co-authored the JAMA Internal Medicine paper, and Dr. Rodman was also a co-author on the Journal of Hospital Medicine live case comparison.

Papers cited: Cabral et al. “Clinical Reasoning of a Generative Artificial Intelligence Model Compared With Physicians” JAMA Intern Med. DOI: 10.1001/jamainternmed.2024.0295

Restrepo et al. “Conversations on Reasoning: Large Language Models in Diagnosis” J Hosp Med. DOI: 10.1002/jhm.13378

Media contact

Ryan Jaslow

Program Director, External Communications (Research)

rjaslow@mgb.org

About Mass General Brigham

Mass General Brigham is an integrated academic health care system, uniting great minds to solve the hardest problems in medicine for our communities and the world. Mass General Brigham connects a full continuum of care across a system of academic medical centers, community and specialty hospitals, a health insurance plan, physician networks, community health centers, home care, and long-term care services. Mass General Brigham is a nonprofit organization committed to patient care, research, teaching, and service to the community. In addition, Mass General Brigham is one of the nation’s leading biomedical research organizations with several Harvard Medical School teaching hospitals. For more information, please visit massgeneralbrigham.org.