A new study from Mass General Brigham researchers provides evidence that two large language models (LLMs) used for generative artificial intelligence (AI), ChatGPT-4 and Google’s Gemini, showed no differences in suggested opioid treatment regimens across patient races or sexes. Results are published in PAIN.
“I see AI algorithms in the short term as augmenting tools that can essentially serve as a second set of eyes, running in parallel with medical professionals,” said corresponding author Marc Succi, MD, strategic innovation leader at Mass General Brigham Innovation, associate chair of innovation and commercialization for enterprise radiology and executive director of the Medically Engineered Solutions in Healthcare (MESH) Incubator at Mass General Brigham. “Needless to say, at the end of the day the final decision will always lie with your doctor.”
The results in this study showcase how LLMs could reduce potential provider bias and standardize treatment recommendations when it comes to prescribing opioids to manage pain. The emergence of artificial intelligence tools in health care has been groundbreaking and has the potential to positively reshape the continuum of care. Mass General Brigham, as one of the nation’s top integrated academic health systems and largest innovation enterprises, is leading the way in conducting rigorous research on new and emerging technologies to inform the responsible incorporation of AI into care delivery, workforce support, and administrative processes.
LLMs and other forms of AI have made headway in health care, with several types of AI being tested to provide clinical judgement on imaging and patient workups, but there are also concerns that AI tools may perpetuate bias and exacerbate existing inequities.
For example, in the field of pain management, studies have shown that physicians are more likely to underestimate and undertreat pain in Black patients. Related studies on Emergency Department visits have also found that White patients are more likely to receive opioids than Black, Hispanic and Asian patients. There is concern that AI could worsen these biases in opioid prescription, which spurred Succi and his team to evaluate whether AI models show bias in their opioid treatment plans.
For this study, the researchers initially compiled 40 patient cases reporting different types of pain (e.g., back pain, abdominal pain and headaches), and removed any references to patient race and sex. They then assigned each patient case a random race from six possible categories (American Indian or Alaska Native, Asian, Black, Hispanic or Latino, Native Hawaiian or Other Pacific Islander, and White) before similarly assigning a random sex (male or female). They continued this process until all the unique combinations of race and sex were generated for each patient, resulting in 480 cases that were included in the dataset. For each case, the LLMs evaluated and assigned subjective pain ratings before making pain management recommendations.
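The size of the dataset follows from simple counting: 40 cases, each paired with 6 races and 2 sexes, gives 40 × 6 × 2 = 480 variants. The short Python sketch below illustrates that arithmetic. It is not the researchers' code: the case identifiers and the `build_dataset` helper are hypothetical, and it enumerates the combinations directly rather than drawing them at random as the study describes, which arrives at the same final set.

```python
from itertools import product

# Race and sex categories as listed in the study description.
RACES = [
    "American Indian or Alaska Native",
    "Asian",
    "Black",
    "Hispanic or Latino",
    "Native Hawaiian or Other Pacific Islander",
    "White",
]
SEXES = ["male", "female"]


def build_dataset(case_ids):
    """Pair every case with every race/sex combination (hypothetical helper)."""
    return [
        {"case_id": case_id, "race": race, "sex": sex}
        for case_id, race, sex in product(case_ids, RACES, SEXES)
    ]


# Placeholder identifiers standing in for the 40 de-identified patient cases.
case_ids = [f"case_{i:02d}" for i in range(1, 41)]
dataset = build_dataset(case_ids)

print(len(dataset))  # 40 cases x 6 races x 2 sexes = 480
```

Because every case appears once under each race and sex combination, any difference in a model's recommendation across variants of the same case can be attributed to the demographic label rather than to the clinical details.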
The researchers found no differences from the AI models in opioid treatment suggestions for the varying races or sexes. Their analyses also revealed that ChatGPT-4 most frequently rated pain as “severe,” while Gemini’s most common rating was “moderate.” Despite this, Gemini was more likely to recommend opioids, suggesting that ChatGPT-4 is a more conservative model when making opioid prescription recommendations. Additional analyses of these AI tools could help determine which models are more in line with clinical expectations. “These results are reassuring in that patient race, ethnicity, and sex do not affect recommendations, indicating that these LLMs have the potential to help address existing bias in healthcare,” said co-first authors Cameron Young and Ellie Enichen, both students at Harvard Medical School.
The researchers note that not all race- and sex-related categories were studied since individuals of mixed races are unable to fit cleanly into the CDC’s defined classes of race. Moreover, the study evaluated sex as a binary variable (male and female) rather than on a spectrum of gender. Future studies should consider these other factors as well as how race could influence LLM treatment recommendations in other areas of medicine.
“There are many elements that we need to consider when integrating AI into treatment plans, such as the risk of over-prescribing or under-prescribing medications in pain management or whether patients are willing to accept treatment plans influenced by AI,” said Succi. “These are all questions we are considering, and we believe that our study adds key data showing how AI has the ability to reduce bias and improve health equity.”