Study Reveals ChatGPT Overprescribes in Emergency Care

If ChatGPT were deployed in the emergency department, it might recommend unnecessary X-rays and antibiotics for some patients and admit others who did not require hospital treatment, according to a new study from UC San Francisco.

According to the researchers, while the model can be instructed in ways that improve its accuracy, it is still no match for a human doctor’s clinical judgment.

“This is a valuable message to clinicians not to blindly trust these models,” said postdoctoral scholar Chris Williams, MB BChir, principal author of the study, which was published Oct. 8 in Nature Communications. “ChatGPT can answer medical exam questions and help draft clinical notes, but it’s not currently designed for situations that call for multiple considerations, like the situations in an emergency department.”

In earlier work, Williams showed that ChatGPT, a large language model (LLM) that can be used to explore clinical applications of AI, performed marginally better than humans at selecting which of two emergency patients was more acutely ill, a simple choice between patient A and patient B.

In the current work, Williams challenged the AI model with a more complex task: providing the recommendations a physician makes after initially assessing a patient in the emergency department. This involves deciding whether to admit the patient, order X-rays or other scans, or prescribe antibiotics.

AI models are less accurate than residents
For each of the three decisions, the researchers selected 1,000 ED visits from an archive of over 251,000 visits. The sets had the same “yes” to “no” ratio for admission, imaging, and antibiotic decisions as found throughout UCSF Health’s Emergency Department.

Using UCSF’s secure generative AI platform, which includes extensive privacy safeguards, the researchers entered doctors’ notes on each patient’s symptoms and examination findings into ChatGPT-3.5 and ChatGPT-4. They then tested the models’ accuracy on each set using a series of increasingly detailed prompts.

Overall, the AI models tended to recommend services more often than was needed. ChatGPT-4 was 8% less accurate than resident physicians, and ChatGPT-3.5 was 24% less accurate.

Williams speculated that the AI’s tendency to overprescribe may stem from the fact that the models were trained on the internet, where reputable medical advice sites are designed not to answer emergency medical questions but to refer readers to a doctor who can.

“These models are almost fine-tuned to say, ‘seek medical advice,’ which is quite right from a general public safety perspective,” he said. “But erring on the side of caution isn’t always appropriate in the ED setting, where unnecessary interventions could cause patients harm, strain resources and lead to higher costs for patients.”

He said that models such as ChatGPT will need better frameworks for evaluating clinical information before they are ready for the ED. The people who design those frameworks will have to strike a balance between ensuring that the AI does not miss something serious and keeping it from triggering unnecessary exams and expenses.

This means that researchers developing medical applications of AI, along with the wider clinical community and the public, need to consider where to draw those lines and how far to err on the side of caution.

“There’s no perfect solution,” he commented. “But knowing that models like ChatGPT have these tendencies, we’re charged with thinking through how we want them to perform in clinical practice.”

For more information: Evaluating the use of large language models to provide clinical recommendations in the Emergency Department, Nature Communications, https://www.nature.com/articles/s41467-024-52415-1

Haritha Karyampudi
