ChatGPT Excels in Medical Summaries, Lacks Field-Specific Relevance

Study: Quality, Accuracy, and Bias in ChatGPT-Based Summarization of Medical Abstracts

In a recent study published in The Annals of Family Medicine, researchers evaluated how well Chat Generative Pretrained Transformer (ChatGPT) summarizes medical abstracts, with the aim of helping physicians cope with the rapid expansion of clinical knowledge and the limited time available for literature review.

Background

In 2020, nearly a million new journal articles were indexed by PubMed, consistent with estimates that global medical knowledge now doubles roughly every 73 days. This exponential growth, coupled with clinical practice models that prioritize productivity, leaves physicians with little time to keep abreast of the literature, even within their own specialties. Artificial Intelligence (AI) and natural language processing offer promising tools to address this challenge. Large Language Models (LLMs) like ChatGPT, which can generate, summarize, and predict text, have garnered attention for their potential to help physicians review medical literature efficiently. However, LLMs can sometimes generate misleading or non-factual text, termed “hallucinations,” and may reflect biases from their training data, raising concerns about their responsible use in healthcare.

About the study

In this study, researchers selected 10 articles from each of 14 journals, encompassing a broad spectrum of medical topics, article structures, and journal impact factors. The objective was to cover diverse study types while excluding non-research materials. Because ChatGPT was trained on data available only through 2021, selection was restricted to articles published in 2022, eliminating the possibility of prior exposure to the content.

The researchers tasked ChatGPT with summarizing these articles, with self-assessing its summaries for quality, accuracy, and bias, and with rating each article's relevance to ten medical fields. Summaries were limited to 125 words, and data on the model’s performance were collected in a structured database.
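The study's summarization task can be sketched in a few lines. This is a minimal illustration, not the paper's actual protocol: the prompt wording, function names, and the word-count check are assumptions, since the article does not reproduce the exact instructions given to ChatGPT.

```python
def build_prompt(abstract: str, word_limit: int = 125) -> str:
    """Construct a summarization prompt for an LLM.

    Hypothetical wording; the study's exact prompt is not published here.
    Only the 125-word limit comes from the study description.
    """
    return (
        f"Summarize the following medical abstract in at most {word_limit} words, "
        "preserving the study design, key results, and conclusions:\n\n"
        + abstract
    )


def within_limit(summary: str, word_limit: int = 125) -> bool:
    """Check that a generated summary respects the word cap."""
    return len(summary.split()) <= word_limit
```

A pipeline built this way would send `build_prompt(abstract)` to the model, then use `within_limit` to flag summaries that exceed the cap before storing them in the results database.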

Physician reviewers independently evaluated the ChatGPT-generated summaries, assessing quality, accuracy, bias, and relevance using a standardized scoring system. The review process was carefully structured to ensure impartiality and a comprehensive understanding of the summaries’ utility and reliability.

The study conducted detailed statistical and qualitative analyses to compare ChatGPT’s performance against human assessments. This included examining the alignment between ChatGPT’s article relevance ratings and those assigned by physicians, both at the journal and article levels.
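Comparing ChatGPT's relevance ratings with physicians' ratings amounts to measuring the agreement between two sets of ordinal scores. A rank-based measure such as Spearman's ρ is a natural fit for such data; whether the study used this exact statistic is an assumption, so the sketch below is illustrative only, with made-up scores.

```python
def ranks(xs):
    """Assign 1-based ranks, averaging ranks for tied values."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1  # extend the run of tied values
        avg = (i + j) / 2 + 1  # average rank of the tied run
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den


# Hypothetical relevance scores for five articles (0-10 scale)
physician_scores = [9, 7, 8, 3, 5]
chatgpt_scores = [8, 6, 9, 2, 4]
rho = spearman(physician_scores, chatgpt_scores)  # close to 1 = strong agreement
```

A ρ near 1 at the journal level and a markedly lower ρ at the article level would reproduce the pattern the study reports.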

Study results

The study employed ChatGPT to condense 140 medical abstracts from 14 diverse journals, most of which used structured formats. ChatGPT reduced the average abstract length of 2,438 characters by roughly 70%, to 739 characters. Physicians rated these summaries highly for quality and accuracy and found minimal bias, a finding corroborated by ChatGPT’s self-assessment. Notably, ratings did not vary significantly across journals or between structured and unstructured abstract formats. Despite the high ratings, the team identified serious inaccuracies and hallucinations in a small fraction of summaries. These errors ranged from omitted critical data to misinterpretations of study designs, potentially altering the interpretation of research findings. Minor inaccuracies were also noted, typically involving subtle details that did not drastically change the abstract’s original meaning but could introduce ambiguity or oversimplify complex outcomes.

A key aspect of the study was examining ChatGPT’s ability to recognize article relevance to specific medical disciplines. The expectation was that ChatGPT could accurately identify the topical focus of journals, aligning with predefined assumptions about their relevance to various medical fields. This hypothesis held true at the journal level, with a significant alignment between relevance scores assigned by ChatGPT and those by physicians, indicating its strong grasp of overall thematic orientation.

However, when evaluating article relevance to specific medical specialties, ChatGPT’s performance was less impressive, showing only a modest correlation with human-assigned relevance scores. This discrepancy highlighted a limitation in ChatGPT’s ability to accurately pinpoint article relevance within the broader context of medical specialties despite generally reliable performance on a broader scale.

Further analyses, including sensitivity and quality assessments, revealed consistent distribution of quality, accuracy, and bias scores across individual and collective human reviews, as well as those conducted by ChatGPT. This consistency suggested effective standardization among human reviewers and closely aligned with ChatGPT’s assessments, indicating broad agreement on summarization performance despite identified challenges.

Conclusions

In summary, the study’s findings suggested that ChatGPT effectively produced concise, accurate, and low-bias summaries, indicating its potential utility for clinicians in quickly screening articles. However, ChatGPT struggled with accurately determining article relevance to specific medical fields, limiting its potential as a digital agent for literature surveillance. Acknowledging limitations such as its focus on high-impact journals and structured abstracts, the study highlighted the need for further research. It suggests that future iterations of language models may offer improvements in summarization quality and relevance classification, advocating for responsible AI use in medical research and practice.

For more information: Quality, Accuracy, and Bias in ChatGPT-Based Summarization of Medical Abstracts, The Annals of Family Medicine, https://doi.org/10.1370/afm.3075
