ChatGPT shines at medical abstract summarization, struggles with field-specific relevance


In a recent study published in The Annals of Family Medicine, a group of researchers evaluated Chat Generative Pre-trained Transformer (ChatGPT)'s efficacy in summarizing medical abstracts to assist physicians by providing concise, accurate, and unbiased summaries amid the rapid expansion of clinical knowledge and limited review time.

Study: Quality, Accuracy, and Bias in ChatGPT-Based Summarization of Medical Abstracts. Image Credit: PolyPloiid / Shutterstock


In 2020, about a million new journal articles were indexed by PubMed, reflecting the rapid doubling of global medical knowledge every 73 days. This growth, coupled with clinical practice models that prioritize productivity, leaves physicians little time to keep up with the literature, even in their own specialties. Artificial Intelligence (AI) and natural language processing offer promising tools to address this challenge. Large Language Models (LLMs) like ChatGPT, which can generate text, summarize, and predict, have gained attention for potentially aiding physicians in efficiently reviewing medical literature. However, LLMs can produce misleading, non-factual text or "hallucinate" and may reflect biases from their training data, raising concerns about their responsible use in healthcare.

About the study

In the present study, researchers selected 10 articles from each of 14 journals, covering a wide range of medical topics, article structures, and journal impact factors. They aimed to include diverse study types while excluding non-research materials. The selection process was designed to ensure that all articles, published in 2022, were unknown to ChatGPT, which had been trained on data available only up to 2021, to eliminate the possibility of the model having prior exposure to the content.

The researchers then tasked ChatGPT with summarizing these articles, self-assessing the summaries for quality, accuracy, and bias, and rating their relevance across 10 medical fields. They limited summaries to 125 words and collected data on the model's performance in a structured database.
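The workflow described above — a prompt enforcing a 125-word cap, followed by validation before the result is stored — can be sketched in Python. The prompt wording and helper names below are illustrative assumptions, not the authors' actual protocol:

```python
def build_summary_prompt(abstract: str, word_limit: int = 125) -> str:
    """Build an illustrative prompt asking an LLM for a length-capped summary.

    The exact instructions used in the study are not published here; this
    wording is a plausible stand-in.
    """
    return (
        f"Summarize the following medical abstract in at most {word_limit} words. "
        "Be concise, accurate, and unbiased.\n\n"
        f"{abstract}"
    )


def within_word_limit(summary: str, word_limit: int = 125) -> bool:
    """Check whether a returned summary respects the word cap before storage."""
    return len(summary.split()) <= word_limit


# A generated summary would be validated before entering the structured database:
example_summary = "ChatGPT condensed the abstract into a short, unbiased overview."
print(within_word_limit(example_summary))  # prints True
```

Validating length locally, rather than trusting the model to obey the cap, matters because LLMs frequently overshoot explicit word limits.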

Physician reviewers independently evaluated the ChatGPT-generated summaries, assessing them for quality, accuracy, bias, and relevance with a standardized scoring system. Their review process was carefully structured to ensure impartiality and a comprehensive understanding of the summaries' utility and reliability.

The study conducted detailed statistical and qualitative analyses to compare the performance of ChatGPT summaries against human assessments. This included examining the alignment between ChatGPT's article relevance ratings and those assigned by physicians, at both the journal and article levels.

Study results 

The study used ChatGPT to condense 140 medical abstracts from 14 diverse journals, predominantly featuring structured formats. The abstracts, on average, contained 2,438 characters, which ChatGPT reduced by 70% to 739 characters. Physicians evaluated these summaries, rating them highly for quality and accuracy and finding minimal bias, a result mirrored in ChatGPT's self-assessment. Notably, the study observed no significant variance in these ratings when comparing across journals or between structured and unstructured abstract formats.

Despite the high ratings, the team did identify instances of serious inaccuracies and hallucinations in a small fraction of the summaries. These errors ranged from omitted critical data to misinterpretations of study designs, potentially altering the interpretation of research findings. Additionally, minor inaccuracies were noted, typically involving subtle aspects that did not drastically change the abstract's original meaning but could introduce ambiguity or oversimplify complex outcomes.

A key component of the study was examining ChatGPT's capacity to recognize the relevance of articles to specific medical disciplines. The expectation was that ChatGPT could accurately identify the topical focus of journals, aligning with predefined assumptions about their relevance to various medical fields. This assumption held true at the journal level, with significant alignment between the relevance scores assigned by ChatGPT and those assigned by physicians, indicating ChatGPT's strong ability to grasp the general thematic orientation of different journals.

However, when evaluating the relevance of individual articles to specific medical specialties, ChatGPT's performance was less impressive, showing only a modest correlation with human-assigned relevance scores. This discrepancy highlighted a limitation in ChatGPT's ability to accurately pinpoint the relevance of individual articles within the broader context of medical specialties, despite generally reliable performance at a broader scale.

Further analyses, including sensitivity and quality assessments, revealed a consistent distribution of quality, accuracy, and bias scores across individual and collective human reviews as well as those conducted by ChatGPT. This consistency suggested effective standardization among human reviewers and aligned closely with ChatGPT's assessments, indicating broad agreement on the summarization performance despite the challenges identified.


To summarize, the study's findings indicated that ChatGPT effectively produced concise, accurate, and low-bias summaries, suggesting its utility for clinicians in quickly screening articles. However, ChatGPT struggled to accurately determine the relevance of articles to specific medical fields, limiting its potential as a digital agent for literature surveillance. Acknowledging limitations such as its focus on high-impact journals and structured abstracts, the study highlighted the need for further research. It suggests that future iterations of language models may offer improvements in summarization quality and relevance classification, advocating for responsible AI use in medical research and practice.

Journal reference:

  • Joel Hake, Miles Crowley, Allison Coy, et al. Quality, Accuracy, and Bias in ChatGPT-Based Summarization of Medical Abstracts. The Annals of Family Medicine (2024). DOI: 10.1370/afm.3075