Can reasoning LLMs enhance clinical document classification?

Mustafa, Akram, Naseem, Usman, and Rahimi Azghadi, Mostafa (2026) Can reasoning LLMs enhance clinical document classification? Health and Technology. (In Press)

DOI: 10.1007/s12553-025-01041-y

View at Publisher Website: https://doi.org/10.1007/s12553-025-01041...

Abstract

Background: Clinical document classification is a critical process in healthcare, converting unstructured medical texts into standardized ICD-10 diagnoses. The process is challenging because medical language is complex and varied, with domain-specific terminology, abbreviations, and writing styles that differ across institutions. Privacy regulations and the scarcity of high-quality annotated datasets further hinder the development of robust models. Large language models (LLMs) have emerged as a transformative technology in healthcare, improving the efficiency and accuracy of tasks such as clinical document classification through advanced natural language understanding.
Objective: This study evaluates the performance and consistency of LLMs in the binary classification of clinical discharge summaries based on ICD-10 codes. By comparing reasoning and non-reasoning LLMs, it aims to determine how effectively these models identify and classify clinical patterns in a binary context, providing insight into their potential to improve automated clinical coding accuracy and decision support in healthcare settings.
Methods: The study used a balanced subset of the MIMIC-IV dataset comprising 3,000 discharge summaries: 150 positive and 150 negative samples for each of the top 10 ICD-10 codes. The summaries were tokenized using cTAKES, which converted clinical narratives into structured SNOMED codes while capturing contextual details such as affirmation or negation. Eight LLMs, four reasoning models (Qwen QWQ, Deepseek Reasoner, GPT o3 Mini, Gemini 2.0 Flash Thinking) and four non-reasoning models (Llama 3.3, GPT 4o Mini, Gemini 2.0 Flash, Deepseek Chat), were evaluated over three experimental runs. Final predictions were determined by majority voting across the runs and were used to assess accuracy, F1 score, and consistency.
Results: Among the eight evaluated LLMs, reasoning models performed better in ICD-10 classification, achieving an average accuracy of 71% and an average F1 score of 67%, compared with 68% accuracy and a 60% F1 score for non-reasoning models. Gemini 2.0 Flash Thinking achieved the highest accuracy (75%) and F1 score (76%), while GPT 4o Mini had the lowest performance (64% accuracy and 47% F1 score). Consistency analysis showed that non-reasoning models were more stable, with 91% average consistency versus 84% for reasoning models. Performance varied across ICD-10 codes: the models were stronger at identifying well-defined conditions and weaker at classifying abstract diagnostic categories.
Conclusion: The evaluation of reasoning and non-reasoning LLMs in ICD-10 classification highlights a trade-off between accuracy and consistency. Reasoning models achieved higher classification accuracy and F1 scores, excelling in complex clinical cases, while non-reasoning models were more stable across repeated trials. These findings suggest that a hybrid approach, leveraging the strengths of both model types, could optimize automated clinical coding by balancing accuracy and reliability. Future research should explore multi-label classification, domain-specific fine-tuning, and ensemble modeling to enhance performance and generalizability in real-world healthcare applications.
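
Illustrative sketch (not part of the published record): the evaluation step described in the Methods, majority voting over three runs followed by accuracy, F1 score, and consistency, can be shown with a short Python example. This is not the authors' code. The function names and toy data are hypothetical, and the definition of consistency used here (the share of samples on which all runs give the same label) is an assumption, since the abstract does not state how consistency was computed.

```python
def majority_vote(run_predictions):
    """Combine binary predictions from an odd number of runs into one label per sample.

    run_predictions: list of runs, each a list of 0/1 labels in the same sample order.
    """
    voted = []
    for sample_preds in zip(*run_predictions):
        # With an odd number of runs and binary labels there are no ties.
        voted.append(1 if sum(sample_preds) * 2 > len(sample_preds) else 0)
    return voted


def consistency(run_predictions):
    """ASSUMED definition: fraction of samples on which every run agrees."""
    per_sample = list(zip(*run_predictions))
    agreeing = sum(1 for preds in per_sample if len(set(preds)) == 1)
    return agreeing / len(per_sample)


def accuracy_and_f1(y_true, y_pred):
    """Accuracy and F1 score for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


if __name__ == "__main__":
    # Toy data: three runs of one model on six samples (hypothetical values).
    runs = [
        [1, 0, 1, 1, 0, 0],
        [1, 0, 0, 1, 0, 1],
        [1, 1, 0, 1, 0, 0],
    ]
    y_true = [1, 0, 1, 1, 0, 1]

    voted = majority_vote(runs)
    acc, f1 = accuracy_and_f1(y_true, voted)
    print("voted labels:", voted)
    print("accuracy:", round(acc, 3), "F1:", round(f1, 3))
    print("consistency:", consistency(runs))
```

Under this assumed definition, a model can score high on consistency while scoring low on accuracy, which is the kind of trade-off the Results report between non-reasoning and reasoning models.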


Item ID:	90790
Item Type:	Article (Research - C1)
ISSN:	2190-7196
Keywords:	ChatGPT, Clinical coding, DeepSeek, Gemini, Large language model, Llama, Qwen, Reasoning
Copyright Information:	This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/
Date Deposited:	21 May 2026 01:19
FoR Codes:	46 INFORMATION AND COMPUTING SCIENCES > 4602 Artificial intelligence > 460299 Artificial intelligence not elsewhere classified @ 100%
SEO Codes:	22 INFORMATION AND COMMUNICATION SERVICES > 2204 Information systems, technologies and services > 220403 Artificial intelligence @ 100%