News Release

Decision-making performance of large language models vs. human physicians in challenging lung cancer cases: A real-world case-based study

Peer-Reviewed Publication

FAR Publishing Limited

Study workflow for case curation, respondent assignment, decision generation, and blinded evaluation, leading to statistical analysis


Credit: Ning Yang, Kailai L, Baiyang Liu, Xiting Chen, Aimin Jiang, Chang Qi, Wenyi Gan, Lingxuan Zhu, Weiming Mou, Dongqiang Zeng, Mingjia Xiao, Guangdi Chu, Shengkun Peng, Hank Z.H. Wong, Lin Zhang, Hengguo Zhang, Xinpei Deng, Quan Cheng, Bufu Tang, Anqi Lin, Juan Zhou, Peng Luo

Background: Despite the promise shown by large language models (LLMs) for standardized tasks, their multidimensional performance in real-world oncology decision-making remains unevaluated.

This study aims to introduce a framework for evaluating LLM and physician decisions in challenging lung cancer cases.


Methods: We curated 50 challenging lung cancer cases (25 local and 25 published) classified as complex, rare, or refractory. Blinded evaluations on three five-point Likert dimensions (comprehensiveness, specificity, and readability, each rated 1–5) compared standalone LLMs (DeepSeek R1, Claude 3.5, Gemini 1.5, and GPT-4o), physicians stratified by experience level (junior, intermediate, and senior), and AI-assisted junior physicians; intergroup differences and augmentation effects were analyzed statistically.
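To make the evaluation design concrete, the following is a minimal, hypothetical sketch of how blinded 1–5 Likert ratings from three raters might be aggregated per respondent group and dimension and then compared across groups. The simulated data, group labels, and the choice of a Kruskal–Wallis test are illustrative assumptions only, not the authors' actual analysis pipeline.

# Minimal sketch (not the study's code): aggregate blinded 1-5 Likert ratings
# per respondent group and dimension, then test for intergroup differences.
# Group names, rater counts, and the Kruskal-Wallis test are assumptions.
import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(0)
dimensions = ["comprehensiveness", "specificity", "readability"]
groups = ["junior", "intermediate", "senior", "LLM"]

# Simulated ratings: 50 cases x 3 expert raters for each group and dimension.
ratings = {
    g: {d: rng.integers(1, 6, size=(50, 3)) for d in dimensions} for g in groups
}

for d in dimensions:
    # Mean rating per case (averaged over the three raters), then group mean +/- SD.
    per_case = {g: ratings[g][d].mean(axis=1) for g in groups}
    summary = ", ".join(
        f"{g}: {v.mean():.2f}±{v.std(ddof=1):.2f}" for g, v in per_case.items()
    )
    stat, p = kruskal(*per_case.values())  # nonparametric intergroup comparison
    print(f"{d}: {summary} (Kruskal-Wallis p={p:.3f})")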


Results: Of 50 challenging cases (18 complex, 17 rare, and 15 refractory) rated by three experts, DeepSeek R1 achieved scores of 3.95±0.33, 3.71±0.53, and 4.26±0.18 for comprehensiveness, specificity, and readability, respectively, placing it between intermediate (3.68, 3.68, 3.75) and senior (4.50, 4.64, 4.53) physicians. GPT-4o and Claude 3.5 reached intermediate physician–level comprehensiveness (3.76±0.39 and 3.60±0.39) but only junior-to-intermediate physician–level specificity (3.39±0.39 and 3.39±0.49). All LLMs outscored intermediate physicians on rare cases but fell below junior physicians in refractory-case specificity. AI-assisted junior physicians showed marked gains in rare cases, with comprehensiveness rising from 2.32 to 4.29 (a relative gain of 84.8%), specificity from 2.24 to 4.26 (90.8%), and readability from 2.76 to 4.59 (66.0%), whereas specificity declined by 3.2% (from 3.17 to 3.07) in refractory cases. Error analysis revealed complementary strengths: physicians showed greater reasoning stability, while LLMs excelled at knowledge updating and risk management.


Conclusions: LLM performance in clinical decision-making varied by case type: the models performed better in rare cases and worse in refractory cases that require longitudinal reasoning. The complementary strengths of LLMs and physicians support case- and task-tailored human–AI collaboration.



Disclaimer: AAAS and EurekAlert! are not responsible for the accuracy of news releases posted to EurekAlert! by contributing institutions or for the use of any information through the EurekAlert system.