Credit: Ning Yang, Kailai L, Baiyang Liu, Xiting Chen, Aimin Jiang, Chang Qi, Wenyi Gan, Lingxuan Zhu, Weiming Mou, Dongqiang Zeng, Mingjia Xiao, Guangdi Chu, Shengkun Peng, Hank Z.H. Wong, Lin Zhang, Hengguo Zhang, Xinpei Deng, Quan Cheng, Bufu Tang, Anqi Lin, Juan Zhou, Peng Luo
Background: Despite the promise shown by large language models (LLMs) on standardized tasks, their multidimensional performance in real-world oncology decision-making remains unevaluated.
This study introduces a framework for evaluating LLM and physician decision-making in challenging lung cancer cases.
Methods: We curated 50 challenging lung cancer cases (25 local and 25 published), each classified as complex, rare, or refractory. Blinded evaluations on three five-point Likert dimensions (1–5 for comprehensiveness, specificity, and readability) compared standalone LLMs (DeepSeek R1, Claude 3.5, Gemini 1.5, and GPT-4o), physicians stratified by experience (junior, intermediate, and senior), and AI-assisted junior physicians; intergroup differences and augmentation effects were analyzed statistically.
Results: Across the 50 challenging cases (18 complex, 17 rare, and 15 refractory) rated by three experts, DeepSeek R1 scored 3.95±0.33, 3.71±0.53, and 4.26±0.18 for comprehensiveness, specificity, and readability, respectively, placing it between intermediate (3.68, 3.68, 3.75) and senior (4.50, 4.64, 4.53) physicians. GPT-4o and Claude 3.5 reached intermediate physician–level comprehensiveness (3.76±0.39 and 3.60±0.39, respectively) but only junior-to-intermediate physician–level specificity (3.39±0.39 and 3.39±0.49). All LLMs outscored intermediate physicians on rare cases but fell below junior physicians in refractory-case specificity. AI-assisted junior physicians showed marked gains in rare cases, with comprehensiveness rising from 2.32 to 4.29 (an 84.8% increase), specificity from 2.24 to 4.26 (90.8%), and readability from 2.76 to 4.59 (66.0%), whereas in refractory cases specificity declined by 3.2% (3.17 to 3.07). Error analysis revealed complementary strengths: physicians showed greater reasoning stability, while LLMs excelled in knowledge updating and risk management.
Conclusions: LLM performance in clinical decision-making varied by case type: LLMs performed better in rare cases and worse in refractory cases that required longitudinal reasoning. The complementary strengths of LLMs and physicians support case- and task-tailored human–AI collaboration.
Journal
Intelligent Oncology
Method of Research
Observational study
Subject of Research
Not applicable
Article Title
Decision-making performance of large language models vs. human physicians in challenging lung cancer cases: A real-world case-based study
Article Publication Date
26-Jan-2026
COI Statement
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.