News Release

Temporal evolution of large language models in oncology: performance trends of ChatGPT-3.5, ChatGPT-4, and Gemini

Peer-Reviewed Publication

FAR Publishing Limited

Time-dependent performance evaluation of LLMs in oncology


(A) Comparison of MD with 95% CI for subjective question accuracy across different LLMs over various time periods. The x-axis represents the magnitude of MD, with blue squares indicating the MD value of each study, the square size proportional to study weight, and horizontal lines showing 95% CI. Diamonds at the bottom of the figure represent the pooled MD values. (B) Comparison of Risk Ratio (RR) with 95% CI for objective question accuracy across different LLMs over various time periods. Sample size (No. of Studies) and p-values are indicated for each subgroup. The x-axis represents the magnitude of the RR, with blue squares indicating the RR value of each study, the square size proportional to study weight, and horizontal lines showing 95% CI. Diamonds at the bottom of the figure represent the pooled RR values. The legend displays results from either the fixed-effect or the random-effects model, selected on the basis of I², along with the corresponding I² heterogeneity values.

Abbreviations: LLMs, Large Language Models; MD, Mean Difference; RR, Risk Ratio; CI, Confidence Interval.


Credit: Zilin Qiu, Aimin Jiang, Chang Qi, Wenyi Gan, Lingxuan Zhu, Weiming Mou, Dongqiang Zeng, Mingjia Xiao, Guangdi Chu, Shengkun Peng, Hank Z.H. Wong, Lin Zhang, Hengguo Zhang, Xinpei Deng, Quan Cheng, Bufu Tang, Yaxuan Wang, Jian Zhang, Anqi Lin, Peng Luo

Large language models (LLMs) have emerged as transformative tools in healthcare, offering potential value in oncology for information retrieval, clinical decision support, and patient communication. However, the dynamic nature of oncological knowledge—including evolving treatment guidelines and diagnostic standards—raises questions about how LLMs’ performance holds up over time, especially as these models are relied on for increasingly nuanced clinical tasks.

This study, conducted in adherence to PRISMA guidelines, systematically collected relevant literature through 2025 from PubMed, Google Scholar, and Web of Science databases. The research focused on three prominent LLMs: ChatGPT-3.5, ChatGPT-4, and Gemini. Researchers analyzed 614 oncology questions spanning common malignancies (e.g., lung, breast, colorectal cancer) and rare tumors (e.g., glioma, multiple myeloma), using both original study scoring criteria and a standardized five-point Likert scale to assess response accuracy.

Key findings reveal clearly divergent temporal trends across the three models:

  1. ChatGPT-3.5 showed a consistent decline in performance (subjective questions: MD=-3.30; objective questions: RR=0.92). A notable turning point occurred between March 14 and April 26, 2023, where the model’s responses to new questions shifted from outperforming baseline queries to underperforming, with this performance gap continuing to widen over subsequent months.
  2. ChatGPT-4 exhibited a more pronounced drop in accuracy, with statistically significant declines observed (subjective questions: MD=-7.17; objective questions: RR=0.93), despite being a more advanced iteration of the ChatGPT series.
  3. In contrast, Gemini demonstrated steady and significant improvement in oncology question-answering (subjective questions: MD=11.48; objective questions: RR=1.15), outpacing the ChatGPT models as time progressed.
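The pooled MD and RR values above come from standard meta-analytic pooling, in which each study's effect estimate is weighted by the inverse of its variance and the weighted average forms the diamond shown in a forest plot. A minimal sketch of fixed-effect inverse-variance pooling for mean differences (using entirely hypothetical per-study values, not the study's actual data):

```python
import math

def pooled_md_fixed(mds, ses):
    """Fixed-effect inverse-variance pooling of mean differences (MDs).

    mds: per-study mean differences; ses: their standard errors.
    Returns the pooled MD and its 95% confidence interval.
    """
    # Each study is weighted by 1 / variance, so precise studies count more.
    weights = [1.0 / se**2 for se in ses]
    pooled = sum(w * md for w, md in zip(weights, mds)) / sum(weights)
    # Standard error of the pooled estimate under the fixed-effect model.
    se_pooled = math.sqrt(1.0 / sum(weights))
    ci = (pooled - 1.96 * se_pooled, pooled + 1.96 * se_pooled)
    return pooled, ci

# Hypothetical MDs and standard errors for three studies:
md, (lo, hi) = pooled_md_fixed([-3.1, -3.6, -3.2], [0.8, 1.2, 1.0])
```

A random-effects model (used in the figure when I² indicates substantial heterogeneity) follows the same weighted-average scheme but inflates each study's variance by an estimated between-study component, widening the pooled confidence interval.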

Subjective questions—those requiring complex analysis, integration of clinical context, and nuanced judgment—were far more susceptible to temporal performance degradation than objective, fact-based queries. This disparity highlights the unique challenges LLMs face in applying evolving clinical knowledge to real-world oncology scenarios, where flexibility and alignment with the latest standards are critical.

The study’s results provide vital guidance for the responsible deployment of LLMs in oncology. As healthcare systems increasingly adopt these AI tools to support patient care and clinical decision-making, ongoing performance monitoring, standardized evaluation protocols, and strategies to integrate up-to-date clinical data will be essential to ensure safety and reliability.


Disclaimer: AAAS and EurekAlert! are not responsible for the accuracy of news releases posted to EurekAlert! by contributing institutions or for the use of any information through the EurekAlert system.