Time-dependent performance evaluation of LLMs in oncology (IMAGE)
Caption
(A) Comparison of the mean difference (MD), with 95% confidence interval (CI), in subjective question accuracy across different LLMs over various time periods. The x-axis shows the magnitude of the MD. Each blue square marks the MD of one study, with square size proportional to study weight, and the horizontal line through it shows the 95% CI. The diamonds at the bottom of the figure represent the pooled MD values. (B) Comparison of the risk ratio (RR), with 95% CI, in objective question accuracy across different LLMs over various time periods. The number of studies and the p-value are indicated for each subgroup. The x-axis shows the magnitude of the RR. Each blue square marks the RR of one study, with square size proportional to study weight, and the horizontal line through it shows the 95% CI. The diamonds at the bottom of the figure represent the pooled RR values. The legend reports results from either the fixed-effect or the random-effects model, selected according to I², together with the corresponding I² heterogeneity values.
Abbreviations: LLMs, Large Language Models; MD, Mean Difference; RR, Risk Ratio; CI, Confidence Interval.
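The pooling behind the diamonds in the caption can be illustrated with a minimal Python sketch. It assumes inverse-variance weighting, a DerSimonian-Laird estimate of between-study variance, and a 50% I² cutoff for switching from the fixed-effect to the random-effects model. None of these choices is stated in the caption, and the function name `pool_effects` is hypothetical. For RR, the inputs would be log risk ratios and their variances, with the result exponentiated afterwards.

```python
import math


def pool_effects(effects, variances, i2_threshold=50.0):
    """Pool per-study effects (MD, or log RR) by inverse-variance weighting.

    Assumption: the random-effects model is reported when I² exceeds
    i2_threshold (50% here is an illustrative cutoff, not from the source).
    """
    # Fixed-effect weights: each study counts in proportion to 1 / variance.
    weights = [1.0 / v for v in variances]
    sum_w = sum(weights)
    fixed = sum(w * y for w, y in zip(weights, effects)) / sum_w

    # Cochran's Q and the I² heterogeneity index (as a percentage).
    q = sum(w * (y - fixed) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0

    # DerSimonian-Laird between-study variance (tau²), floored at zero.
    c = sum_w - sum(w * w for w in weights) / sum_w
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    re_weights = [1.0 / (v + tau2) for v in variances]
    sum_re = sum(re_weights)
    random_est = sum(w * y for w, y in zip(re_weights, effects)) / sum_re

    # Choose which model to report, mirroring the "based on I²" rule.
    use_random = i2 > i2_threshold
    estimate = random_est if use_random else fixed
    se = math.sqrt(1.0 / (sum_re if use_random else sum_w))
    return {
        "model": "random" if use_random else "fixed",
        "estimate": estimate,
        "ci95": (estimate - 1.96 * se, estimate + 1.96 * se),
        "i2": i2,
    }
```

Calling `pool_effects([0.0, 1.0], [0.01, 0.01])` on two strongly disagreeing hypothetical studies gives an I² of 98% and therefore reports the random-effects estimate, with a wider interval than the fixed-effect model would give.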
Credit
Zilin Qiu, Aimin Jiang, Chang Qi, Wenyi Gan, Lingxuan Zhu, Weiming Mou, Dongqiang Zeng, Mingjia Xiao, Guangdi Chu, Shengkun Peng, Hank Z.H. Wong, Lin Zhang, Hengguo Zhang, Xinpei Deng, Quan Cheng, Bufu Tang, Yaxuan Wang, Jian Zhang, Anqi Lin, Peng Luo
Usage Restrictions
None
License
Original content