Understanding complex biological pathways, such as gene-gene interactions and gene regulatory networks, is crucial for exploring disease mechanisms and advancing drug development. However, manual literature curation of these pathways cannot keep pace with the exponential growth of discoveries. Large language models (LLMs) trained on extensive text corpora contain rich biological information and can be queried much like a biological knowledge graph to support pathway curation.
Recently, Quantitative Biology published a study titled "A Comprehensive Evaluation of Large Language Models in Mining Gene Relations and Pathway Knowledge." The study assesses the ability of 21 LLMs, including both API-based and open-source models, to retrieve biological knowledge. The evaluation covers two tasks: predicting gene regulatory relations (activation, inhibition, and phosphorylation) and identifying the gene components of pathways. Both tasks use Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways as the ground truth, as illustrated in Figure 1.
The results reveal a significant disparity in model performance, with API-based models outperforming their open-source counterparts. GPT-4 and Claude-Pro emerged as the top performers in predicting gene regulatory relations, achieving higher precision and recall than the other models. The findings suggest that while LLMs are informative for gene network analysis and pathway mapping, their effectiveness varies, so models must be selected carefully. The study underscores the importance of choosing appropriate computational tools for specific tasks in biological research, and it provides a case study of using LLMs as knowledge graphs for data mining more generally.
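To make the precision and recall comparison concrete, the following is a minimal sketch of how predicted gene relations could be scored against a ground-truth set. It is an illustration only and not the study's actual scoring procedure: the `Relation` type, the `precision_recall` function, and the placeholder gene names are all assumptions, and the toy data comes from neither KEGG nor the paper.

```python
from typing import Iterable, NamedTuple, Tuple


class Relation(NamedTuple):
    """A directed gene relation (hypothetical representation)."""
    source: str
    target: str
    kind: str  # "activation", "inhibition", or "phosphorylation"


def precision_recall(
    predicted: Iterable[Relation], truth: Iterable[Relation]
) -> Tuple[float, float]:
    """Score predicted relations against a ground-truth set by exact match."""
    predicted_set, truth_set = set(predicted), set(truth)
    true_positives = len(predicted_set & truth_set)
    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(predicted_set) if predicted_set else 0.0
    recall = true_positives / len(truth_set) if truth_set else 0.0
    return precision, recall


# Toy ground truth with placeholder gene names (not real KEGG entries).
truth = {
    Relation("GENE_A", "GENE_B", "activation"),
    Relation("GENE_B", "GENE_C", "inhibition"),
    Relation("GENE_C", "GENE_D", "phosphorylation"),
}

# Toy model output: one correct relation, one with the wrong relation type.
predicted = {
    Relation("GENE_A", "GENE_B", "activation"),
    Relation("GENE_B", "GENE_C", "activation"),
}

precision, recall = precision_recall(predicted, truth)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this sketch a prediction counts as correct only when the source gene, target gene, and relation type all match, so a model that names the right gene pair but the wrong relation type is penalized on both metrics.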
Journal
Quantitative Biology
DOI
Method of Research
Experimental study
Subject of Research
Not applicable
Article Title
A comprehensive evaluation of large language models in mining gene relations and pathway knowledge
Article Publication Date
19-Jun-2024