News Release

A 70-billion-parameter large language model tailored for chemical engineering

Peer-Reviewed Publication

Dalian Institute of Chemical Physics, Chinese Academy of Sciences

Graphical Abstract

ChemELLM, a 70-billion-parameter LLM tailored for chemical engineering, outperforms leading LLMs (e.g., DeepSeek-R1) on ChemEBench across 101 tasks. Trained on ChemEData's 19 billion pretraining tokens and 1 billion fine-tuning tokens, it accelerates lab-to-fab innovation.

Credit: Chinese Journal of Catalysis

The development of chemical technologies is a multi-stage process that typically begins with laboratory research, progresses through scale-up and basic engineering, and culminates in industrial deployment. This complex process requires synergistic collaboration among experts from diverse disciplines, such as chemistry, physics, mathematics, electrical engineering, process design, and architecture, to address technical bottlenecks while balancing economic viability. However, interdisciplinary collaboration is often hindered by disciplinary boundaries, making it difficult to maintain consistency of design intent throughout chemical process development. Emerging data-driven artificial intelligence (AI) technologies have gained recognition for their potential to streamline development pipelines and enhance process efficiency. In particular, large language models (LLMs), trained on extensive corpora that encapsulate complex, cross-disciplinary information, offer unprecedented opportunities to revolutionize scientific workflows.

Recently, a research team led by Prof. Mao Ye (Dalian Institute of Chemical Physics, Chinese Academy of Sciences) and Prof. Xin Li (iFLYTEK Co., Ltd.) developed ChemELLM, a domain-specialized LLM designed for chemical engineering applications. Built upon the Spark-70B foundation model, ChemELLM underwent domain-adaptive pretraining and instruction fine-tuning using ChemEData, a carefully curated corpus of high-quality chemical engineering data. Additionally, to assess the knowledge and problem-solving capabilities of LLMs in this field, the team introduced ChemEBench, a comprehensive benchmark designed for chemical engineering. The results were published in the Chinese Journal of Catalysis (DOI: 10.1016/S1872-2067(25)64725-5).

The team constructed ChemEData, a specialized dataset comprising 19 billion tokens for pretraining and 1 billion tokens for fine-tuning. Domain-adaptive pretraining was conducted on the Spark-70B foundation model using the 19-billion-token chemical engineering corpus, enabling ChemELLM to acquire domain-specific knowledge while retaining Spark-70B's foundational capabilities. During the supervised fine-tuning phase, 2.75 million high-quality instruction samples (1 billion tokens) were used to align the model with the specific language patterns and terminology of chemical engineering.
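For readers who want a concrete picture of this two-stage recipe, the minimal Python sketch below illustrates domain-adaptive pretraining followed by supervised fine-tuning using the open-source Hugging Face transformers and datasets libraries. The paper does not disclose its training stack; the checkpoint path, corpus file name, and hyperparameters here are placeholders, not the authors' actual configuration.

from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

BASE_MODEL = "path/to/spark-70b"  # hypothetical local checkpoint path

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

def tokenize(batch):
    # Truncate raw domain text into fixed-length causal-LM samples.
    return tokenizer(batch["text"], truncation=True, max_length=2048)

# Stage 1: domain-adaptive pretraining on the chemical engineering corpus
# (ChemEData's 19B pretraining tokens; a local text file stands in here).
corpus = load_dataset("text", data_files={"train": "cheme_corpus.txt"})
corpus = corpus.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="chemellm-dapt",
                           per_device_train_batch_size=1,
                           num_train_epochs=1),
    train_dataset=corpus["train"],
    # mlm=False selects the next-token (causal) objective.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()

# Stage 2: supervised fine-tuning reuses the same loop, but each sample is
# an instruction-response pair rendered into a single training string
# (2.75 million pairs, ~1 billion tokens in the paper).

In practice a 70B-parameter model would also require multi-GPU parallelism (e.g., DeepSpeed or FSDP); the sketch omits this to keep the two-stage structure visible.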

The ChemEBench benchmark integrates three progressive evaluation stages (basic knowledge, advanced knowledge, and professional skills) to comprehensively assess LLMs in this specialized domain. Evaluation results highlight ChemELLM's superior performance over mainstream LLMs (including o1-preview, GPT-4o, and DeepSeek-R1) on ChemEBench, demonstrating its excellence in chemical engineering tasks.
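The paper's release does not spell out the evaluation harness, but a tiered benchmark of this kind can be scored along the following lines. In this Python sketch, the JSONL layout, the field names (tier, question, answer), and the exact-match metric are assumptions for illustration, not the published ChemEBench protocol.

import json
from collections import defaultdict

def score_benchmark(path, generate):
    """Score a tiered benchmark; `generate(prompt)` returns the model's
    answer string (e.g., a wrapped inference or API call)."""
    hits, totals = defaultdict(int), defaultdict(int)
    with open(path, encoding="utf-8") as f:
        for line in f:
            item = json.loads(line)  # one task per JSON line
            tier = item["tier"]      # "basic" / "advanced" / "professional"
            totals[tier] += 1
            # Exact-match scoring; open-ended professional-skill tasks
            # often require rubric- or LLM-judge-based scoring instead.
            if generate(item["question"]).strip() == item["answer"].strip():
                hits[tier] += 1
    return {tier: hits[tier] / totals[tier] for tier in totals}

# Example: per-tier accuracy for a trivial constant "model".
# print(score_benchmark("chemebench.jsonl", lambda q: "42"))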

About the Journal

Chinese Journal of Catalysis is co-sponsored by the Dalian Institute of Chemical Physics, Chinese Academy of Sciences, and the Chinese Chemical Society, and is currently published by Elsevier. This monthly journal publishes timely, original, and rigorously reviewed contributions in English covering all areas of catalysis. The journal publishes Reviews, Accounts, Communications, Articles, Highlights, Perspectives, and Viewpoints of high scientific value that help in understanding and defining new concepts in both fundamental issues and practical applications of catalysis. Chinese Journal of Catalysis ranks among the top journals in Applied Chemistry, with a current SCI impact factor of 15.7. The Editors-in-Chief are Profs. Can Li and Tao Zhang.

At Elsevier http://www.journals.elsevier.com/chinese-journal-of-catalysis

Manuscript submission https://mc03.manuscriptcentral.com/cjcatal
