News Release

Beyond bigger models: How efficient multimodal AI is redefining the future of intelligence

Peer-Reviewed Publication

Tsinghua University Press


A generalized architectural blueprint for building efficient MLLMs. This template achieves efficiency through a combination of component choices and data-flow optimization. Key strategies include: (1) Lightweight vision encoder: employing a smaller vision backbone to reduce the initial cost of feature extraction. (2) Vision token compression: a critical step that reduces the number of visual tokens, significantly decreasing the sequence length and the computational load on the subsequent language model. (3) Efficient vision-language projector: utilizing a low-parameter projector to align visual and textual modalities with minimal overhead. (4) Compact language model backbone: using a compact LLM backbone (e.g., 1B–3B parameters) as the central reasoning component. Note that this diagram illustrates the structural approach; further significant efficiency gains come from model compression techniques such as quantization and pruning, applied to the weights of both the vision encoder and the language model, but these are not depicted here for clarity. A minimal code sketch of this template appears below.


Credit: Visual Intelligence
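
For readers who want a concrete sense of how the template in the figure fits together, the minimal PyTorch sketch below wires the four components in sequence. The class name, module sizes, pooling-based compression, and placeholder backbone are illustrative assumptions made for exposition, not the authors' implementation.

```python
# A minimal sketch of the efficiency-oriented MLLM template shown in the figure.
# All names and sizes are illustrative assumptions, not code from the survey.
import torch
import torch.nn as nn

class EfficientMLLMSketch(nn.Module):
    def __init__(self, vis_dim=384, llm_dim=2048, keep_tokens=64):
        super().__init__()
        # (1) Lightweight vision encoder: a small Transformer stands in for a compact ViT.
        layer = nn.TransformerEncoderLayer(d_model=vis_dim, nhead=6, batch_first=True)
        self.vision_encoder = nn.TransformerEncoder(layer, num_layers=4)
        # (2) Vision token compression: adaptive pooling shrinks N visual tokens to keep_tokens.
        self.compress = nn.AdaptiveAvgPool1d(keep_tokens)
        # (3) Efficient vision-language projector: a single low-parameter linear layer.
        self.projector = nn.Linear(vis_dim, llm_dim)
        # (4) Compact language model backbone: placeholder for a ~1B-3B parameter LLM.
        self.llm = nn.Identity()

    def forward(self, patch_embeddings, text_embeddings):
        v = self.vision_encoder(patch_embeddings)               # (B, N, vis_dim)
        v = self.compress(v.transpose(1, 2)).transpose(1, 2)    # (B, keep_tokens, vis_dim)
        v = self.projector(v)                                   # (B, keep_tokens, llm_dim)
        fused = torch.cat([v, text_embeddings], dim=1)          # prepend visual tokens to text
        return self.llm(fused)

model = EfficientMLLMSketch()
patches = torch.randn(1, 576, 384)   # e.g., a 24x24 patch grid from one image
text = torch.randn(1, 32, 2048)      # an embedded text prompt
print(model(patches, text).shape)    # torch.Size([1, 96, 2048]) -> 64 visual + 32 text tokens
```

The point of the structure is visible in the shapes: 576 patch tokens enter, but only 64 reach the (placeholder) language model backbone, which is where most of the computation lives.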

The rapid success of large multimodal models has largely followed a simple rule: bigger models trained on more data tend to perform better. However, this scaling strategy comes at a steep price. Training and running state-of-the-art multimodal models often demand enormous computing resources, consume substantial energy, and depend on centralized cloud infrastructure, creating barriers for researchers and limiting practical deployment. These challenges are especially pronounced in multimodal systems, where visual inputs generate long token sequences that dramatically increase computational complexity. Together, they point to a clear need for in-depth research on efficient multimodal large language models (MLLMs).

Researchers from Shanghai Jiao Tong University and collaborating institutions reported a comprehensive survey on efficient multimodal large language models, published (DOI: 10.1007/s44267-025-00099-6) in Visual Intelligence in December 2025. The review systematically analyzes recent progress in designing lighter, faster, and more resource-efficient multimodal AI systems that integrate vision and language. By organizing advances in model architecture, training strategies, data efficiency, and real-world applications, the study provides a structured overview of how multimodal intelligence can be scaled down for broader accessibility while maintaining strong reasoning and perception capabilities.

The study reveals that improving multimodal efficiency requires solutions that go beyond traditional language-model compression. A central challenge lies in handling visual tokens, which can number in the hundreds or thousands for a single image and dramatically increase computational cost. To address this, researchers highlight vision token compression techniques that reduce redundant visual information before it reaches the language model, significantly lowering inference complexity.
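
As a concrete illustration of the idea, the short sketch below prunes a set of visual tokens down to a fixed budget by scoring each token against a crude global summary vector. The scoring rule and the budget are illustrative assumptions; the survey reviews a range of compression strategies, of which this is only one simple variant.

```python
# A minimal sketch of pruning-style vision token compression.
# The cosine-similarity scoring rule is an illustrative assumption, not a method from the paper.
import torch
import torch.nn.functional as F

def prune_visual_tokens(tokens: torch.Tensor, keep: int) -> torch.Tensor:
    """tokens: (batch, num_tokens, dim); returns (batch, keep, dim)."""
    summary = tokens.mean(dim=1, keepdim=True)                  # crude global summary vector
    scores = F.cosine_similarity(tokens, summary, dim=-1)       # (batch, num_tokens)
    topk = scores.topk(keep, dim=1).indices                     # most representative tokens
    topk = topk.sort(dim=1).values                              # preserve original spatial order
    batch_idx = torch.arange(tokens.size(0)).unsqueeze(-1)
    return tokens[batch_idx, topk]                              # gather only the kept tokens

visual_tokens = torch.randn(2, 576, 1024)    # e.g., 576 patch tokens per image
compressed = prune_visual_tokens(visual_tokens, keep=144)
print(compressed.shape)                      # torch.Size([2, 144, 1024]) -> a 4x shorter sequence
```

Because self-attention cost grows quadratically with sequence length, a fourfold shorter visual sequence yields far more than a fourfold saving in attention computation.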

Another key strategy involves rethinking model architecture. Lightweight vision encoders, compact language backbones, and efficient vision–language projectors are shown to play crucial roles in balancing performance and resource use. The review also emphasizes emerging designs such as mixture-of-experts layers, which selectively activate model components to add capacity without a proportional increase in computation, as well as alternatives to the standard Transformer.
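
The selective-activation idea behind mixture-of-experts can be sketched in a few lines: a small router sends each token to only a couple of expert networks, so total capacity grows with the number of experts while per-token computation stays roughly constant. The expert count, top-k routing, and sizes below are illustrative assumptions.

```python
# A minimal sketch of sparse mixture-of-experts routing (illustrative sizes and settings).
import torch
import torch.nn as nn

class SparseMoELayer(nn.Module):
    def __init__(self, dim=512, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, num_experts)               # scores each token per expert
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):                                       # x: (batch, seq, dim)
        gates = self.router(x).softmax(dim=-1)                  # (batch, seq, num_experts)
        weights, picked = gates.topk(self.top_k, dim=-1)        # route each token to its top-k experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = picked[..., slot] == e                   # tokens assigned to expert e
                if mask.any():
                    out[mask] += weights[..., slot][mask].unsqueeze(-1) * expert(x[mask])
        return out                                              # only top_k of num_experts experts run per token

layer = SparseMoELayer()
tokens = torch.randn(2, 16, 512)
print(layer(tokens).shape)                                      # torch.Size([2, 16, 512])
```

Here eight experts hold the parameters, but each token only pays for two of them, which is the sense in which capacity grows without a proportional increase in computation.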

Beyond architecture, training strategies and efficiency-aware datasets are identified as critical enablers. Instruction tuning, parameter-efficient fine-tuning, and carefully designed benchmarks allow smaller models to retain strong generalization abilities. Together, these approaches demonstrate that efficiency is not achieved through a single technique, but through coordinated optimization across the entire multimodal pipeline.
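
As one concrete example of parameter-efficient fine-tuning, the sketch below wraps a frozen pretrained layer with a small trainable low-rank adapter in the style of LoRA; the rank, scaling factor, and layer size are illustrative assumptions rather than settings taken from the survey.

```python
# A minimal sketch of low-rank adaptation (LoRA-style) parameter-efficient fine-tuning.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)                       # freeze the pretrained weight
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.down = nn.Linear(base.in_features, rank, bias=False)    # trainable low-rank factor A
        self.up = nn.Linear(rank, base.out_features, bias=False)     # trainable low-rank factor B
        nn.init.zeros_(self.up.weight)                               # adapter starts as an exact no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))

pretrained = nn.Linear(2048, 2048)
adapted = LoRALinear(pretrained)
trainable = sum(p.numel() for p in adapted.parameters() if p.requires_grad)
total = sum(p.numel() for p in adapted.parameters())
print(f"trainable {trainable} / total {total}")                      # under 1% of the layer's parameters
```

Only the two small low-rank matrices are updated during fine-tuning, which is why such recipes let compact models adapt to new tasks and instructions at a fraction of the memory and compute of full fine-tuning.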

According to the authors, the shift toward efficient multimodal models represents more than technical optimization. “Efficiency determines who can build, deploy, and benefit from multimodal AI,” said Prof. Lizhuang Ma, who led the study. By lowering computational barriers, efficient models democratize access to advanced AI capabilities while addressing concerns about energy consumption, privacy, and centralized control. The authors emphasize that efficiency-oriented research also enables multimodal systems to operate in real-time and resource-limited environments, opening new opportunities for responsible and inclusive AI development.

Efficient multimodal large language models have far-reaching implications across science, industry, and society. By reducing memory and computation requirements, these models can be deployed on mobile devices, autonomous systems, and edge platforms where cloud access is limited or undesirable. This enables practical applications in healthcare, remote sensing, document analysis, and intelligent assistants while improving data privacy and responsiveness. More broadly, the research suggests a future where progress in AI is measured not only by scale, but by how effectively intelligence can be delivered under real-world constraints. Efficient multimodal models may thus define the next phase of AI innovation.

Funding information

This work was supported by the National Natural Science Foundation of China (Nos. 62302167, U23A20343 and 72192821).


About the Authors

Dr. Lizhuang Ma received his B.S. and Ph.D. degrees from Zhejiang University, China, in 1985 and 1991, respectively. He is now a Distinguished Professor at the Department of Computer Science and Engineering, Shanghai Jiao Tong University, China, and the School of Computer Science and Technology, East China Normal University, China. His research interests include computer vision, computer-aided geometric design, computer graphics, scientific data visualization, computer animation, digital media technology, and the theory and applications of computer graphics and CAD/CAM. He is a Fellow of the China Society of Image and Graphics (CSIG).

Dr. Xin Tan is currently a Research Professor (Zijiang Young Scholar) with the School of Computer Science and Technology, East China Normal University, China. He is also an Associate Research Professor at Shanghai AI Laboratory. Before that, he was an Associate Research Professor at ECNU. He received dual Ph.D. degrees in Computer Science from Shanghai Jiao Tong University and City University of Hong Kong in 2022. His research interests lie in 3D vision and trustworthy embodied AI. He serves as a program committee member and reviewer for CVPR, ICCV, ECCV, AAAI, IJCAI, IEEE TPAMI, TIP, and IJCV, and as an associate editor for Pattern Recognition and The Visual Computer.

About Visual Intelligence

Visual Intelligence is an international, peer-reviewed, open-access journal devoted to the theory and practice of visual intelligence. This journal is the official publication of the China Society of Image and Graphics (CSIG), with Article Processing Charges fully covered by the Society. It focuses on the foundations of visual computing, the methodologies employed in the field, and the applications of visual intelligence, while particularly encouraging submissions that address rapidly advancing areas of visual intelligence research.

