Thumbnail Title
Making Large Language Models More Efficient for Real-World Deployment

Prof. WANG Wei Received ACM EuroSys 2025 Best Paper Award

Content Banner
Prof. Wang Wei’s co-authored paper, titled “SpInfer: Leveraging Low-Level Sparsity for Efficient Large Language Model Inference on GPUs”, is one of only two Best Papers recognized at ACM EuroSys 2025 out of around 700 submissions.
Body

Prof. WANG Wei, Associate Professor of the Department of Computer Science and Engineering at HKUST, was honored with an ACM EuroSys 2025 Best Paper Award at the 20th European Conference on Computer Systems (EuroSys) for his co-authored paper, “SpInfer: Leveraging Low-Level Sparsity for Efficient Large Language Model Inference on GPUs”.

The paper’s first author is FAN Ruibo, a PhD student at HKUST(GZ) co-supervised by Prof. Wang and Prof. CHU Xiaowen of HKUST(GZ); the other co-authors are collaborators from HKUST(GZ) and Harbin Institute of Technology, Shenzhen (HITSZ).

EuroSys is a premier conference on systems software research and development, known for its rigorous selection process. This year, the conference accepted 85 papers from a total of 696 submissions, a highly competitive acceptance rate of roughly 12%, and only two of the accepted papers earned the Best Paper Award. EuroSys 2025 was held in Rotterdam, the Netherlands, from March 30 to April 3, 2025.

In their groundbreaking work, Prof. Wang and his collaborators tackled the pressing challenge of making large language models (LLMs) more efficient for real-world deployment. LLMs, while powerful, require enormous computational resources, making them difficult to run on standard hardware. The team developed SpInfer, a novel framework that exploits unstructured pruning, a technique that removes the least important individual weights from a model, to reduce both memory usage and computation time. By introducing a new way to store and process these pruned models, optimized for modern GPUs, SpInfer achieves significant speed and memory improvements, making it possible to run large AI models faster and more cost-effectively. Importantly, SpInfer is the first system to turn the theoretical promise of unstructured pruning into real-world performance gains for LLM inference, setting a new standard for efficiency in AI systems.
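To make the underlying idea concrete, the sketch below illustrates unstructured magnitude pruning and bitmap-compressed sparse storage in plain NumPy. It is an illustrative assumption, not the paper’s actual method: the function names, the magnitude-based pruning criterion, the 70% sparsity level, and the decompress-then-multiply step are all simplifications, whereas SpInfer defines its own GPU-oriented compressed format and fused kernels that avoid ever materializing the dense matrix.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude entries (unstructured pruning)."""
    k = int(weights.size * sparsity)
    threshold = np.partition(np.abs(weights).ravel(), k)[k]
    return np.where(np.abs(weights) < threshold, 0.0, weights)

def bitmap_compress(pruned: np.ndarray):
    """Keep only nonzero values plus a 1-bit-per-entry occupancy bitmap."""
    mask = pruned != 0.0
    return pruned[mask], np.packbits(mask.ravel()), pruned.shape

def bitmap_matmul(values, bitmap, shape, x: np.ndarray) -> np.ndarray:
    """Naive reference: decompress the bitmap, then multiply.

    A real sparse kernel would fuse decoding and computation on the GPU
    instead of rebuilding the dense matrix as done here.
    """
    n = shape[0] * shape[1]
    mask = np.unpackbits(bitmap, count=n).reshape(shape).astype(bool)
    dense = np.zeros(shape, dtype=values.dtype)
    dense[mask] = values  # scatter nonzeros back in row-major order
    return dense @ x

# Example: prune a random weight matrix to ~70% sparsity, then run one matmul.
rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256)).astype(np.float32)
W_pruned = magnitude_prune(W, sparsity=0.7)
values, bitmap, shape = bitmap_compress(W_pruned)
x = rng.standard_normal((256, 16)).astype(np.float32)
y = bitmap_matmul(values, bitmap, shape, x)
print(f"kept {values.size}/{W.size} weights; output shape {y.shape}")
```

The storage saving comes from replacing a full float matrix with the surviving values plus one bit per position; the performance challenge the paper addresses is making arithmetic on such irregularly scattered nonzeros fast on GPU hardware that favors dense, regular computation.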

Related link: