SlimQwen: Optimizing Large MoE Model Compression Through Pruning and Distillation

May 12, 2026
6:50

This video introduces SlimQwen, a framework for efficiently compressing large-scale Mixture-of-Experts (MoE) language models. The authors show that **structural pruning** gives a better initialization than training from scratch, and that a gradual compression schedule reduces information loss. They also propose a partial preservation strategy that selectively keeps some experts and merges others, along with a multi-token prediction (MTP) distillation technique that is particularly helpful on knowledge-intensive tasks. With this methodology, they compress the Qwen3-Next-80A3B model to 23A2B, a little over a quarter of the original size, while maintaining strong performance. The paper serves as a practical guide to improving the training and inference efficiency of large-scale MoE models. https://arxiv.org/pdf/2605.08738
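
To make the idea of keeping some experts and merging the rest more concrete, here is a minimal sketch in plain Python. It is an illustration only, not the procedure from the paper: the function name `preserve_and_merge`, the use of a per-expert importance score, the cosine-similarity matching, and the importance-weighted average are all assumptions chosen for clarity, and each expert is reduced to a flat list of weights.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two flattened weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def preserve_and_merge(
    experts: List[Vector], importance: List[float], keep: int
) -> List[Vector]:
    """Hypothetical helper: keep the `keep` most important experts and fold
    every other expert into its most similar kept expert.

    Assumes all importance scores are positive. The merge rule is an
    importance-weighted average, which is an assumption of this sketch.
    """
    order = sorted(range(len(experts)), key=lambda i: importance[i], reverse=True)
    kept, dropped = order[:keep], order[keep:]

    # Running weighted sums and total weights for each kept expert.
    sums = {i: [importance[i] * w for w in experts[i]] for i in kept}
    totals = {i: importance[i] for i in kept}

    for j in dropped:
        # Pick the kept expert whose weights point in the most similar direction.
        target = max(kept, key=lambda i: cosine(experts[i], experts[j]))
        sums[target] = [
            s + importance[j] * w for s, w in zip(sums[target], experts[j])
        ]
        totals[target] += importance[j]

    # Normalize so each merged expert is a weighted average of its members.
    return [[s / totals[i] for s in sums[i]] for i in kept]


if __name__ == "__main__":
    toy_experts = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
    toy_importance = [3.0, 2.0, 1.0]
    print(preserve_and_merge(toy_experts, toy_importance, keep=2))
```

In the toy call, the third expert is the least important and is closest to the first, so it is averaged into the first while the second is preserved unchanged. A real MoE layer would also need its router adjusted to match the reduced expert set, which this sketch leaves out.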

