MoE-Gen: High-Throughput MoE Inference on a Single GPU with Module-Based Batching

March 12, 2025 · Declared Dead · 🏛 arXiv.org

💀 CAUSE OF DEATH: 404 Not Found
Code link is broken/dead
Authors: Tairan Xu, Leyang Xue, Zhan Lu, Adrian Jackson, Luo Mai
arXiv ID: 2503.09716
Category: cs.DC (Distributed Computing)
Cross-listed: cs.LG
Citations: 3
Venue: arXiv.org
Repository: https://github.com/EfficientMoE/MoE-Gen
Last Checked: 2 months ago
Abstract
This paper presents MoE-Gen, a high-throughput MoE inference system optimized for single-GPU execution. Existing inference systems rely on model-based or continuous batching strategies, originally designed for interactive inference, which result in excessively small batches for MoE's key modules (attention and expert modules), leading to poor throughput. To address this, we introduce module-based batching, which accumulates tokens in host memory and dynamically launches large batches on GPUs to maximize utilization. Additionally, we optimize the choice of batch sizes for each module in an MoE to fully overlap GPU computation and communication, maximizing throughput. Evaluation demonstrates that MoE-Gen achieves 8-31x higher throughput compared to state-of-the-art systems employing model-based batching (FlexGen, MoE-Lightning, DeepSpeed), and offers even greater throughput improvements over continuous batching systems (e.g., vLLM and Ollama) on popular MoE models (DeepSeek and Mixtral) across offline inference tasks. MoE-Gen's source code is publicly available at https://github.com/EfficientMoE/MoE-Gen
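The accumulate-then-launch idea in the abstract can be illustrated with a toy sketch. This is not MoE-Gen's implementation: `ModuleBatcher`, `route`, and `launch_size` are hypothetical names introduced here, the "launch" is only a recorded tuple standing in for a GPU kernel call, and the fixed batch size stands in for the per-module sizes the paper says it optimizes.

```python
from collections import defaultdict


class ModuleBatcher:
    """Toy host-side buffer: accumulate tokens per module, launch when full.

    Hypothetical illustration only; not taken from the MoE-Gen code base.
    """

    def __init__(self, launch_size):
        self.launch_size = launch_size    # assumed fixed batch size per module
        self.buffers = defaultdict(list)  # module name -> pending tokens (stands in for host memory)
        self.launches = []                # record of (module, batch_size) launches

    def add(self, module, token):
        """Buffer one token for a module; launch once the buffer is full."""
        buf = self.buffers[module]
        buf.append(token)
        if len(buf) >= self.launch_size:
            self._launch(module)

    def flush(self):
        """Launch whatever is left in each non-empty buffer."""
        for module in list(self.buffers):
            if self.buffers[module]:
                self._launch(module)

    def _launch(self, module):
        batch = self.buffers[module]
        # Stand-in for a single large GPU launch over the whole batch.
        self.launches.append((module, len(batch)))
        self.buffers[module] = []


def route(token_id, num_experts):
    """Hypothetical deterministic router: token id -> expert index."""
    return token_id % num_experts


batcher = ModuleBatcher(launch_size=4)
for token_id in range(10):
    batcher.add(f"expert_{route(token_id, 2)}", token_id)
batcher.flush()
print(batcher.launches)
```

With ten tokens split across two experts, each expert is launched once with a full batch of four and once more at flush time with the single leftover token, rather than being invoked once per incoming token.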