Optimizing LLM Inference Throughput via Memory-aware and SLA-constrained Dynamic Batching

March 07, 2025 · Declared Dead · arXiv.org

📜 CAUSE OF DEATH: Death by README
Repo has only a README

Repo contents: .gitignore, LICENSE, README.md

Authors: Bowen Pang, Kai Li, Feifan Wang
arXiv ID: 2503.05248
Category: cs.DC: Distributed Computing
Citations: 8
Venue: arXiv.org
Repository: https://github.com/KevinLee1110/dynamic-batching (⭐ 15)
Last Checked: 1 month ago
Abstract
The increasing adoption of large language models (LLMs) necessitates inference serving systems that can deliver both high throughput and low latency. Deploying LLMs with hundreds of billions of parameters on memory-constrained GPUs exposes significant limitations in static batching methods. Current inference serving systems often treat batch sizes as fixed hyper-parameters, hindering real-time adaptation to varying system conditions. In this paper, we propose a dynamic batching method that continuously monitors memory utilization and adheres to service-level agreements (SLAs) to adjust the batch size configuration in real time. The method comprises two core components: a memory-aware batch scheduler that dynamically allocates GPU resources and a latency feedback mechanism that optimizes decoding processes under SLA constraints. Numerical experiments demonstrate throughput gains of 8% to 28% and capacity improvements of 22% compared to traditional static batching methods, while maintaining full compatibility with existing inference infrastructure. These results highlight the effectiveness of dynamic batching in balancing computational efficiency and quality-of-service requirements for contemporary LLM deployment scenarios. The source code of this work is publicly available at https://github.com/KevinLee1110/dynamic-batching.
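Since the repository contains no source code, the method can't be checked against an implementation. Purely as an illustration of what the abstract describes, here is a minimal Python sketch of a scheduler that grows the batch when memory utilization and decode latency have headroom and shrinks it on memory pressure or an SLA violation. Every name, threshold, and step size below is invented for illustration; this is a guess at the shape of the technique, not the authors' algorithm.

```python
"""Hypothetical sketch of the dynamic batching idea in arXiv:2503.05248.

The linked repository ships no code, so everything here is inferred from
the abstract: a scheduler that (1) watches GPU memory utilization and
(2) uses observed decode latency as feedback against an SLA target.
All names, thresholds, and step sizes are invented.
"""

from dataclasses import dataclass


@dataclass
class SchedulerConfig:
    min_batch: int = 1
    max_batch: int = 64
    mem_high_water: float = 0.90   # shrink batches above this utilization
    mem_low_water: float = 0.70    # room to grow below this utilization
    sla_latency_s: float = 0.200   # assumed per-token decode latency target


class DynamicBatchScheduler:
    """Adjusts the batch size from memory and latency feedback each step."""

    def __init__(self, config: SchedulerConfig):
        self.config = config
        self.batch_size = config.min_batch

    def step(self, mem_utilization: float, observed_latency_s: float) -> int:
        """Return the batch size for the next decoding iteration.

        mem_utilization: fraction of GPU memory in use, as reported by
            the serving framework.
        observed_latency_s: measured per-token decode latency of the
            previous iteration.
        """
        c = self.config
        if mem_utilization > c.mem_high_water or observed_latency_s > c.sla_latency_s:
            # Back off: either KV-cache memory pressure or an SLA violation.
            self.batch_size = max(c.min_batch, self.batch_size // 2)
        elif mem_utilization < c.mem_low_water and observed_latency_s < 0.8 * c.sla_latency_s:
            # Headroom on both axes: grow additively to probe for throughput.
            self.batch_size = min(c.max_batch, self.batch_size + 1)
        return self.batch_size


if __name__ == "__main__":
    # Toy demonstration with synthetic telemetry, not real GPU measurements.
    sched = DynamicBatchScheduler(SchedulerConfig())
    for mem, lat in [(0.50, 0.10), (0.55, 0.12), (0.95, 0.25), (0.80, 0.15)]:
        print(f"mem={mem:.2f} lat={lat:.3f}s -> batch={sched.step(mem, lat)}")
```

The additive-increase/multiplicative-decrease shape is borrowed from classic congestion control and is only one plausible feedback law; whether the paper's mechanism resembles it is exactly what an empty repository leaves unanswerable.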
Community shame:
Not yet rated
Community Contributions

Found the code? Know the venue? Think something is wrong? Let us know!

📜 Similar Papers

In the same crypt: Distributed Computing

Died the same way: 📜 Death by README