Pre-train and Search: Efficient Embedding Table Sharding with Pre-trained Neural Cost Models

May 03, 2023 · Entered Twilight · 🏛 Conference on Machine Learning and Systems

💤 TWILIGHT: Eternal Rest
Repo abandoned since publication

Repo contents: .gitignore, LICENSE, README.md, collect_comm_cost_data.py, collect_compute_cost_data.py, eval.py, eval_simulator.py, neuroshard, overview.png, pretrain_search.png, setup.py, tools, train_comm_cost_model.py, train_compute_cost_model.py

Authors: Daochen Zha, Louis Feng, Liang Luo, Bhargav Bhushanam, Zirui Liu, Yusuo Hu, Jade Nie, Yuzhen Huang, Yuandong Tian, Arun Kejariwal, Xia Hu
arXiv ID: 2305.01868
Category: cs.LG: Machine Learning
Cross-listed: cs.DC, cs.IR, cs.PF
Citations: 10
Venue: Conference on Machine Learning and Systems
Repository: https://github.com/daochenzha/neuroshard ⭐ 16
Last Checked: 1 month ago
Abstract
Sharding a large machine learning model across multiple devices to balance the costs is important in distributed training. This is challenging because partitioning is NP-hard, and estimating the costs accurately and efficiently is difficult. In this work, we explore a "pre-train, and search" paradigm for efficient sharding. The idea is to pre-train a universal and once-for-all neural network to predict the costs of all the possible shards, which serves as an efficient sharding simulator. Built upon this pre-trained cost model, we then perform an online search to identify the best sharding plans given any specific sharding task. We instantiate this idea in deep learning recommendation models (DLRMs) and propose NeuroShard for embedding table sharding. NeuroShard pre-trains neural cost models on augmented tables to cover various sharding scenarios. Then it identifies the best column-wise and table-wise sharding plans with beam search and greedy grid search, respectively. Experiments show that NeuroShard significantly and consistently outperforms the state-of-the-art on the benchmark sharding dataset, achieving up to 23.8% improvement. When deployed in an ultra-large production DLRM with multi-terabyte embedding tables, NeuroShard achieves 11.6% improvement in embedding costs over the state-of-the-art, which translates to 6.6% end-to-end training throughput improvement. To facilitate future research of the "pre-train, and search" paradigm in ML for Systems, we open-source our code at https://github.com/daochenzha/neuroshard.
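
The core loop of the "pre-train, and search" paradigm can be sketched in a few lines. The snippet below is a minimal illustration, not NeuroShard's actual code: it assumes a hypothetical pre-trained cost model `predict_cost` that scores a set of tables co-located on one device, and it uses a simple greedy placement as a stand-in for the paper's greedy grid search (the column-wise beam search is omitted). The function name, feature format, and dummy cost model are all illustrative.

```python
from typing import Callable, List, Sequence


def greedy_table_wise_shard(
    tables: Sequence[Sequence[float]],                        # per-table feature vectors
    num_devices: int,
    predict_cost: Callable[[List[Sequence[float]]], float],   # assumed pre-trained cost model interface
) -> List[List[int]]:
    """Greedily place each table on the device whose predicted cost after
    placement is lowest. A stand-in for the paper's greedy grid search."""
    placement: List[List[int]] = [[] for _ in range(num_devices)]
    # Place the individually most expensive tables first so they drive the balance.
    order = sorted(range(len(tables)),
                   key=lambda i: predict_cost([tables[i]]),
                   reverse=True)
    for t in order:
        best_dev = min(
            range(num_devices),
            key=lambda d: predict_cost([tables[i] for i in placement[d]] + [tables[t]]),
        )
        placement[best_dev].append(t)
    return placement


if __name__ == "__main__":
    # Dummy stand-in for a pre-trained neural cost model: here the per-device
    # cost is just the total (hypothetical) row count of its tables.
    dummy_cost = lambda feats: sum(f[0] for f in feats)
    tables = [[100.0], [80.0], [60.0], [40.0], [20.0]]
    print(greedy_table_wise_shard(tables, num_devices=2, predict_cost=dummy_cost))
```

Because the cost model is queried rather than measured, the search can evaluate many candidate placements cheaply; swapping the dummy cost for a learned model of compute and communication cost is what makes the online search both fast and accurate in the paper's setting.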