Supporting Very Large Models using Automatic Dataflow Graph Partitioning

July 24, 2018 Β· Declared Dead Β· πŸ› European Conference on Computer Systems

πŸ‘» CAUSE OF DEATH: Ghosted
No code link whatsoever

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors Minjie Wang, Chien-chin Huang, Jinyang Li arXiv ID 1807.08887 Category cs.DC: Distributed Computing Cross-listed cs.LG Citations 169 Venue European Conference on Computer Systems Last Checked 3 months ago
Abstract
This paper presents Tofu, a system that partitions very large DNN models across multiple GPU devices to reduce per-GPU memory footprint. Tofu is designed to partition a dataflow graph of fine-grained tensor operators in order to work transparently with a general-purpose deep learning platform like MXNet. In order to automatically partition each operator, we propose to describe the semantics of an operator in a simple language which represents tensors as lambda functions mapping from tensor coordinates to values. To optimally partition different operators in a dataflow graph, Tofu uses a recursive search algorithm that minimizes the total communication cost. Our experiments on an 8-GPU machine show that Tofu enables the training of very large CNN and RNN models. It also achieves 25% - 400% speedup over alternative approaches to train very large models.
Community shame:
Not yet rated
Community Contributions

Found the code? Know the venue? Think something is wrong? Let us know!

πŸ“œ Similar Papers

In the same crypt β€” Distributed Computing

Died the same way β€” πŸ‘» Ghosted