Stream-K: Work-centric Parallel Decomposition for Dense Matrix-Matrix Multiplication on the GPU

January 09, 2023 Β· Declared Dead Β· πŸ› ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming

πŸ‘» CAUSE OF DEATH: Ghosted
No code link whatsoever

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors Muhammad Osama, Duane Merrill, Cris Cecka, Michael Garland, John D. Owens arXiv ID 2301.03598 Category cs.DS: Data Structures & Algorithms Cross-listed cs.DC Citations 37 Venue ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming Last Checked 3 months ago
Abstract
We introduce Stream-K, a work-centric parallelization of matrix multiplication (GEMM) and related computations in dense linear algebra. Whereas contemporary decompositions are primarily tile-based, our method operates by partitioning an even share of the aggregate inner loop iterations among physical processing elements. This provides a near-perfect utilization of computing resources, regardless of how efficiently the output tiling for any given problem quantizes across the underlying processing elements. On GPU processors, our Stream-K parallelization of GEMM produces a peak speedup of up to 14$\times$ and 6.7$\times$, and an average performance response that is both higher and more consistent across 32,824 GEMM problem geometries than state-of-the-art math libraries such as CUTLASS and cuBLAS. Furthermore, we achieve this performance from a single tile size configuration per floating-point precision, whereas today's math libraries employ complex kernel-selection heuristics to select from a large ensemble of kernel variants.
Community shame:
Not yet rated
Community Contributions

Found the code? Know the venue? Think something is wrong? Let us know!

πŸ“œ Similar Papers

In the same crypt β€” Data Structures & Algorithms

Died the same way β€” πŸ‘» Ghosted