Boosting Performance of Iterative Applications on GPUs: Kernel Batching with CUDA Graphs

January 16, 2025 · Declared Dead · 🏛 International Euromicro Conference on Parallel, Distributed and Network-Based Processing

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors: Jonah Ekelund, Stefano Markidis, Ivy Peng
arXiv ID: 2501.09398
Category: cs.DC (Distributed Computing)
Citations: 5
Venue: International Euromicro Conference on Parallel, Distributed and Network-Based Processing
Last Checked: 3 months ago

Abstract

Graphics Processing Units (GPUs) have become the standard for accelerating scientific applications on heterogeneous systems. However, as GPUs get faster, one potential performance bottleneck in GPU-accelerated applications is the overhead of launching many fine-grained kernels. CUDA Graph addresses this bottleneck by enabling a graph-based execution model that captures operations as nodes and dependencies as edges in a static graph, thereby consolidating several kernel launches into one graph launch. We propose a performance optimization strategy for iteratively launched kernels. By grouping kernel launches into iteration batches and then unrolling these batches into a CUDA Graph, iterative applications can benefit from CUDA Graph. We analyze the performance gain and overhead of this approach by designing a skeleton application. The skeleton application also serves as a generalized example of converting an iterative solver to CUDA Graph and as a basis for deriving a performance model. Using the skeleton application, we show that when unrolling iteration batches on a given platform, there is an optimal iteration batch size, independent of workload, that balances the extra overhead of graph creation against the performance gain of graph execution. We show that, depending on workload, the optimal iteration batch size gives more than 1.4x speed-up in the skeleton application. Furthermore, we show that similar speed-ups can be gained in Hotspot and Hotspot3D from the Rodinia benchmark suite and in a Finite-Difference Time-Domain (FDTD) Maxwell solver.
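The batching idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' skeleton application: the kernel `step`, the buffer names, and the values of `kBatch` and `kTotalIters` are hypothetical, chosen only for illustration. The sketch assumes CUDA 11.4 or later (for `cudaGraphInstantiateWithFlags`), an even batch size so that the ping-pong buffers return to their starting roles after each batch, and a total iteration count that is a multiple of the batch size. One batch of kernel launches is recorded through stream capture, instantiated once, and then replayed with a single graph launch per batch.

```cuda
#include <cstdio>
#include <cstdlib>
#include <utility>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            std::fprintf(stderr, "CUDA error: %s (line %d)\n",          \
                         cudaGetErrorString(err_), __LINE__);           \
            std::exit(EXIT_FAILURE);                                    \
        }                                                               \
    } while (0)

// Hypothetical fine-grained kernel standing in for one solver iteration.
__global__ void step(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 0.5f * in[i] + 1.0f;
}

int main() {
    const int n = 1 << 16;
    const int kTotalIters = 1000;  // assumed to be a multiple of kBatch
    const int kBatch = 50;         // illustrative value; must be even here
    const int block = 256;
    const int grid = (n + block - 1) / block;

    float *a = nullptr, *b = nullptr;
    CHECK(cudaMalloc(&a, n * sizeof(float)));
    CHECK(cudaMalloc(&b, n * sizeof(float)));
    CHECK(cudaMemset(a, 0, n * sizeof(float)));

    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));

    // Record one batch of iterations into a graph. Launches issued during
    // capture are not executed; they become nodes of the graph.
    cudaGraph_t graph;
    CHECK(cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal));
    for (int i = 0; i < kBatch; ++i) {
        step<<<grid, block, 0, stream>>>(a, b, n);
        std::swap(a, b);  // ping-pong; an even kBatch restores a and b
    }
    CHECK(cudaStreamEndCapture(stream, &graph));

    // Instantiate once (the creation overhead), then reuse for every batch.
    cudaGraphExec_t exec;
    CHECK(cudaGraphInstantiateWithFlags(&exec, graph, 0));

    // One graph launch replaces kBatch individual kernel launches.
    for (int it = 0; it < kTotalIters / kBatch; ++it) {
        CHECK(cudaGraphLaunch(exec, stream));
    }
    CHECK(cudaStreamSynchronize(stream));

    // The fixed point of x -> 0.5x + 1 is 2, so the value should be close to 2.
    float first = 0.0f;
    CHECK(cudaMemcpy(&first, a, sizeof(float), cudaMemcpyDeviceToHost));
    std::printf("a[0] after %d iterations: %f\n", kTotalIters, first);

    CHECK(cudaGraphExecDestroy(exec));
    CHECK(cudaGraphDestroy(graph));
    CHECK(cudaStreamDestroy(stream));
    CHECK(cudaFree(a));
    CHECK(cudaFree(b));
    return 0;
}
```

In this sketch the batch size controls the trade-off the abstract mentions: a larger `kBatch` means fewer graph launches but a larger graph to capture and instantiate. Timing the capture and instantiate steps separately from the launch loop, for several values of `kBatch`, is one way to explore that trade-off on a given platform.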