Loop Tiling in Large-Scale Stencil Codes at Run-time with OPS
April 03, 2017 ยท Declared Dead ยท ๐ IEEE Transactions on Parallel and Distributed Systems
"No code URL or promise found in abstract"
Evidence collected by the PWNC Scanner
Authors
Istvan Z Reguly, Gihan R Mudalige, Mike B Giles
arXiv ID
1704.00693
Category
cs.PF: Performance
Cross-listed
cs.DC,
cs.DS
Citations
48
Venue
IEEE Transactions on Parallel and Distributed Systems
Last Checked
1 month ago
Abstract
The key common bottleneck in most stencil codes is data movement, and prior research has shown that improving data locality through optimisations that schedule across loops do particularly well. However, in many large PDE applications it is not possible to apply such optimisations through compilers because there are many options, execution paths and data per grid point, many dependent on run-time parameters, and the code is distributed across different compilation units. In this paper, we adapt the data locality improving optimisation called iteration space slicing for use in large OPS applications both in shared-memory and distributed-memory systems, relying on run-time analysis and delayed execution. We evaluate our approach on a number of applications, observing speedups of 2$\times$ on the Cloverleaf 2D/3D proxy application, which contain 83/141 loops respectively, $3.5\times$ on the linear solver TeaLeaf, and $1.7\times$ on the compressible Navier-Stokes solver OpenSBLI. We demonstrate strong and weak scalability up to 4608 cores of CINECA's Marconi supercomputer. We also evaluate our algorithms on Intel's Knights Landing, demonstrating maintained throughput as the problem size grows beyond 16GB, and we do scaling studies up to 8704 cores. The approach is generally applicable to any stencil DSL that provides per loop data access information.
Community Contributions
Found the code? Know the venue? Think something is wrong? Let us know!
๐ Similar Papers
In the same crypt โ Performance
R.I.P.
๐ป
Ghosted
R.I.P.
๐ป
Ghosted
A General Formula for the Stationary Distribution of the Age of Information and Its Application to Single-Server Queues
R.I.P.
๐ป
Ghosted
AI Benchmark: All About Deep Learning on Smartphones in 2019
R.I.P.
๐ป
Ghosted
BestConfig: Tapping the Performance Potential of Systems via Automatic Configuration Tuning
R.I.P.
๐ป
Ghosted
Online normalizer calculation for softmax
R.I.P.
๐ป
Ghosted
CLTune: A Generic Auto-Tuner for OpenCL Kernels
Died the same way โ ๐ป Ghosted
R.I.P.
๐ป
Ghosted
Language Models are Few-Shot Learners
R.I.P.
๐ป
Ghosted
PyTorch: An Imperative Style, High-Performance Deep Learning Library
R.I.P.
๐ป
Ghosted
XGBoost: A Scalable Tree Boosting System
R.I.P.
๐ป
Ghosted