Scalability of 3D-DFT by block tensor-matrix multiplication on the JUWELS Cluster
March 23, 2023 · Declared Dead · J. Parallel Distributed Comput.
"No code URL or promise found in abstract"
Evidence collected by the PWNC Scanner
Authors
Nitin Malapally, Viacheslav Bolnykh, Estela Suarez, Paolo Carloni, Thomas Lippert, Davide Mandelli
arXiv ID
2303.13337
Category
physics.comp-ph
Cross-listed
cs.DC
Citations
1
Venue
J. Parallel Distributed Comput.
Last Checked
1 month ago
Abstract
The 3D Discrete Fourier Transform (DFT) is a technique used to solve problems in disparate fields. Nowadays, the commonly adopted implementation of the 3D-DFT is derived from the Fast Fourier Transform (FFT) algorithm. However, evidence indicates that the distributed-memory 3D-FFT algorithm does not scale well due to its use of all-to-all communication. Here, building on the work of Sedukhin et al. [Proceedings of the 30th International Conference on Computers and Their Applications, CATA 2015, pp. 193–200 (January 2015)], we revisit the possibility of improving the scaling of the 3D-DFT by using an alternative approach based on point-to-point communication, albeit at a higher arithmetic complexity. The new algorithm exploits tensor-matrix multiplications on a volumetrically decomposed domain via three specially adapted variants of Cannon's algorithm. It has been implemented as a C++ library called S3DFT and tested on the JUWELS Cluster at the Jülich Supercomputing Centre. Our implementation of the shared-memory tensor-matrix multiplication attained 88% of the theoretical single-node peak performance. One variant of the distributed-memory tensor-matrix multiplication shows excellent scaling, while the other two show poorer performance, which can be attributed to their intrinsic communication patterns. A comparison of S3DFT with the Intel MKL and FFTW3 libraries indicates that Intel MKL currently performs best overall, followed by FFTW3 and then S3DFT. This picture might change with further improvements of the algorithm and/or when running on clusters whose network connections have higher latency, e.g. on cloud platforms.
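The tensor-matrix formulation described in the abstract can be illustrated with a minimal NumPy sketch. This is my reconstruction of the underlying mathematics only, not the S3DFT C++ library: it applies the dense DFT matrix along each of the three axes of the volume and checks the result against an FFT. All function names here are mine.

```python
import numpy as np

def dft_matrix(n: int) -> np.ndarray:
    """Dense DFT matrix W with W[j, k] = exp(-2*pi*i*j*k / n)."""
    j, k = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    return np.exp(-2j * np.pi * j * k / n)

def dft3_by_matmul(x: np.ndarray) -> np.ndarray:
    """3D-DFT of x computed as one tensor-matrix product per axis.

    For an N^3 grid this costs O(N^4) arithmetic versus O(N^3 log N)
    for the FFT, but each contraction is a GEMM-like kernel that can be
    distributed over a volumetric decomposition with Cannon-style
    point-to-point block shifts instead of all-to-all communication.
    """
    out = x.astype(complex)
    for axis in range(3):
        n = out.shape[axis]
        # Contract this axis with the second index of W, then restore
        # the original axis order.
        out = np.moveaxis(
            np.tensordot(dft_matrix(n), out, axes=([1], [axis])), 0, axis
        )
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 8))
# The two formulations agree up to floating-point tolerance.
assert np.allclose(dft3_by_matmul(x), np.fft.fftn(x))
```

The per-axis contraction is exactly the operation that S3DFT distributes; the sketch above performs it serially on a single node.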
Similar Papers
In the same crypt – physics.comp-ph
Deep Potential Molecular Dynamics: a scalable model with the accuracy of quantum mechanics (👻 Ghosted)
Heterogeneous Parallelization and Acceleration of Molecular Dynamics Simulations in GROMACS (👻 Ghosted)
By-passing the Kohn-Sham equations with machine learning (👻 Ghosted)
Machine Learning of coarse-grained Molecular Dynamics Force Fields (👻 Ghosted)
Towards Physics-informed Deep Learning for Turbulent Flow Prediction (👻 Ghosted)
Died the same way – 👻 Ghosted
Language Models are Few-Shot Learners
PyTorch: An Imperative Style, High-Performance Deep Learning Library
XGBoost: A Scalable Tree Boosting System