Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One Sided

January 21, 2020 · Entered Twilight · 🏛 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC)

🌅 TWILIGHT: Old Age
Predates the code-sharing era; a pioneer of its time

"No code URL or promise found in abstract"
"Code repo scraped from project page (backfill)"

Evidence collected by the PWNC Scanner

Repo contents: AUTHORS, CHANGES, LICENSE, Makefile, Makefile.inc, README, c_test.c, fompi.h, fompi_fortran.c, fompi_helper.c, fompi_internal.h, fompi_op.c.m4, fompi_overloaded.c, fompi_win_allocate.c, fompi_win_attr.c, fompi_win_create.c, fompi_win_dynamic_create.c, fompi_win_fence.c, fompi_win_free.c, fompi_win_group.c, fompi_win_lock.c, fompi_win_name.c, fompi_win_pscw.c, fompi_win_rma.c, fompif.h, fortran_test_f77.f90, fortran_test_f90.f90, libtopodisc, module_fompi.f90, mpitypes.tar.bz2, test_address

Authors: Robert Gerstenberger, Maciej Besta, Torsten Hoefler
arXiv ID: 2001.07747
Category: cs.DC: Distributed Computing
Cross-listed: cs.PF
Citations: 139
Venue: 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC)
Repository: https://github.com/tim0s/foMPI ⭐ 1
Last Checked: 16 days ago
Abstract
Modern interconnects offer remote direct memory access (RDMA) features. Yet, most applications rely on explicit message passing for communication despite its unwanted overheads. The MPI-3.0 standard defines a programming interface for exploiting RDMA networks directly; however, its scalability and practicability have to be demonstrated in practice. In this work, we develop scalable bufferless protocols that implement the MPI-3.0 specification. Our protocols support scaling to millions of cores with negligible memory consumption while providing the highest performance and minimal overheads. To arm programmers, we provide a spectrum of performance models for all critical functions and demonstrate the usability of our library and models with several application studies with up to half a million processes. We show that our design is comparable to, or better than, UPC and Fortran Coarrays in terms of latency, bandwidth, and message rate. We also demonstrate application performance improvements with comparable programming complexity.
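For context, the one-sided model the paper targets works like this: each process exposes a window of its memory, and peers read or write it directly with puts and gets instead of matched send/receive pairs. Below is a minimal sketch using the standard MPI-3.0 RMA interface (the specification foMPI implements); the one-int window and ring-neighbor exchange are illustrative assumptions, not an example taken from the paper or the repository.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Expose one int per process; MPI_Win_allocate lets the library
     * place the buffer in RDMA-registered memory. */
    int *buf;
    MPI_Win win;
    MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &buf, &win);
    buf[0] = -1;

    /* Passive-target epoch: any process may access any window without
     * the target calling into MPI -- the RDMA-friendly access mode. */
    MPI_Win_lock_all(0, win);

    /* Write our rank into the right neighbor's window; the neighbor
     * posts no receive. */
    int right = (rank + 1) % size;
    MPI_Put(&rank, 1, MPI_INT, right, 0, 1, MPI_INT, win);
    MPI_Win_flush(right, win);   /* remote completion of the put */

    MPI_Win_unlock_all(win);
    MPI_Barrier(MPI_COMM_WORLD); /* ensure all puts landed before reading */

    printf("rank %d received %d\n", rank, buf[0]);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```

Assuming a hypothetical file name ring_put.c, this builds and runs with any MPI-3-capable implementation, e.g. `mpicc ring_put.c -o ring_put` followed by `mpirun -n 4 ./ring_put`.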
Community shame:
Not yet rated
Community Contributions

Found the code? Know the venue? Think something is wrong? Let us know!

📜 Similar Papers

In the same crypt: Distributed Computing