Collie: Finding Performance Anomalies in RDMA Subsystems
April 22, 2023 ยท Declared Dead ยท ๐ Symposium on Networked Systems Design and Implementation
"No code URL or promise found in abstract"
Evidence collected by the PWNC Scanner
Authors
Xinhao Kong, Yibo Zhu, Huaping Zhou, Zhuo Jiang, Jianxi Ye, Chuanxiong Guo, Danyang Zhuo
arXiv ID
2304.11467
Category
cs.NI: Networking & Internet
Citations
67
Venue
Symposium on Networked Systems Design and Implementation
Last Checked
3 months ago
Abstract
High-speed RDMA networks are getting rapidly adopted in the industry for their low latency and reduced CPU overheads. To verify that RDMA can be used in production, system administrators need to understand the set of application workloads that can potentially trigger abnormal performance behaviors (e.g., unexpected low throughput, PFC pause frame storm). We design and implement Collie, a tool for users to systematically uncover performance anomalies in RDMA subsystems without the need to access hardware internal designs. Instead of individually testing each hardware device (e.g., NIC, memory, PCIe), Collie is holistic, constructing a comprehensive search space for application workloads. Collie then uses simulated annealing to drive RDMA-related performance and diagnostic counters to extreme value regions to find workloads that can trigger performance anomalies. We evaluate Collie on combinations of various RDMA NIC, CPU, and other hardware components. Collie found 15 new performance anomalies. All of them are acknowledged by the hardware vendors. 7 of them are already fixed after we reported them. We also present our experience in using Collie to avoid performance anomalies for an RDMA RPC library and an RDMA distributed machine learning framework.
Community Contributions
Found the code? Know the venue? Think something is wrong? Let us know!
๐ Similar Papers
In the same crypt โ Networking & Internet
R.I.P.
๐ป
Ghosted
R.I.P.
๐ป
Ghosted
Federated Learning in Mobile Edge Networks: A Comprehensive Survey
R.I.P.
๐ป
Ghosted
A Survey of Indoor Localization Systems and Technologies
R.I.P.
๐ป
Ghosted
Survey of Important Issues in UAV Communication Networks
R.I.P.
๐ป
Ghosted
Network Function Virtualization: State-of-the-art and Research Challenges
R.I.P.
๐ป
Ghosted
Applications of Deep Reinforcement Learning in Communications and Networking: A Survey
Died the same way โ ๐ป Ghosted
R.I.P.
๐ป
Ghosted
Language Models are Few-Shot Learners
R.I.P.
๐ป
Ghosted
PyTorch: An Imperative Style, High-Performance Deep Learning Library
R.I.P.
๐ป
Ghosted
XGBoost: A Scalable Tree Boosting System
R.I.P.
๐ป
Ghosted