Deoxys: A Causal Inference Engine for Unhealthy Node Mitigation in Large-scale Cloud Infrastructure
October 23, 2024 ยท Declared Dead ยท ๐ ACM Symposium on Cloud Computing
"No code URL or promise found in abstract"
Evidence collected by the PWNC Scanner
Authors
Chaoyun Zhang, Randolph Yao, Si Qin, Ze Li, Shekhar Agrawal, Binit R. Mishra, Tri Tran, Minghua Ma, Qingwei Lin, Murali Chintalapati, Dongmei Zhang
arXiv ID
2410.17709
Category
eess.SY: Systems & Control (EE)
Cross-listed
cs.DC
Citations
4
Venue
ACM Symposium on Cloud Computing
Last Checked
2 months ago
Abstract
The presence of unhealthy nodes in cloud infrastructure signals the potential failure of machines, which can significantly impact the availability and reliability of cloud services, resulting in negative customer experiences. Effectively addressing unhealthy node mitigation is therefore vital for sustaining cloud system performance. This paper introduces Deoxys, a causal inference engine tailored to recommending mitigation actions for unhealthy node in cloud systems to minimize virtual machine downtime and interruptions during unhealthy events. It employs double machine learning combined with causal forest to produce precise and reliable mitigation recommendations based solely on limited observational data collected from the historical unhealthy events. To enhance the causal inference model, Deoxys further incorporates a policy fallback mechanism based on model uncertainty and action overriding mechanisms to (i) improve the reliability of the system, and (ii) strike a good tradeoff between downtime reduction and resource utilization, thereby enhancing the overall system performance. After deploying Deoxys in a large-scale cloud infrastructure at Microsoft, our observations demonstrate that Deoxys significantly reduces average VM downtime by 53% compared to a legacy policy, while leading to 49.5% lower VM interruption rate. This substantial improvement enhances the reliability and stability of cloud platforms, resulting in a seamless customer experience.
Community Contributions
Found the code? Know the venue? Think something is wrong? Let us know!
๐ Similar Papers
In the same crypt โ Systems & Control (EE)
R.I.P.
๐ป
Ghosted
R.I.P.
๐ป
Ghosted
Incremental Gradient, Subgradient, and Proximal Methods for Convex Optimization: A Survey
R.I.P.
๐ป
Ghosted
Wireless Network Design for Control Systems: A Survey
R.I.P.
๐ป
Ghosted
Learning-based Model Predictive Control for Safe Exploration
R.I.P.
๐ป
Ghosted
Safety-Critical Model Predictive Control with Discrete-Time Control Barrier Function
R.I.P.
๐ป
Ghosted
Novel Multidimensional Models of Opinion Dynamics in Social Networks
Died the same way โ ๐ป Ghosted
R.I.P.
๐ป
Ghosted
Language Models are Few-Shot Learners
R.I.P.
๐ป
Ghosted
PyTorch: An Imperative Style, High-Performance Deep Learning Library
R.I.P.
๐ป
Ghosted
XGBoost: A Scalable Tree Boosting System
R.I.P.
๐ป
Ghosted