Mining Root Cause Knowledge from Cloud Service Incident Investigations for AIOps
April 21, 2022 · Declared Dead · 2022 IEEE/ACM 44th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP)
"No code URL or promise found in abstract"
Evidence collected by the PWNC Scanner
Authors
Amrita Saha, Steven C. H. Hoi
arXiv ID
2204.11598
Category
cs.IR: Information Retrieval
Cross-listed
cs.AI
Citations
33
Venue
2022 IEEE/ACM 44th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP)
Last Checked
3 months ago
Abstract
Root Cause Analysis (RCA) of any service-disrupting incident is one of the most critical and complex tasks in IT processes, especially for cloud industry leaders like Salesforce. Typically, RCA investigation leverages data sources like application error logs or service call traces. However, a rich goldmine of root cause information is also hidden in the natural language documentation of past incident investigations by domain experts. This is generally termed Problem Review Board (PRB) data, which constitutes a core component of IT Incident Management. However, owing to the raw, unstructured nature of PRBs, such root cause knowledge is not directly reusable by manual or automated pipelines for RCA of new incidents. This motivates us to leverage this widely available data source to build an Incident Causation Analysis (ICA) engine, using state-of-the-art (SoTA) neural NLP techniques to extract targeted information and construct a structured Causal Knowledge Graph from PRB documents. ICA forms the backbone of a simple yet effective retrieval-based RCA for new incidents, through an Information Retrieval system that searches and ranks past incidents and detects likely root causes from them, given the incident symptom. In this work, we present ICA and the downstream Incident Search and Retrieval-based RCA pipeline, built at Salesforce, over 2K documented cloud service incident investigations collected over a few years. We also establish the effectiveness of ICA and the downstream tasks through various quantitative benchmarks, qualitative analysis, domain experts' validation, and real incident case studies after deployment.
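The retrieval step described in the abstract (rank past incidents against a new symptom, then read off their documented root causes) can be illustrated with a minimal sketch. This is not the authors' pipeline: the abstract describes neural NLP extraction and a Causal Knowledge Graph, whereas the sketch below swaps in a plain TF-IDF bag-of-words ranking. The incident records, function names, and `top_k` parameter are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

# Hypothetical toy records: (symptom text, documented root cause).
# These are illustrative only and do not come from the paper's data.
PAST_INCIDENTS = [
    ("login requests timing out for users", "database connection pool exhausted"),
    ("search results page returns errors", "search index corrupted after deploy"),
    ("email notifications delayed", "message queue backlog"),
]


def tokenize(text):
    """Lowercase whitespace tokenizer (a stand-in for real NLP preprocessing)."""
    return text.lower().split()


def build_idf(docs):
    """Smoothed inverse document frequency for every token seen in docs."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log((1 + n) / (1 + count)) + 1.0 for tok, count in df.items()}


def tfidf(tokens, idf):
    """Sparse TF-IDF vector as a dict; tokens unknown to idf are dropped."""
    tf = Counter(tokens)
    return {tok: count * idf[tok] for tok, count in tf.items() if tok in idf}


def cosine(a, b):
    """Cosine similarity between two sparse dict vectors."""
    dot = sum(weight * b.get(tok, 0.0) for tok, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_root_causes(symptom, incidents, top_k=2):
    """Return the top_k (score, root_cause) pairs for a new incident symptom."""
    docs = [tokenize(text) for text, _ in incidents]
    idf = build_idf(docs)
    query = tfidf(tokenize(symptom), idf)
    scored = [
        (cosine(query, tfidf(doc, idf)), cause)
        for doc, (_, cause) in zip(docs, incidents)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    for score, cause in rank_root_causes("users report login timing out", PAST_INCIDENTS):
        print(f"{score:.3f}  {cause}")
```

A lexical ranker like this only matches symptoms that share words with past records; a production system of the kind the abstract describes would presumably rely on learned representations and structured causal links instead.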
Similar Papers
In the same crypt: Information Retrieval
R.I.P.
Ghosted
R.I.P.
Ghosted
LightGCN: Simplifying and Powering Graph Convolution Network for Recommendation
R.I.P.
Ghosted
Graph Convolutional Neural Networks for Web-Scale Recommender Systems
Old Age
Neural Graph Collaborative Filtering
R.I.P.
Ghosted
Self-Attentive Sequential Recommendation
R.I.P.
Ghosted
DeepFM: A Factorization-Machine based Neural Network for CTR Prediction
Died the same way: Ghosted
R.I.P.
Ghosted
Language Models are Few-Shot Learners
R.I.P.
Ghosted
PyTorch: An Imperative Style, High-Performance Deep Learning Library
R.I.P.
Ghosted
XGBoost: A Scalable Tree Boosting System
R.I.P.
Ghosted