Autonomous Monitors for Detecting Failures Early and Reporting Interpretable Alerts in Cloud Operations
(2024) 46th International Conference on Software Engineering, ICSE 2024
- Abstract
- Detecting failures early in cloud-based software systems is highly significant as it can reduce operational costs, enhance service reliability, and improve user experience. Many existing approaches include anomaly detection in metrics or a blend of metric and log features. However, such approaches tend to be very complex and hardly explainable, and consequently non-trivial for implementation and evaluation in industrial contexts. In collaboration with a case company and their cloud-based system in the domain of PIM (Product Information Management), we propose and implement autonomous monitors for proactive monitoring across multiple services of distributed software architecture, fused with anomaly detection in performance metrics and log analysis using GPT-3. We demonstrated that operations engineers tend to be more efficient by having access to interpretable alert notifications based on detected anomalies that contain information about implications and potential root causes. Additionally, proposed autonomous monitors turned out to be beneficial for the timely identification and revision of potential issues before they propagate and cause severe consequences.
Please use this url to cite or link to this publication:
https://lup.lub.lu.se/record/4e3a4332-ab61-45bb-aa7f-016634d7520b
- author
- Hrusto, Adha LU ; Runeson, Per LU and Ohlsson, Magnus C LU
- publishing date
- 2024
- type
- Chapter in Book/Report/Conference proceeding
- publication status
- published
- host publication
- 46th International Conference on Software Engineering: Software Engineering in Practice
- pages
- 11 pages
- publisher
- Association for Computing Machinery (ACM)
- conference name
- 46th International Conference on Software Engineering, ICSE 2024
- conference location
- Lisbon, Portugal
- conference dates
- 2024-04-14 - 2024-04-20
- ISBN
- 979-8-4007-0501-4/24/04
- DOI
- 10.1145/3639477.3639712
- project
- Continuous system testing using autonomous monitors
- language
- English
- LU publication?
- yes
- id
- 4e3a4332-ab61-45bb-aa7f-016634d7520b
- date added to LUP
- 2024-05-28 09:10:01
- date last changed
- 2024-06-24 10:31:49
@inproceedings{4e3a4332-ab61-45bb-aa7f-016634d7520b,
  abstract     = {{Detecting failures early in cloud-based software systems is highly significant as it can reduce operational costs, enhance service reliability, and improve user experience. Many existing approaches include anomaly detection in metrics or a blend of metric and log features. However, such approaches tend to be very complex and hardly explainable, and consequently non-trivial for implementation and evaluation in industrial contexts. In collaboration with a case company and their cloud-based system in the domain of PIM (Product Information Management), we propose and implement autonomous monitors for proactive monitoring across multiple services of distributed software architecture, fused with anomaly detection in performance metrics and log analysis using GPT-3. We demonstrated that operations engineers tend to be more efficient by having access to interpretable alert notifications based on detected anomalies that contain information about implications and potential root causes. Additionally, proposed autonomous monitors turned out to be beneficial for the timely identification and revision of potential issues before they propagate and cause severe consequences.}},
  author       = {{Hrusto, Adha and Runeson, Per and Ohlsson, Magnus C}},
  booktitle    = {{46th International Conference on Software Engineering: Software Engineering in Practice}},
  isbn         = {{979-8-4007-0501-4/24/04}},
  language     = {{eng}},
  publisher    = {{Association for Computing Machinery (ACM)}},
  title        = {{Autonomous Monitors for Detecting Failures Early and Reporting Interpretable Alerts in Cloud Operations}},
  url          = {{https://lup.lub.lu.se/search/files/187867263/ICSE_2024_SEIP_Monitoring_and_Anomaly_Detection.pdf}},
  doi          = {{10.1145/3639477.3639712}},
  year         = {{2024}},
}