Lund University Publications

Autonomous Monitors for Detecting Failures Early and Reporting Interpretable Alerts in Cloud Operations

Hrusto, Adha; Runeson, Per and Ohlsson, Magnus C (2024) 46th International Conference on Software Engineering, ICSE 2024
Abstract
Detecting failures early in cloud-based software systems is highly significant as it can reduce operational costs, enhance service reliability, and improve user experience. Many existing approaches include anomaly detection in metrics or a blend of metric and log features. However, such approaches tend to be very complex and hardly explainable, and consequently non-trivial for implementation and evaluation in industrial contexts. In collaboration with a case company and their cloud-based system in the domain of PIM (Product Information Management), we propose and implement autonomous monitors for proactive monitoring across multiple services of distributed software architecture, fused with anomaly detection in performance metrics and log analysis using GPT-3. We demonstrated that operations engineers tend to be more efficient by having access to interpretable alert notifications based on detected anomalies that contain information about implications and potential root causes. Additionally, proposed autonomous monitors turned out to be beneficial for the timely identification and revision of potential issues before they propagate and cause severe consequences.
type: Chapter in Book/Report/Conference proceeding
publication status: published
host publication: 46th International Conference on Software Engineering: Software Engineering in Practice
pages: 11 pages
publisher: Association for Computing Machinery (ACM)
conference name: 46th International Conference on Software Engineering, ICSE 2024
conference location: Lisbon, Portugal
conference dates: 2024-04-14 - 2024-04-20
ISBN: 979-8-4007-0501-4
DOI: 10.1145/3639477.3639712
project: Continuous system testing using autonomous monitors
language: English
LU publication: yes
id: 4e3a4332-ab61-45bb-aa7f-016634d7520b
date added to LUP: 2024-05-28 09:10:01
date last changed: 2024-06-24 10:31:49

@inproceedings{4e3a4332-ab61-45bb-aa7f-016634d7520b,
  abstract     = {{Detecting failures early in cloud-based software systems is highly significant as it can reduce operational costs, enhance service reliability, and improve user experience. Many existing approaches include anomaly detection in metrics or a blend of metric and log features. However, such approaches tend to be very complex and hardly explainable, and consequently non-trivial for implementation and evaluation in industrial contexts. In collaboration with a case company and their cloud-based system in the domain of PIM (Product Information Management), we propose and implement autonomous monitors for proactive monitoring across multiple services of distributed software architecture, fused with anomaly detection in performance metrics and log analysis using GPT-3. We demonstrated that operations engineers tend to be more efficient by having access to interpretable alert notifications based on detected anomalies that contain information about implications and potential root causes. Additionally, proposed autonomous monitors turned out to be beneficial for the timely identification and revision of potential issues before they propagate and cause severe consequences.}},
  author       = {{Hrusto, Adha and Runeson, Per and Ohlsson, Magnus C}},
  booktitle    = {{46th International Conference on Software Engineering: Software Engineering in Practice}},
  isbn         = {{979-8-4007-0501-4}},
  language     = {{eng}},
  publisher    = {{Association for Computing Machinery (ACM)}},
  title        = {{Autonomous Monitors for Detecting Failures Early and Reporting Interpretable Alerts in Cloud Operations}},
  url          = {{https://lup.lub.lu.se/search/files/187867263/ICSE_2024_SEIP_Monitoring_and_Anomaly_Detection.pdf}},
  doi          = {{10.1145/3639477.3639712}},
  year         = {{2024}},
}