
LUP Student Papers

LUND UNIVERSITY LIBRARIES

Change-Aware Predictive Test Selection and Prioritization

Frid, Josefine LU and Ramne, Emma (2025). In: Master's Thesis in Mathematical Sciences, FMSM01 20251.
Mathematical Statistics
Abstract
Test Case Prioritization and Selection (TCPS) methods aim to order or select tests to minimize test execution time without significantly reducing test coverage. In collaboration with a System Analysis Team at Arm, this project demonstrated machine learning-driven prioritization in an industrial Continuous Integration (CI) pipeline, using predictive models to optimize test execution. The team runs a comprehensive nightly test set (N) to detect faulty code introduced during the day, but also executes a smaller subset of these tests (C ⊂ N) before any code change is added to the code base, aiming to catch faults as early as possible. We added a new test set (E ⊂ N \ C) which ran concurrently with C for a set time period and improved the total fault-finding rate by 600%. During this period, data about test outcomes, flaky tests and characteristics of code changes were collected. This dataset was then used to train machine learning models to predict the optimal test order of the combined tests in C ∪ E, given a certain code change, in order to catch all faults in minimal time. The top predicted tests were then selected to run on each change. In our Arm case study, the best performing model was a listwise ranking model, which could find >97% of all faulty code changes, 6.8 times as many as C, while saving on average >90% of the execution time needed by C. This approach makes no assumptions regarding software type, programming language, or technology stack, relying solely on features readily extracted from CI and version control systems, thus being widely applicable both within and outside Arm.
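The abstract's core pipeline (score each test in C ∪ E for a given code change, then select the top-ranked tests to run) can be illustrated with a minimal sketch. All names, fields, and the scoring rule below are illustrative assumptions, not the thesis's actual model; a real system would replace the hand-set failure probabilities with predictions from a trained ranking model.

```python
from dataclasses import dataclass

@dataclass
class TestCase:
    name: str
    fail_prob: float   # hypothetical model-predicted failure probability for this change
    exec_time: float   # last observed execution time in seconds

def prioritize_and_select(tests: list[TestCase], budget: float) -> list[TestCase]:
    """Rank tests by predicted relevance (here: failure likelihood per unit time),
    then greedily select a prefix that fits the execution-time budget.
    This is a sketch of the select-top-predicted-tests idea, not Arm's system."""
    ranked = sorted(tests, key=lambda t: t.fail_prob / t.exec_time, reverse=True)
    selected, spent = [], 0.0
    for t in ranked:
        if spent + t.exec_time <= budget:
            selected.append(t)
            spent += t.exec_time
    return selected

# Toy data: four tests with made-up predictions for one code change.
tests = [
    TestCase("t_boot", 0.02, 30.0),
    TestCase("t_mmu", 0.40, 10.0),
    TestCase("t_cache", 0.25, 5.0),
    TestCase("t_irq", 0.05, 60.0),
]
chosen = prioritize_and_select(tests, budget=20.0)
print([t.name for t in chosen])  # → ['t_cache', 't_mmu']
```

Dividing failure probability by execution time favors cheap tests that are likely to fail, which is one simple way to approximate the "catch all faults in minimal time" objective.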
Popular Abstract (translated from Swedish)
Developers' code changes are tested continuously to avoid introducing new bugs. As code bases and teams grow, both the number of tests and the testing frequency increase sharply, leading to an exponential growth in testing effort. The consequence is both slow processes and high costs. To speed up code testing without letting bugs slip through, we trained a number of machine learning models to predict which tests are most relevant to run for a given code change. These models take in information about both the tests being run and the code change in order to make their assessments. Our goal was to use only information that can be obtained for any code base, so that the trained models can hopefully be easily reproduced for other projects. We trained 8 different models for this purpose. 4 models were trained to predict whether a test would fail, and this information was then used to rank the tests (1) by predicted outcome (failures above passes) and (2) in ascending order of each test's latest execution time. The other 4 were trained to predict a test's relevance directly. A test's relevance was judged both by whether it failed and by how long its runtime was. We found that the models trained on relevance gave a better overall result. Our best model could rank tests well enough to save >90% of the original execution time while still finding >97% of all faulty code changes. Putting this model into production can lead to faster feedback for developers about code bugs, greater precision in code testing, and lower operating costs.
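The two-step ranking rule described for the first 4 models (predicted failures first, ties broken by ascending latest execution time) amounts to a lexicographic sort. The sketch below assumes a simple dictionary representation of tests; the field names are illustrative, not taken from the thesis.

```python
def rank_by_outcome_then_time(tests: list[dict]) -> list[dict]:
    """Order tests as in strategy (1)-(2): predicted failures before predicted
    passes, and within each group, shortest latest execution time first.
    `not predicted_fail` maps failures to False (0), sorting them first."""
    return sorted(tests, key=lambda t: (not t["predicted_fail"], t["last_exec_time"]))

tests = [
    {"name": "a", "predicted_fail": False, "last_exec_time": 3.0},
    {"name": "b", "predicted_fail": True,  "last_exec_time": 12.0},
    {"name": "c", "predicted_fail": True,  "last_exec_time": 4.0},
    {"name": "d", "predicted_fail": False, "last_exec_time": 1.0},
]
order = [t["name"] for t in rank_by_outcome_then_time(tests)]
print(order)  # → ['c', 'b', 'd', 'a']
```

Python's tuple keys make this kind of two-level ordering a one-liner, and the sort is stable, so tests with identical keys keep their original relative order.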
author: Frid, Josefine LU and Ramne, Emma
alternative title: Ändringsmedveten prediktiv selektion och prioritering av tester
course: FMSM01 20251
year: 2025
type: H2 - Master's Degree (Two Years)
keywords: Machine Learning, Test Prioritization, Test Selection, Continuous Integration, Ranking Models
publication/series: Master's Thesis in Mathematical Sciences
report number: LUTFMS-3516-2025
ISSN: 1404-6342
other publication id: 2025:E27
language: English
id: 9195103
date added to LUP: 2025-06-09 16:57:22
date last changed: 2025-06-09 16:57:22
@misc{9195103,
  abstract     = {{Test Case Prioritization and Selection (TCPS) methods aim to order or select tests to minimize test execution time without significantly reducing test coverage. In collaboration with a System Analysis Team at Arm, this project demonstrated machine learning-driven prioritization in an industrial Continuous Integration (CI) pipeline, using predictive models to optimize test execution. The team runs a comprehensive nightly test set (N) to detect faulty code introduced during the day, but also executes a smaller subset of these tests (C ⊂ N) before any code change is added to the code base, aiming to catch faults as early as possible. We added a new test set (E ⊂ N \ C) which ran concurrently with C for a set time period and improved the total fault-finding rate by 600%. During this period, data about test outcomes, flaky tests and characteristics of code changes were collected. This dataset was then used to train machine learning models to predict the optimal test order of the combined tests in C ∪ E, given a certain code change, in order to catch all faults in minimal time. The top predicted tests were then selected to run on each change. In our Arm case study, the best performing model was a listwise ranking model, which could find >97% of all faulty code changes, 6.8 times as many as C, while saving on average >90% of the execution time needed by C. This approach makes no assumptions regarding software type, programming language, or technology stack, relying solely on features readily extracted from CI and version control systems, thus being widely applicable both within and outside Arm.}},
  author       = {{Frid, Josefine and Ramne, Emma}},
  issn         = {{1404-6342}},
  language     = {{eng}},
  note         = {{Student Paper}},
  series       = {{Master's Thesis in Mathematical Sciences}},
  title        = {{Change-Aware Predictive Test Selection and Prioritization}},
  year         = {{2025}},
}