
LUP Student Papers

LUND UNIVERSITY LIBRARIES

Change-Aware Predictive Test Selection and Prioritization

Frid, Josefine LU and Ramne, Emma (2025). In: Master's Thesis in Mathematical Sciences, FMSM01 20251.
Mathematical Statistics
Abstract
Test Case Prioritization and Selection (TCPS) methods aim to order or select tests to minimize test execution time without significantly reducing test coverage. In collaboration with a System Analysis Team at Arm, this project demonstrated machine learning-driven prioritization in an industrial Continuous Integration (CI) pipeline, using predictive models to optimize test execution. The team runs a comprehensive nightly test set (N) to detect faulty code introduced during the day, but also executes a smaller subset of these tests (C ⊂ N) before any code change is added to the code base, aiming to catch faults as early as possible. We added a new test set (E ⊂ N \ C) which ran concurrently with C for a set time period and improved the total fault-finding rate by 600%. During this period, data about test outcomes, flaky tests and characteristics of code changes were collected. This dataset was then used to train machine learning models to predict the optimal test order of the combined tests in C ∪ E, given a certain code change, in order to catch all faults in minimal time. The top predicted tests were then selected to run on each change. In our Arm case study, the best performing model was a listwise ranking model, which could find >97% of all faulty code changes, 6.8 times as many as C, while saving on average >90% of the execution time needed by C. This approach makes no assumptions regarding software type, programming language, or technology stack, relying solely on features readily extracted from CI and version control systems, thus being widely applicable both within and outside Arm.
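The abstract's core pipeline (score each test in C ∪ E for a given code change, then select the top-ranked tests to run) can be illustrated with a minimal sketch. All names, fields, and the scoring rule below are illustrative assumptions, not the thesis's actual model; a real system would replace the hand-set failure probabilities with predictions from a trained ranking model.

```python
from dataclasses import dataclass

@dataclass
class TestCase:
    name: str
    fail_prob: float   # hypothetical model-predicted failure probability for this change
    exec_time: float   # last observed execution time in seconds

def prioritize_and_select(tests: list[TestCase], budget: float) -> list[TestCase]:
    """Rank tests by predicted relevance (here: failure likelihood per unit time),
    then greedily select a prefix that fits the execution-time budget.
    This is a sketch of the select-top-predicted-tests idea, not Arm's system."""
    ranked = sorted(tests, key=lambda t: t.fail_prob / t.exec_time, reverse=True)
    selected, spent = [], 0.0
    for t in ranked:
        if spent + t.exec_time <= budget:
            selected.append(t)
            spent += t.exec_time
    return selected

# Toy data: four tests with made-up predictions for one code change.
tests = [
    TestCase("t_boot", 0.02, 30.0),
    TestCase("t_mmu", 0.40, 10.0),
    TestCase("t_cache", 0.25, 5.0),
    TestCase("t_irq", 0.05, 60.0),
]
chosen = prioritize_and_select(tests, budget=20.0)
print([t.name for t in chosen])  # → ['t_cache', 't_mmu']
```

Dividing failure probability by execution time favors cheap tests that are likely to fail, which is one simple way to approximate the "catch all faults in minimal time" objective.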
Popular Abstract (translated from Swedish)
Developers' code changes are tested continuously to avoid introducing new bugs. As code bases and teams grow, both the number of tests and the testing frequency increase sharply, leading to an exponential growth in testing effort. The consequence is both slow processes and high costs. To speed up code testing without letting bugs slip through, we trained a number of machine learning models to predict which tests are most relevant to run for a given code change. These models take in information about both the tests being run and the code change in order to make their assessments. Our goal was to use only information that can be obtained for any code base, so that the trained models can hopefully be easily reproduced for other projects. We trained 8 different models for this purpose. 4 models were trained to predict whether a test would fail, and this information was then used to rank the tests (1) by predicted outcome (failures above passes) and (2) in ascending order of each test's latest execution time. The other 4 were trained to predict a test's relevance directly. A test's relevance was judged both by whether it failed and by how long its runtime was. We found that the models trained on relevance gave a better overall result. Our best model could rank tests well enough to save >90% of the original execution time while still finding >97% of all faulty code changes. Putting this model into production can lead to faster feedback for developers about code bugs, greater precision in code testing, and lower operating costs.
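The two-step ranking rule described for the first 4 models (predicted failures first, ties broken by ascending latest execution time) amounts to a lexicographic sort. The sketch below assumes a simple dictionary representation of tests; the field names are illustrative, not taken from the thesis.

```python
def rank_by_outcome_then_time(tests: list[dict]) -> list[dict]:
    """Order tests as in strategy (1)-(2): predicted failures before predicted
    passes, and within each group, shortest latest execution time first.
    `not predicted_fail` maps failures to False (0), sorting them first."""
    return sorted(tests, key=lambda t: (not t["predicted_fail"], t["last_exec_time"]))

tests = [
    {"name": "a", "predicted_fail": False, "last_exec_time": 3.0},
    {"name": "b", "predicted_fail": True,  "last_exec_time": 12.0},
    {"name": "c", "predicted_fail": True,  "last_exec_time": 4.0},
    {"name": "d", "predicted_fail": False, "last_exec_time": 1.0},
]
order = [t["name"] for t in rank_by_outcome_then_time(tests)]
print(order)  # → ['c', 'b', 'd', 'a']
```

Python's tuple keys make this kind of two-level ordering a one-liner, and the sort is stable, so tests with identical keys keep their original relative order.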
author: Frid, Josefine LU and Ramne, Emma
alternative title: Ändringsmedveten prediktiv selektion och prioritering av tester
course: FMSM01 20251
year: 2025
type: H2 - Master's Degree (Two Years)
keywords: Machine Learning, Test Prioritization, Test Selection, Continuous Integration, Ranking Models
publication/series: Master's Thesis in Mathematical Sciences
report number: LUTFMS-3516-2025
ISSN: 1404-6342
other publication id: 2025:E27
language: English
id: 9195103
date added to LUP: 2025-06-09 16:57:22
date last changed: 2025-06-09 16:57:22
@misc{9195103,
  abstract     = {{Test Case Prioritization and Selection (TCPS) methods aim to order or select tests to minimize test execution time without significantly reducing test coverage. In collaboration with a System Analysis Team at Arm, this project demonstrated machine learning-driven prioritization in an industrial Continuous Integration (CI) pipeline, using predictive models to optimize test execution. The team runs a comprehensive nightly test set (N) to detect faulty code introduced during the day, but also executes a smaller subset of these tests (C ⊂ N) before any code change is added to the code base, aiming to catch faults as early as possible. We added a new test set (E ⊂ N \ C) which ran concurrently with C for a set time period and improved the total fault-finding rate by 600%. During this period, data about test outcomes, flaky tests and characteristics of code changes were collected. This dataset was then used to train machine learning models to predict the optimal test order of the combined tests in C ∪ E, given a certain code change, in order to catch all faults in minimal time. The top predicted tests were then selected to run on each change. In our Arm case study, the best performing model was a listwise ranking model, which could find >97% of all faulty code changes, 6.8 times as many as C, while saving on average >90% of the execution time needed by C. This approach makes no assumptions regarding software type, programming language, or technology stack, relying solely on features readily extracted from CI and version control systems, thus being widely applicable both within and outside Arm.}},
  author       = {{Frid, Josefine and Ramne, Emma}},
  issn         = {{1404-6342}},
  language     = {{eng}},
  note         = {{Student Paper}},
  series       = {{Master's Thesis in Mathematical Sciences}},
  title        = {{Change-Aware Predictive Test Selection and Prioritization}},
  year         = {{2025}},
}