Assessing the robustness of AI-generated lesion risk scores acquired under various imaging conditions
(2025) MSFT02 20252
Medical Physics Programme
- Abstract
- Background and Aim: Artificial intelligence (AI) in mammography screening can aid cancer detection. Commercially available AI systems can assign region and exam scores based on suspicion of malignancy. Despite promising results, clinical implementation is limited, partly due to a lack of trust among radiologists and a lack of prospective evaluation. This raises the question of how an AI system intended for use in mammography screening can be validated.
This study aims to assess the robustness of an AI system’s response to various image acquisition conditions using an anthropomorphic breast phantom. The precision of the obtained risk scores and possible relationships between exposure parameters and risk scores are investigated.
Material and Methods: Digital mammography (DM) and digital breast tomosynthesis (DBT) images of a breast phantom containing one spiculated mass were acquired using a Siemens MAMMOMAT Inspiration system. Exposure parameters such as tube voltage (kV) and tube loading (mAs) were varied relative to those obtained using automatic exposure control (AEC). For DM, this was tested using the anode/filter combinations W/Rh, Mo/Mo and Mo/Rh; for DBT, W/Rh was used. Five repeated exposures were made for each combination of kV and mAs. AEC mode was used while varying other settings, including phantom position, compression plate release, exclusion of the highly attenuating “chest wall”, and combinations thereof. Most setups included 20 repeated exposures. The organ dose was extracted from the DICOM header and used as a substitute for average glandular dose (AGD). Images were analyzed by an AI system to obtain region and exam scores. The number of cases in which the AI system presented a region score for the lesion (lesion risk score) was recorded. Linear regression analysis was used to assess possible associations between kV or mAs and lesion risk scores. Mean lesion risk scores from the AEC data sets were compared pairwise.
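As a rough illustration of how repeated exposures of one setup could be summarized, the sketch below computes the share of images that received a lesion risk score together with the spread of those scores. It is not taken from the thesis: the function name `summarize_setup`, the use of `None` for images without a lesion score, and the example values are all assumptions made for illustration.

```python
from statistics import mean, stdev


def summarize_setup(scores):
    """Summarize repeated exposures of one acquisition setup.

    `scores` holds one entry per exposure: the lesion risk score, or
    None when no region score was given for the lesion (an assumed
    encoding, not one described in the source).
    """
    if not scores:
        raise ValueError("at least one exposure is required")
    found = [s for s in scores if s is not None]
    return {
        "n_exposures": len(scores),
        # Fraction of images in which the lesion received a score.
        "detection_rate": len(found) / len(scores),
        "mean": mean(found) if found else None,
        # Sample standard deviation needs at least two scores.
        "sd": stdev(found) if len(found) > 1 else None,
        "range": (min(found), max(found)) if found else None,
    }


# Hypothetical scores for five repeats of one kV/mAs combination.
example = [62, 55, None, 71, 48]
summary = summarize_setup(example)
print(summary)
```

A wide `range` or large `sd` for nominally identical exposures is the kind of spread the study refers to as limited precision.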
Results: The AI system provided lesion risk scores for all images acquired using AEC mode. When the exposure parameters were varied for DM, scores were given in 93%, 94% and 92% of the images for W/Rh, Mo/Mo and Mo/Rh, respectively. Generally, a wide range of risk scores was reported within each DM data set. Precision was better for DBT and the risk scores were higher, resulting in a significant difference in mean lesion risk scores between DM and DBT (-25.1, 95% CI (-28.9, -21.3), p < 0.001). No other significant differences between AEC data sets were found for DM. Moving the phantom (as opposed to centering it) in DBT gave a significant difference in mean lesion risk scores (3.18, 95% CI (0.06, 6.30), p = 0.042).
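The pairwise comparisons above are reported as a difference in means with a 95% confidence interval. The abstract does not state which test was used, so the following is only a sketch under stated assumptions: it uses an unequal-variance standard error with a normal quantile, and the function name `mean_difference_ci` and the data are hypothetical. For small samples a t-quantile would give a somewhat wider interval.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def mean_difference_ci(a, b, level=0.95):
    """Return mean(a) - mean(b) and an approximate confidence interval.

    Assumption: normal approximation with an unequal-variance standard
    error; this is not necessarily the method used in the thesis.
    """
    diff = mean(a) - mean(b)
    # Standard error of the difference between two independent means.
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    # Two-sided quantile, e.g. about 1.96 for a 95% interval.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return diff, (diff - z * se, diff + z * se)


# Hypothetical lesion risk scores from two AEC data sets.
dm_scores = [40, 55, 62, 48, 70]
dbt_scores = [78, 80, 76, 82, 79]

diff, (low, high) = mean_difference_ci(dm_scores, dbt_scores)
print(f"difference: {diff:.1f}, 95% CI ({low:.1f}, {high:.1f})")
```

An interval that excludes zero, as in the DM versus DBT comparison reported above, corresponds to a significant difference at the chosen level.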
Weak to somewhat strong significant linear associations between each exposure parameter and lesion risk scores were found in most DM imaging conditions, varying by anode/filter combination. DBT, however, showed consistent moderate to strong significant positive linear relationships between kV and lesion risk scores and between mAs and lesion risk scores.
Conclusion: The unexpectedly wide range of lesion risk scores within data sets could be a sign of deficient precision of AI systems, and the possible reason for this needs to be further investigated.
- Popular Abstract (Swedish)
- Breast cancer is the most common cancer diagnosis among women. In Sweden, all women aged 40-74 are offered regular examinations (mammography screening). Detecting breast cancer at an early stage has been shown to be highly beneficial for effective treatment. At the examination, X-ray images of the breasts are taken, so-called digital mammography (DM) images. At present, all mammography images are double-read, meaning that two physicians review the images and jointly decide whether the woman should be recalled for further examination or can be cleared. At recall, digital breast tomosynthesis (DBT) images, a form of three-dimensional imaging, are usually acquired instead. An unavoidable drawback of screening is that some individuals will be recalled although they do not have breast cancer, so-called false-positive recalls. This is not only resource-intensive but, above all, can be psychologically stressful for the individual.
To facilitate the review process, and in the hope of reducing the number of false-positive recalls, the possibility of using artificial intelligence (AI) has been investigated in several studies. However, there is some uncertainty about how the quality of an AI program can be assured, and knowledge of how robust AI results are is limited. The aim of this work is to investigate how an AI program's risk assessments are affected by variations in image quality, such as sharpness and contrast, as well as, among other things, displacement of the breast during image acquisition.
The study used an AI program that searches for abnormal structures in mammography images and assigns risk scores to them. If a finding is detected, it is given a score corresponding to the level of suspicion that the finding is breast cancer. Based on all findings in an image, the AI program gives a summary score corresponding to the probability that the woman has breast cancer. DM and DBT images of an object designed to mimic a breast containing a tumor, a so-called breast phantom, were acquired while the image quality was varied. These images were then analyzed by the AI program to obtain risk scores.
The results showed an unexpectedly large spread in risk scores for repeated DM acquisitions, which could indicate low precision of the AI program. For the DBT images, the risk scores were generally higher and more stable. Moving the breast phantom sideways affected the risk scores in the DBT images but not in the DM images. Both sharpness and contrast appeared to influence the risk scores to varying degrees, particularly in the DBT images. This may indicate that image quality has some effect on risk scores.
The unexpected spread in risk scores, which appears to occur in cases where it should not, needs further investigation to establish whether the AI program delivers results with limited precision. Further studies are also needed to determine whether this could affect the review process.
Please use this url to cite or link to this publication:
http://lup.lub.lu.se/student-papers/record/9207787
- author
- Alström, Lina
- supervisor
- organization
- course
- MSFT02 20252
- year
- 2025
- type
- H2 - Master's Degree (Two Years)
- subject
- language
- English
- id
- 9207787
- date added to LUP
- 2025-07-02 10:16:12
- date last changed
- 2025-07-02 10:16:12
@misc{9207787,
  abstract = {{Background and Aim: Artificial intelligence (AI) in mammography screening can aid cancer detection. Commercially available AI systems can assign region and exam scores based on suspicion of malignancy. Despite promising results, clinical implementation is limited, partly due to a lack of trust among radiologists and a lack of prospective evaluation. This raises the question of how an AI system intended for use in mammography screening can be validated. This study aims to assess the robustness of an AI system’s response to various image acquisition conditions using an anthropomorphic breast phantom. The precision of the obtained risk scores and possible relationships between exposure parameters and risk scores are investigated.
Material and Methods: Digital mammography (DM) and digital breast tomosynthesis (DBT) images of a breast phantom containing one spiculated mass were acquired using a Siemens MAMMOMAT Inspiration system. Exposure parameters such as tube voltage (kV) and tube loading (mAs) were varied relative to those obtained using automatic exposure control (AEC). For DM, this was tested using the anode/filter combinations W/Rh, Mo/Mo and Mo/Rh; for DBT, W/Rh was used. Five repeated exposures were made for each combination of kV and mAs. AEC mode was used while varying other settings, including phantom position, compression plate release, exclusion of the highly attenuating “chest wall”, and combinations thereof. Most setups included 20 repeated exposures. The organ dose was extracted from the DICOM header and used as a substitute for average glandular dose (AGD). Images were analyzed by an AI system to obtain region and exam scores. The number of cases in which the AI system presented a region score for the lesion (lesion risk score) was recorded. Linear regression analysis was used to assess possible associations between kV or mAs and lesion risk scores. Mean lesion risk scores from the AEC data sets were compared pairwise.
Results: The AI system provided lesion risk scores for all images acquired using AEC mode. When the exposure parameters were varied for DM, scores were given in 93%, 94% and 92% of the images for W/Rh, Mo/Mo and Mo/Rh, respectively. Generally, a wide range of risk scores was reported within each DM data set. Precision was better for DBT and the risk scores were higher, resulting in a significant difference in mean lesion risk scores between DM and DBT (-25.1, 95% CI (-28.9, -21.3), p < 0.001). No other significant differences between AEC data sets were found for DM. Moving the phantom (as opposed to centering it) in DBT gave a significant difference in mean lesion risk scores (3.18, 95% CI (0.06, 6.30), p = 0.042). Weak to somewhat strong significant linear associations between each exposure parameter and lesion risk scores were found in most DM imaging conditions, varying by anode/filter combination. DBT, however, showed consistent moderate to strong significant positive linear relationships between kV and lesion risk scores and between mAs and lesion risk scores. Conclusion: The unexpectedly wide range of lesion risk scores within data sets could be a sign of deficient precision of AI systems, and the possible reason for this needs to be further investigated.}},
  author = {{Alström, Lina}},
  language = {{eng}},
  note = {{Student Paper}},
  title = {{Assessing the robustness of AI-generated lesion risk scores acquired under various imaging conditions}},
  year = {{2025}},
}