Lund University Publications

Use of data mining and artificial intelligence to derive public health evidence from large datasets

Fitipaldi, Hugo LU (2023) In Lund University, Faculty of Medicine Doctoral Dissertation Series
Abstract
This thesis explores the use of data mining and AI-tailored frameworks for extracting public health evidence from large health datasets. The research presented in this thesis demonstrates the potential of these tools for automating and simplifying the data mining process, and for providing valuable insights into various public health issues.
In Paper I, we used data mining and natural language processing to analyze the characteristics of genomic research on non-communicable diseases (NCDs) from the GWAS Catalog (2005 to 2022). We found that the majority of research institutions leading the work are often US-based and the majority of first, senior and all authors were male. The vast majority of complex trait GWAS has been performed in European ancestry populations, with cohorts and scientists predominantly located in medium-to-high socioeconomically ranked countries. This lack of diversity in both the data and the authorship of GWAS research has potential implications for the generalizability of genetic discoveries and the development of future interventions.
In Paper II, we analyzed data collected through the app-based COVID Symptom Study in Sweden. We then created a symptom-based model to estimate the individual probability of symptomatic COVID-19 and employed this to estimate daily regional COVID-19 prevalence. We also used this data to predict next week COVID-19 hospital admissions and compared it to a model based on case notifications. We found that the symptom-based model had a lower median absolute percentage error during the first wave of the pandemic and that the model was transferable to an English dataset. The findings of this study demonstrate the feasibility of large-scale syndromic surveillance and the potential for population-based participatory surveillance initiatives in future pandemics and epidemics.
In Paper III, we used data from over 500,000 participants in the COVID Symptom Study to investigate the impact of obesity and diabetes on the symptoms and duration of long-COVID. Using advanced data mining techniques, we found that individuals with higher BMI and diabetes had a higher burden of symptoms during the initial COVID-19 infection and a prolonged duration of long-COVID symptoms. We also found that vaccination had a protective effect against both COVID-19 symptoms and long-COVID symptoms in these at-risk groups. Our results demonstrate the disproportionate impact of COVID-19 on certain populations and the utility of app-based syndromic surveillance in providing timely and accurate information on the spread and impact of the virus.
author
  • Fitipaldi, Hugo LU
opponent
  • PhD Langenberg, Claudia, MRC Epidemiology Unit, University of Cambridge School of Clinical Medicine
type
Thesis
publication status
published
keywords
artificial Intelligence, data mining, genome-wide association studies, covid-19
in
Lund University, Faculty of Medicine Doctoral Dissertation Series
issue
2023:24
pages
105 pages
publisher
Lund University, Faculty of Medicine
defense location
Agardh föreläsningssal, CRC, Jan Waldenströms gata 35, Skånes Universitetssjukhus i Malmö. Zoom: https://lu-se.zoom.us/j/68486860286
defense date
2023-03-02 13:00:00
ISSN
1652-8220
ISBN
978-91-8021-363-9
language
English
LU publication?
yes
id
9d3c2806-d0d2-4230-ab34-b37caec4b97a
date added to LUP
2023-02-02 14:17:48
date last changed
2023-02-16 08:34:13
@phdthesis{9d3c2806-d0d2-4230-ab34-b37caec4b97a,
  abstract     = {{This thesis explores the use of data mining and AI-tailored frameworks for extracting public health evidence from large health datasets. The research presented in this thesis demonstrates the potential of these tools for automating and simplifying the data mining process, and for providing valuable insights into various public health issues. In Paper I, we used data mining and natural language processing to analyze the characteristics of genomic research on non-communicable diseases (NCDs) from the GWAS Catalog (2005 to 2022). We found that the majority of research institutions leading the work are often US-based and the majority of first, senior and all authors were male. The vast majority of complex trait GWAS has been performed in European ancestry populations, with cohorts and scientists predominantly located in medium-to-high socioeconomically ranked countries. This lack of diversity in both the data and the authorship of GWAS research has potential implications for the generalizability of genetic discoveries and the development of future interventions. In Paper II, we analyzed data collected through the app-based COVID Symptom Study in Sweden. We then created a symptom-based model to estimate the individual probability of symptomatic COVID-19 and employed this to estimate daily regional COVID-19 prevalence. We also used this data to predict next week COVID-19 hospital admissions and compared it to a model based on case notifications. We found that the symptom-based model had a lower median absolute percentage error during the first wave of the pandemic and that the model was transferable to an English dataset. The findings of this study demonstrate the feasibility of large-scale syndromic surveillance and the potential for population-based participatory surveillance initiatives in future pandemics and epidemics. In Paper III, we used data from over 500,000 participants in the COVID Symptom Study to investigate the impact of obesity and diabetes on the symptoms and duration of long-COVID. Using advanced data mining techniques, we found that individuals with higher BMI and diabetes had a higher burden of symptoms during the initial COVID-19 infection and a prolonged duration of long-COVID symptoms. We also found that vaccination had a protective effect against both COVID-19 symptoms and long-COVID symptoms in these at-risk groups. Our results demonstrate the disproportionate impact of COVID-19 on certain populations and the utility of app-based syndromic surveillance in providing timely and accurate information on the spread and impact of the virus.}},
  author       = {{Fitipaldi, Hugo}},
  isbn         = {{978-91-8021-363-9}},
  issn         = {{1652-8220}},
  keywords     = {{artificial Intelligence; data mining; genome-wide association studies; covid-19}},
  language     = {{eng}},
  number       = {{2023:24}},
  publisher    = {{Lund University, Faculty of Medicine}},
  school       = {{Lund University}},
  series       = {{Lund University, Faculty of Medicine Doctoral Dissertation Series}},
  title        = {{Use of data mining and artificial intelligence to derive public health evidence from large datasets}},
  url          = {{https://lup.lub.lu.se/search/files/136522902/Thesis_HugoFitipaldi.pdf}},
  year         = {{2023}},
}