Accessible data curation and analytics for international-scale citizen science datasets
(2021) In Scientific Data 8(1).
- Abstract
The Covid Symptom Study, a smartphone-based surveillance study on COVID-19 symptoms in the population, is an exemplar of big data citizen science. As of May 23rd, 2021, over 5 million participants have collectively logged over 360 million self-assessment reports since its introduction in March 2020. The success of the Covid Symptom Study creates significant technical challenges around effective data curation. The primary issue is scale. The size of the dataset means that it can no longer be readily processed using standard Python-based data analytics software such as Pandas on commodity hardware. Alternative technologies exist but carry a higher technical complexity and are less accessible to many researchers. We present ExeTera, a Python-based open source software package designed to provide Pandas-like data analytics on datasets that approach terabyte scales. We present its design and capabilities, and show how it is a critical component of a data curation pipeline that enables reproducible research across an international research group for the Covid Symptom Study.
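- author
- Murray, Benjamin; Kerfoot, Eric; Chen, Liyuan; Deng, Jie; Graham, Mark S.; Sudre, Carole H.; Molteni, Erika; Canas, Liane S.; Antonelli, Michela; Klaser, Kerstin; Visconti, Alessia; Hammers, Alexander; Chan, Andrew T.; Franks, Paul W.; Davies, Richard; Wolf, Jonathan; Spector, Tim D.; Steves, Claire J.; Modat, Marc; Ourselin, Sebastien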
- publishing date
- 2021
- type
- Contribution to journal
- publication status
- published
- in
- Scientific Data
- volume
- 8
- issue
- 1
- article number
- 297
- publisher
- Nature Publishing Group
- external identifiers
- scopus:85119660232
- pmid:34811392
- ISSN
- 2052-4463
- DOI
- 10.1038/s41597-021-01071-x
- language
- English
- LU publication?
- yes
- id
- d652a4fe-5544-4171-b8ca-1ac8db990ece
- date added to LUP
- 2021-12-08 13:58:18
- date last changed
- 2024-12-29 18:39:11
@article{d652a4fe-5544-4171-b8ca-1ac8db990ece,
  abstract  = {The Covid Symptom Study, a smartphone-based surveillance study on COVID-19 symptoms in the population, is an exemplar of big data citizen science. As of May 23rd, 2021, over 5 million participants have collectively logged over 360 million self-assessment reports since its introduction in March 2020. The success of the Covid Symptom Study creates significant technical challenges around effective data curation. The primary issue is scale. The size of the dataset means that it can no longer be readily processed using standard Python-based data analytics software such as Pandas on commodity hardware. Alternative technologies exist but carry a higher technical complexity and are less accessible to many researchers. We present ExeTera, a Python-based open source software package designed to provide Pandas-like data analytics on datasets that approach terabyte scales. We present its design and capabilities, and show how it is a critical component of a data curation pipeline that enables reproducible research across an international research group for the Covid Symptom Study.},
  author    = {Murray, Benjamin and Kerfoot, Eric and Chen, Liyuan and Deng, Jie and Graham, Mark S. and Sudre, Carole H. and Molteni, Erika and Canas, Liane S. and Antonelli, Michela and Klaser, Kerstin and Visconti, Alessia and Hammers, Alexander and Chan, Andrew T. and Franks, Paul W. and Davies, Richard and Wolf, Jonathan and Spector, Tim D. and Steves, Claire J. and Modat, Marc and Ourselin, Sebastien},
  issn      = {2052-4463},
  language  = {eng},
  number    = {1},
  publisher = {Nature Publishing Group},
  journal   = {Scientific Data},
  title     = {{Accessible data curation and analytics for international-scale citizen science datasets}},
  url       = {http://dx.doi.org/10.1038/s41597-021-01071-x},
  doi       = {10.1038/s41597-021-01071-x},
  volume    = {8},
  year      = {2021},
}