Files and code for English dictionaries, gold and silver standard corpora for biomedical natural language processing related to SARS-CoV-2 and COVID-19 : Dataset record

Kazemi Rashed, Salma; Ahmed, Rafsan; Frid, Johan; Aits, Sonja


Kazemi Rashed, Salma (LU); Ahmed, Rafsan (LU); Frid, Johan (LU) and Aits, Sonja (LU)

(2022)

Abstract

BACKGROUND
Automated information extraction with natural language processing (NLP) tools is required to gain systematic insights from the large number of COVID-19 publications, reports and social media posts, which far exceed human processing capabilities. A key challenge for NLP is the extensive variation in terminology used to describe medical entities, which was especially pronounced for this newly emergent disease.

FINDINGS
Here we present an NLP toolbox comprising very large English dictionaries of synonyms for SARS-CoV-2 (including variant names) and COVID-19, which can be used with dictionary-based NLP tools. We also present a silver standard corpus generated with the dictionaries, and a gold standard corpus consisting of PubMed abstracts manually annotated for disease, virus, symptom, protein/gene, cell type, chemical and species terms, which can be used to train and evaluate COVID-19-related NLP tools. Code for annotation, which can be used to expand the silver standard corpus or for text mining, is also included. This toolbox is freely available on GitHub (https://github.com/Aitslab/corona) and here.

CONCLUSIONS
The toolbox can be used for a variety of text analytics tasks related to the COVID-19 crisis and has already been used to create a COVID-19 knowledge graph, study the variability and evolution of COVID-19-related terminology, and develop and benchmark text mining tools.

Please use this URL to cite or link to this publication: https://lup.lub.lu.se/record/f54a0266-5330-426a-a140-6303960592bb

author

Kazemi Rashed, Salma (LU); Ahmed, Rafsan (LU); Frid, Johan (LU) and Aits, Sonja (LU)

organization

publishing date

2022-06-14

type

Other contribution

publication status

published

subject

Infectious Medicine

publisher

Zenodo

DOI

10.5281/zenodo.6642275

project

Lund University AI Research

Biomedical text mining for systems biology

Artificial intelligence-based text mining for COVID-19 and other areas of medicine

Studying COVID-19 with artificial intelligence

language

English

LU publication?

yes

id

f54a0266-5330-426a-a140-6303960592bb

date added to LUP

2023-01-07 12:40:24

date last changed

2025-04-04 15:27:27

@misc{f54a0266-5330-426a-a140-6303960592bb,
  abstract     = {{BACKGROUND: Automated information extraction with natural language processing (NLP) tools is required to gain systematic insights from the large number of COVID-19 publications, reports and social media posts, which far exceed human processing capabilities. A key challenge for NLP is the extensive variation in terminology used to describe medical entities, which was especially pronounced for this newly emergent disease. FINDINGS: Here we present an NLP toolbox comprising very large English dictionaries of synonyms for SARS-CoV-2 (including variant names) and COVID-19, which can be used with dictionary-based NLP tools. We also present a silver standard corpus generated with the dictionaries, and a gold standard corpus consisting of PubMed abstracts manually annotated for disease, virus, symptom, protein/gene, cell type, chemical and species terms, which can be used to train and evaluate COVID-19-related NLP tools. Code for annotation, which can be used to expand the silver standard corpus or for text mining, is also included. This toolbox is freely available on GitHub (https://github.com/Aitslab/corona) and here. CONCLUSIONS: The toolbox can be used for a variety of text analytics tasks related to the COVID-19 crisis and has already been used to create a COVID-19 knowledge graph, study the variability and evolution of COVID-19-related terminology, and develop and benchmark text mining tools.}},
  author       = {{Kazemi Rashed, Salma and Ahmed, Rafsan and Frid, Johan and Aits, Sonja}},
  language     = {{eng}},
  month        = {{06}},
  publisher    = {{Zenodo}},
  title        = {{Files and code for English dictionaries, gold and silver standard corpora for biomedical natural language processing related to SARS-CoV-2 and COVID-19 : Dataset record}},
  url          = {{http://dx.doi.org/10.5281/zenodo.6642275}},
  doi          = {{10.5281/zenodo.6642275}},
  year         = {{2022}},
}

Lund University Publications
