Docforia: A Multilayer Document Model
(2016) Sixth Swedish Language Technology Conference (SLTC 2016)
- Abstract
- In this paper, we describe Docforia, a multilayer document model and application programming interface (API) to store formatting, lexical, syntactic, and semantic annotations on Wikipedia and other kinds of text and to visualize them. While Wikipedia has become a major NLP resource, its scale and heterogeneity make it relatively difficult to run experiments on the whole corpus. These experiments are rendered even more complex as, to the best of our knowledge, there is no available tool to easily visualize the results of a processing pipeline. We designed Docforia so that it can store millions of documents and billions of tokens, annotated using different processing tools that themselves use multiple formats, and so that it is compatible with cluster computing frameworks such as Hadoop or Spark. The annotation output, either partial or complete, can then be shared more easily. To validate Docforia, we processed six language versions of Wikipedia: English, French, German, Spanish, Russian, and Swedish, up to semantic role labeling, depending on the NLP tools available for a given language. We stored the results in our document model and created a visualization tool to inspect the annotation results. The Docforia API is available at https://github.com/marcusklang/docforia.
Please use this URL to cite or link to this publication:
https://lup.lub.lu.se/record/40456a96-42b1-4200-9667-d1c91e481994
- author
- Klang, Marcus LU and Nugues, Pierre LU
- organization
- publishing date
- 2016
- type
- Contribution to conference
- publication status
- published
- subject
- conference name
- Sixth Swedish Language Technology Conference (SLTC 2016)
- conference location
- Umeå, Sweden
- conference dates
- 2016-11-17 - 2016-11-18
- language
- English
- LU publication?
- yes
- id
- 40456a96-42b1-4200-9667-d1c91e481994
- alternative location
- http://www8.cs.umu.se/~johanna/sltc2016/abstracts/SLTC_2016_paper_4.pdf
- date added to LUP
- 2017-01-11 16:52:58
- date last changed
- 2021-05-05 21:58:00
@misc{40456a96-42b1-4200-9667-d1c91e481994,
  abstract = {{In this paper, we describe Docforia, a multilayer document model and application programming interface (API) to store formatting, lexical, syntactic, and semantic annotations on Wikipedia and other kinds of text and to visualize them. While Wikipedia has become a major NLP resource, its scale and heterogeneity make it relatively difficult to run experiments on the whole corpus. These experiments are rendered even more complex as, to the best of our knowledge, there is no available tool to easily visualize the results of a processing pipeline. We designed Docforia so that it can store millions of documents and billions of tokens, annotated using different processing tools that themselves use multiple formats, and so that it is compatible with cluster computing frameworks such as Hadoop or Spark. The annotation output, either partial or complete, can then be shared more easily. To validate Docforia, we processed six language versions of Wikipedia: English, French, German, Spanish, Russian, and Swedish, up to semantic role labeling, depending on the NLP tools available for a given language. We stored the results in our document model and created a visualization tool to inspect the annotation results. The Docforia API is available at https://github.com/marcusklang/docforia.}},
  author = {{Klang, Marcus and Nugues, Pierre}},
  language = {{eng}},
  title = {{Docforia: A Multilayer Document Model}},
  url = {{http://www8.cs.umu.se/~johanna/sltc2016/abstracts/SLTC_2016_paper_4.pdf}},
  year = {{2016}},
}