Docforia: A Multilayer Document Model

Klang, Marcus LU and Nugues, Pierre LU (2017) 21st Nordic Conference of Computational Linguistics In Linköping Electronic Conference Proceedings 131.
Abstract
In this paper, we describe Docforia, a multilayer document model and application programming interface (API) to store formatting, lexical, syntactic, and semantic annotations on Wikipedia and other kinds of text and visualize them. While Wikipedia has become a major NLP resource, its scale and heterogeneity make it relatively difficult to run experiments on the whole corpus. These experiments are rendered even more complex as, to the best of our knowledge, there is no available tool to easily visualize the results of a processing pipeline. We designed Docforia so that it can store millions of documents and billions of tokens, annotated using different processing tools that themselves use multiple formats, and so that it is compatible with cluster computing frameworks such as Hadoop or Spark. The annotation output, either partial or complete, can then be shared more easily. To validate Docforia, we processed six language versions of Wikipedia: English, French, German, Spanish, Russian, and Swedish, up to semantic role labeling, depending on the NLP tools available for a given language. We stored the results in our document model and created a visualization tool to inspect the annotation results.
author
Klang, Marcus and Nugues, Pierre
organization
publishing date
2017
type
Chapter in Book/Report/Conference proceeding
publication status
published
subject
in
Linköping Electronic Conference Proceedings
volume
131
publisher
Linköping University Electronic Press
conference name
21st Nordic Conference of Computational Linguistics
ISSN
1650-3686
1650-3740
language
English
LU publication?
yes
id
450df2ef-6765-4fda-b2bb-4351b5dfe2d6
alternative location
http://www.ep.liu.se/ecp/131/027/ecp17131027.pdf
date added to LUP
2017-05-23 11:48:04
date last changed
2017-06-08 11:23:24
@inproceedings{450df2ef-6765-4fda-b2bb-4351b5dfe2d6,
  abstract     = {In this paper, we describe Docforia, a multilayer document model and application programming interface (API) to store formatting, lexical, syntactic, and semantic annotations on Wikipedia and other kinds of text and visualize them. While Wikipedia has become a major NLP resource, its scale and heterogeneity make it relatively difficult to run experiments on the whole corpus. These experiments are rendered even more complex as, to the best of our knowledge, there is no available tool to easily visualize the results of a processing pipeline. We designed Docforia so that it can store millions of documents and billions of tokens, annotated using different processing tools that themselves use multiple formats, and so that it is compatible with cluster computing frameworks such as Hadoop or Spark. The annotation output, either partial or complete, can then be shared more easily. To validate Docforia, we processed six language versions of Wikipedia: English, French, German, Spanish, Russian, and Swedish, up to semantic role labeling, depending on the NLP tools available for a given language. We stored the results in our document model and created a visualization tool to inspect the annotation results.},
  author       = {Klang, Marcus and Nugues, Pierre},
  booktitle    = {Linköping Electronic Conference Proceedings},
  issn         = {1650-3686},
  language     = {eng},
  publisher    = {Linköping University Electronic Press},
  title        = {Docforia: A Multilayer Document Model},
  volume       = {131},
  year         = {2017},
}