Docforia : A Multilayer Document Model

Klang, Marcus; Nugues, Pierre

(2017) 21st Nordic Conference of Computational Linguistics, NoDaLiDa 2017. In NoDaLiDa 2017 - 21st Nordic Conference of Computational Linguistics, Proceedings of the Conference, p. 226-230

Abstract: In this paper, we describe Docforia, a multilayer document model and application programming interface (API) to store formatting, lexical, syntactic, and semantic annotations on Wikipedia and other kinds of text and to visualize them. While Wikipedia has become a major NLP resource, its scale and heterogeneity make it relatively difficult to run experiments on the whole corpus. These experiments are made even more complex because, to the best of our knowledge, no tool is available to easily visualize the results of a processing pipeline. We designed Docforia so that it can store millions of documents and billions of tokens, annotated using different processing tools that themselves use multiple formats, and so that it is compatible with cluster computing frameworks such as Hadoop or Spark. The annotation output, either partial or complete, can then be shared more easily. To validate Docforia, we processed six language versions of Wikipedia: English, French, German, Spanish, Russian, and Swedish, up to semantic role labeling, depending on the NLP tools available for a given language. We stored the results in our document model and created a visualization tool to inspect the annotation results.

Please use this url to cite or link to this publication: https://lup.lub.lu.se/record/635bc286-642b-4bc1-b070-6e009c054130

author

Klang, Marcus

and Nugues, Pierre

publishing date

2017

type

Chapter in Book/Report/Conference proceeding

publication status

published

subject

Computer and Information Sciences

host publication

NoDaLiDa 2017 - 21st Nordic Conference of Computational Linguistics, Proceedings of the Conference

series title

NoDaLiDa 2017 - 21st Nordic Conference of Computational Linguistics, Proceedings of the Conference

editor

Tiedemann, Jörg

pages

5 pages

publisher

Association for Computational Linguistics (ACL)

conference name

21st Nordic Conference of Computational Linguistics, NoDaLiDa 2017

conference location

Gothenburg, Sweden

conference dates

2017-05-23 - 2017-05-24

external identifiers

scopus:85123002548

ISBN

9789176856017

language

English

LU publication?

yes

id

635bc286-642b-4bc1-b070-6e009c054130

date added to LUP

2022-03-09 13:32:21

date last changed

2025-10-14 08:55:46

@inproceedings{635bc286-642b-4bc1-b070-6e009c054130,
  abstract     = {{In this paper, we describe Docforia, a multilayer document model and application programming interface (API) to store formatting, lexical, syntactic, and semantic annotations on Wikipedia and other kinds of text and to visualize them. While Wikipedia has become a major NLP resource, its scale and heterogeneity make it relatively difficult to run experiments on the whole corpus. These experiments are made even more complex because, to the best of our knowledge, no tool is available to easily visualize the results of a processing pipeline. We designed Docforia so that it can store millions of documents and billions of tokens, annotated using different processing tools that themselves use multiple formats, and so that it is compatible with cluster computing frameworks such as Hadoop or Spark. The annotation output, either partial or complete, can then be shared more easily. To validate Docforia, we processed six language versions of Wikipedia: English, French, German, Spanish, Russian, and Swedish, up to semantic role labeling, depending on the NLP tools available for a given language. We stored the results in our document model and created a visualization tool to inspect the annotation results.}},
  author       = {{Klang, Marcus and Nugues, Pierre}},
  booktitle    = {{NoDaLiDa 2017 - 21st Nordic Conference of Computational Linguistics, Proceedings of the Conference}},
  editor       = {{Tiedemann, Jörg}},
  isbn         = {{9789176856017}},
  language     = {{eng}},
  pages        = {{226--230}},
  publisher    = {{Association for Computational Linguistics (ACL)}},
  series       = {{NoDaLiDa 2017 - 21st Nordic Conference of Computational Linguistics, Proceedings of the Conference}},
  title        = {{Docforia : A Multilayer Document Model}},
  year         = {{2017}},
}

Lund University Publications