KOSHIK: A large-scale distributed computing framework for NLP

Exner, Peter; Nugues, Pierre

(2014) 3rd International Conference on Pattern Recognition Applications and Methods (ICPRAM 2014) p. 464-470

Abstract: In this paper, we describe KOSHIK, an end-to-end framework to process the unstructured natural language content of multilingual documents. We used the Hadoop distributed computing infrastructure to build this framework as it enables KOSHIK to easily scale by adding inexpensive commodity hardware. We designed an annotation model that allows the processing algorithms to incrementally add layers of annotation without modifying the original document. We used the Avro binary format to serialize the documents. Avro is designed for Hadoop and allows other data warehousing tools to directly query the documents. This paper reports the implementation choices and details of the framework, the annotation model, the options for querying processed data, and the parsing results on the English and Swedish editions of Wikipedia.

Please use this url to cite or link to this publication: https://lup.lub.lu.se/record/4352684

author

Exner, Peter and Nugues, Pierre

publishing date

2014

type

Chapter in Book/Report/Conference proceeding

publication status

published

subject

Computer Sciences

host publication

3rd International Conference on Pattern Recognition Applications and Methods (ICPRAM 2014)

pages

464 - 470

publisher

SciTePress

conference name

3rd International Conference on Pattern Recognition Applications and Methods (ICPRAM 2014)

conference location

Angers, France

conference dates

2014-03-06 - 2014-03-08

external identifiers

scopus:84902310568

ISBN

978-989-758-018-5

language

English

LU publication?

yes

id

c9fc300c-af19-4b87-be5b-e4547075d01a (old id 4352684)

date added to LUP

2016-04-04 14:00:12

date last changed

2025-10-14 13:15:06

@inproceedings{c9fc300c-af19-4b87-be5b-e4547075d01a,
  abstract     = {{In this paper, we describe KOSHIK, an end-to-end framework to process the unstructured natural language content of multilingual documents. We used the Hadoop distributed computing infrastructure to build this framework as it enables KOSHIK to easily scale by adding inexpensive commodity hardware. We designed an annotation model that allows the processing algorithms to incrementally add layers of annotation without modifying the original document. We used the Avro binary format to serialize the documents. Avro is designed for Hadoop and allows other data warehousing tools to directly query the documents. This paper reports the implementation choices and details of the framework, the annotation model, the options for querying processed data, and the parsing results on the English and Swedish editions of Wikipedia.}},
  author       = {{Exner, Peter and Nugues, Pierre}},
  booktitle    = {{3rd International Conference on Pattern Recognition Applications and Methods (ICPRAM 2014)}},
  isbn         = {{978-989-758-018-5}},
  language     = {{eng}},
  pages        = {{464--470}},
  publisher    = {{SciTePress}},
  title        = {{KOSHIK: A large-scale distributed computing framework for NLP}},
  url          = {{https://lup.lub.lu.se/search/files/19728738/4352694.pdf}},
  year         = {{2014}},
}
