WikiParq: A Tabulated Wikipedia Resource Using the Parquet Format

Klang, Marcus; Nugues, Pierre

(2016), pp. 4141-4148

Abstract: Wikipedia has become one of the most popular resources in natural language processing and it is used in quantities of applications. However, Wikipedia requires a substantial pre-processing step before it can be used. For instance, its set of nonstandardized annotations, referred to as the wiki markup, is language-dependent and needs specific parsers from language to language, for English, French, Italian, etc. In addition, the intricacies of the different Wikipedia resources (main article text, categories, wikidata, infoboxes), scattered into the article document or in different files, make it difficult to have a global view of this outstanding resource. In this paper, we describe WikiParq, a unified format based on the Parquet standard to tabulate and package the Wikipedia corpora. In combination with Spark, a map-reduce computing framework, and the SQL query language, WikiParq makes it much easier to write database queries to extract specific information or subcorpora from Wikipedia, such as all the first paragraphs of the articles in French, or all the articles on persons in Spanish, or all the articles on persons that have versions in French, English, and Spanish.
WikiParq is available in six language versions and is potentially extendible to all the languages of Wikipedia. The WikiParq files are downloadable as tarball archives from this location: http://semantica.cs.lth.se/wikiparq/.
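
The last query mentioned in the abstract (persons with versions in French, English, and Spanish) can be illustrated with a small sketch. This is not WikiParq's schema or API: the table layout, the column names (lang, entity, predicate, value), and the sample rows are hypothetical, and Python's built-in sqlite3 module stands in for Spark and Parquet so that the example runs without extra dependencies. It only shows why a tabulated corpus makes such a request a short SQL statement.

```python
import sqlite3

# Hypothetical, simplified rows: (language, entity id, predicate, value).
# The real WikiParq column layout is not given in this record.
ROWS = [
    ("fr", "Q1", "instance_of", "person"),
    ("en", "Q1", "instance_of", "person"),
    ("es", "Q1", "instance_of", "person"),
    ("fr", "Q2", "instance_of", "person"),
    ("en", "Q2", "instance_of", "person"),
    ("fr", "Q3", "instance_of", "city"),
    ("en", "Q3", "instance_of", "city"),
    ("es", "Q3", "instance_of", "city"),
]


def persons_in_all_languages(rows, languages):
    """Return entity ids tagged as persons that occur in every given language."""
    languages = sorted(set(languages))  # drop duplicates so the count below is exact
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE wiki (lang TEXT, entity TEXT, predicate TEXT, value TEXT)"
    )
    conn.executemany("INSERT INTO wiki VALUES (?, ?, ?, ?)", rows)
    placeholders = ", ".join("?" for _ in languages)
    # Keep person rows in the requested languages, then retain only the
    # entities that appear in all of those languages.
    query = f"""
        SELECT entity
        FROM wiki
        WHERE predicate = 'instance_of' AND value = 'person'
          AND lang IN ({placeholders})
        GROUP BY entity
        HAVING COUNT(DISTINCT lang) = ?
        ORDER BY entity
    """
    result = [row[0] for row in conn.execute(query, (*languages, len(languages)))]
    conn.close()
    return result


if __name__ == "__main__":
    print(persons_in_all_languages(ROWS, ["fr", "en", "es"]))
```

With the real resource, a comparable statement would presumably be run through Spark SQL over the downloaded Parquet files, using whatever column names the WikiParq distribution actually defines.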

Please use this url to cite or link to this publication: https://lup.lub.lu.se/record/5e2a7be2-9bc2-4d72-bec3-618939d0b729

author

Klang, Marcus (LU) and Nugues, Pierre (LU)

publishing date

2016-05

type

Chapter in Book/Report/Conference proceeding

publication status

published

subject

Natural Language Processing

host publication

Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016)

pages

4141 - 4148

publisher

European Language Resources Association

external identifiers

scopus:85037160936

ISBN

978-295174089-1

language

English

LU publication?

yes

id

5e2a7be2-9bc2-4d72-bec3-618939d0b729

alternative location

http://www.lrec-conf.org/proceedings/lrec2016/pdf/31_Paper.pdf

date added to LUP

2016-05-14 20:43:24

date last changed

2025-10-14 11:01:56

@inproceedings{5e2a7be2-9bc2-4d72-bec3-618939d0b729,
  abstract     = {{Wikipedia has become one of the most popular resources in natural language processing and it is used in quantities of applications. However, Wikipedia requires a substantial pre-processing step before it can be used. For instance, its set of nonstandardized annotations, referred to as the wiki markup, is language-dependent and needs specific parsers from language to language, for English, French, Italian, etc. In addition, the intricacies of the different Wikipedia resources (main article text, categories, wikidata, infoboxes), scattered into the article document or in different files, make it difficult to have a global view of this outstanding resource. In this paper, we describe WikiParq, a unified format based on the Parquet standard to tabulate and package the Wikipedia corpora. In combination with Spark, a map-reduce computing framework, and the SQL query language, WikiParq makes it much easier to write database queries to extract specific information or subcorpora from Wikipedia, such as all the first paragraphs of the articles in French, or all the articles on persons in Spanish, or all the articles on persons that have versions in French, English, and Spanish. WikiParq is available in six language versions and is potentially extendible to all the languages of Wikipedia. The WikiParq files are downloadable as tarball archives from this location: http://semantica.cs.lth.se/wikiparq/.}},
  author       = {{Klang, Marcus and Nugues, Pierre}},
  booktitle    = {{Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016)}},
  isbn         = {{978-295174089-1}},
  language     = {{eng}},
  pages        = {{4141--4148}},
  publisher    = {{European Language Resources Association}},
  title        = {{WikiParq: A Tabulated Wikipedia Resource Using the Parquet Format}},
  url          = {{https://lup.lub.lu.se/search/files/7671984/31_Paper_3.pdf}},
  year         = {{2016}},
}

Lund University Publications
