Pairing Wikipedia Articles Across Languages

Klang, Marcus (LU) and Nugues, Pierre (LU) (2016). Open Knowledge Base and Question Answering (OKBQA) Workshop, p. 72-76
Abstract
Wikipedia has become a reference knowledge source for scores of NLP applications. One of its invaluable features lies in its multilingual nature, where articles on a same entity or concept can have from one to more than 200 different versions. The interlinking of language versions in Wikipedia has undergone a major renewal with the advent of Wikidata, a unified scheme to identify entities and their properties using unique numbers. However, as the interlinking is still manually carried out by thousands of editors across the globe, errors may creep in the assignment of entities. In this paper, we describe an optimization technique to match automatically language versions of articles, and hence entities, that is only based on bags of words and anchors. We created a dataset of all the articles on persons we extracted from Wikipedia in six languages: English, French, German, Russian, Spanish, and Swedish. We report a correct match of at least 94.3% on each pair.
type
Chapter in Book/Report/Conference proceeding
publication status
published
host publication
Proceedings of the Open Knowledge Base and Question Answering Workshop (OKBQA 2016)
pages
72 - 76
publisher
The COLING 2016 Organizing Committee
conference name
Open Knowledge Base and Question Answering (OKBQA) Workshop
conference location
Osaka, Japan
conference dates
2016-12-11 - 2016-12-11
ISBN
978-4-87974-712-9
language
English
LU publication?
yes
id
10b176f2-f95c-492c-9cf7-f29c82acee70
alternative location
http://www.aclweb.org/anthology/W/W16/W16-4410.pdf
date added to LUP
2016-12-11 11:26:33
date last changed
2019-03-08 03:05:23
@inproceedings{10b176f2-f95c-492c-9cf7-f29c82acee70,
  abstract     = {Wikipedia has become a reference knowledge source for scores of NLP applications. One of its invaluable features lies in its multilingual nature, where articles on a same entity or concept can have from one to more than 200 different versions. The interlinking of language versions in Wikipedia has undergone a major renewal with the advent of Wikidata, a unified scheme to identify entities and their properties using unique numbers. However, as the interlinking is still manually carried out by thousands of editors across the globe, errors may creep in the assignment of entities. In this paper, we describe an optimization technique to match automatically language versions of articles, and hence entities, that is only based on bags of words and anchors. We created a dataset of all the articles on persons we extracted from Wikipedia in six languages: English, French, German, Russian, Spanish, and Swedish. We report a correct match of at least 94.3% on each pair.},
  author       = {Klang, Marcus and Nugues, Pierre},
  booktitle    = {Proceedings of the Open Knowledge Base and Question Answering Workshop (OKBQA 2016)},
  isbn         = {978-4-87974-712-9},
  language     = {eng},
  location     = {Osaka, Japan},
  pages        = {72--76},
  publisher    = {The COLING 2016 Organizing Committee},
  title        = {Pairing Wikipedia Articles Across Languages},
  url          = {http://www.aclweb.org/anthology/W/W16/W16-4410.pdf},
  year         = {2016},
}