HERD - Hajen Entity Recognition and Disambiguation

Södergren, Anton

Södergren, Anton ^LU (2016) In LU-CS-EX 2016-21 EDA920 20161
Department of Computer Science

Abstract: This thesis describes the process of building an entity recognizer and disambiguator named HERD. The goal of the system is to find mentions of entities in text and link those mentions to a unique identifier. The system is designed to be multilingual and has versions in English, French, and Swedish.

I use Wikipedia as a knowledge source for both names and concepts, and Wikidata, a language-agnostic, structured knowledge source, for unique identifiers. The system collects the links in Wikipedia articles in order to count and analyze them. Each link is treated as a mention consisting of a label and an address, which the system uses as a name and an identifier, respectively. The address is translated into a Wikidata Q-number. When the system parses a new document, each recognized name is linked to a unique identifier.
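
The link-collection step described above can be illustrated with a minimal sketch. It is not taken from the thesis: the regular expression, the function names (`collect_mentions`, `build_mention_counts`, `link_name`), and the `title_to_qid` lookup table are all assumptions made for illustration, and picking the most frequent target is only a simple baseline standing in for the candidate selection methods the thesis explores.

```python
import re
from collections import Counter, defaultdict

# Simplified pattern for [[Target]] and [[Target|label]] wiki links.
# Assumption: nested markup and templates are ignored.
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|([^\[\]]+))?\]\]")


def collect_mentions(wikitext):
    """Yield (label, address) pairs, one per link found in the wikitext."""
    for match in WIKILINK.finditer(wikitext):
        address = match.group(1).strip()
        # A link without an explicit label uses its target title as the label.
        label = (match.group(2) or address).strip()
        yield label, address


def build_mention_counts(pages):
    """Count how often each label points to each address across all pages."""
    counts = defaultdict(Counter)
    for text in pages:
        for label, address in collect_mentions(text):
            counts[label][address] += 1
    return counts


def link_name(name, counts, title_to_qid):
    """Return the Q-number of the most frequent target for a name, or None.

    title_to_qid is a hypothetical mapping from article titles to Wikidata
    Q-numbers; most-frequent-target is a baseline, not the thesis method.
    """
    candidates = counts.get(name)
    if not candidates:
        return None
    address, _ = candidates.most_common(1)[0]
    return title_to_qid.get(address)


pages = [
    "[[Paris]] is the capital of [[France]].",
    "She moved to [[Paris, Texas|Paris]] in 1990.",
    "The [[Paris]] metro opened in 1900.",
]
counts = build_mention_counts(pages)
print(link_name("Paris", counts, {"Paris": "Q90", "France": "Q142"}))
```

In this toy corpus the label "Paris" points to the article "Paris" twice and to "Paris, Texas" once, so the baseline resolves the name to the Q-number stored for "Paris".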

I have explored logistic regression, PageRank, and feature vectors based on the Wikipedia categories to improve name recognition and to select the best candidate for each name.

The system is evaluated with the same method as used in the ERD’14 competition and reached an F1-score of 0.701, which would have placed it 6th out of 17 competitors, 6 percentage points below the highest-scoring participant.
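
For readers unfamiliar with the metric, the sketch below shows how an F1-score is conventionally computed from a set of predicted links and a set of gold-standard links. It is an illustration only: the ERD’14 evaluation procedure is not reproduced here, and the function name `f1_score` and the representation of a link as a (mention, Q-number) pair are assumptions.

```python
def f1_score(predicted, gold):
    """Return (precision, recall, F1) for two sets of links.

    Assumption: a link is any hashable item, for example a
    (mention, Q-number) pair, and only exact matches count as correct.
    """
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical toy data: three predicted links, four gold links, two shared.
gold = {("m1", "Q90"), ("m2", "Q142"), ("m3", "Q1"), ("m4", "Q2")}
predicted = {("m1", "Q90"), ("m2", "Q142"), ("m5", "Q3")}
precision, recall, f1 = f1_score(predicted, gold)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

F1 is the harmonic mean of precision and recall, so a system needs both to be high in order to score well.
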
Popular Abstract (translated from Swedish): To collect large amounts of information about people, places, and organizations, we need to be able to analyze ordinary texts written in natural language. This work contributes to that goal by recognizing names and linking them to the correct Wikipedia article.

Please use this url to cite or link to this publication: http://lup.lub.lu.se/student-papers/record/8883859

author

Södergren, Anton ^LU

supervisor

Pierre Nugues ^LU

organization

Department of Computer Science

alternative title

HERD - Namnigenkänning och identifiering

course

EDA920 20161

year

2016

type

H3 - Professional qualifications (4 Years - )

subject

Technology and Engineering

publication/series

LU-CS-EX 2016-21

report number

LU-CS-EX 2016-21

ISSN

1650-2884

language

English

id

8883859

date added to LUP

2016-06-21 14:00:22

date last changed

2016-06-21 14:00:22

@misc{8883859,
  abstract     = {{This thesis describes the process to build an entity recognizer and disambiguator, named HERD. The goal of the system is to find mentions of entities in text and link those mentions to a unique identifier. This system is designed to be multilingual and has versions in English, French and Swedish. 

I use Wikipedia as a knowledge source of both names and concepts, and Wikidata, a language agnostic, structured knowledge source, for unique identifiers. The system collects the links on Wikipedia articles to count and analyze them. The link is seen as a mention, that consists of a label and an address, that the system uses as a name and an identifier. The address is translated into a Wikidata Q-number. When the system parses a new document, each recognized name is linked to a unique identifier.

I have explored logistic regression, PageRank, and feature vectors based on the Wikipedia categories to improve the name recognition, and select the best candidate for each name.

The system is evaluated with the same method as used in the ERD’14 competition, and reached an F1-score of 0.701, which would have placed it 6th, out of 17 competitors, 6 percentage points lower than the highest scoring participant.}},
  author       = {{Södergren, Anton}},
  issn         = {{1650-2884}},
  language     = {{eng}},
  note         = {{Student Paper}},
  series       = {{LU-CS-EX 2016-21}},
  title        = {{HERD - Hajen Entity Recognition and Disambiguation}},
  year         = {{2016}},
}

LUP Student Papers

LUND UNIVERSITY LIBRARIES
