LUP Student Papers

LUND UNIVERSITY LIBRARIES

HERD - Hajen Entity Recognition and Disambiguation

Södergren, Anton LU (2016) In LU-CS-EX 2016-21 EDA920 20161
Department of Computer Science
Abstract
This thesis describes the process of building an entity recognizer and disambiguator named HERD. The goal of the system is to find mentions of entities in text and link each mention to a unique identifier. The system is designed to be multilingual and has versions in English, French and Swedish.

I use Wikipedia as a knowledge source for both names and concepts, and Wikidata, a language-agnostic, structured knowledge source, for unique identifiers. The system collects the links in Wikipedia articles to count and analyze them. Each link is treated as a mention consisting of a label and an address, which the system uses as a name and an identifier, respectively. The address is translated into a Wikidata Q-number. When the system parses a new document, each recognized name is linked to a unique identifier.
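
The link-counting step described above can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names (`extract_mentions`, `build_dictionary`, `link`), the regular expression, the sample pages and the `title_to_qid` mapping are all assumptions made for illustration, and the Q-numbers are placeholders that were not looked up on Wikidata. The sketch assumes links in standard wiki markup, `[[Address]]` or `[[Address|label]]`, and picks the most frequently linked identifier for a name, which is only a frequency baseline.

```python
import re
from collections import Counter, defaultdict

# Matches [[Address]] and [[Address|label]]; an optional #section is ignored.
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|([^\[\]]+))?\]\]")


def extract_mentions(wikitext):
    """Yield (label, address) pairs for every link in the markup."""
    for match in WIKILINK.finditer(wikitext):
        address = match.group(1).strip()
        # A link without an explicit label uses its address as the label.
        label = (match.group(2) or address).strip()
        yield label, address


def build_dictionary(pages, title_to_qid):
    """Count how often each label links to each Q-number."""
    counts = defaultdict(Counter)
    for wikitext in pages:
        for label, address in extract_mentions(wikitext):
            qid = title_to_qid.get(address)
            if qid is not None:  # skip addresses with no known identifier
                counts[label][qid] += 1
    return counts


def link(name, counts):
    """Return the most frequently linked Q-number for a name, or None."""
    candidates = counts.get(name)
    return candidates.most_common(1)[0][0] if candidates else None


# Hypothetical input: three tiny "articles" and a title-to-identifier map.
pages = [
    "[[Lund]] is a city in [[Sweden]].",
    "[[Lund University|Lund]] is located in the city of [[Lund]].",
    "The cathedral in [[Lund]] is a landmark.",
]
# Placeholder identifiers, NOT real Wikidata Q-numbers.
title_to_qid = {
    "Lund": "Q900001",
    "Sweden": "Q900002",
    "Lund University": "Q900003",
}

counts = build_dictionary(pages, title_to_qid)
best = link("Lund", counts)
```

In this toy data the label "Lund" points to the city three times and to the university once, so the baseline picks the city. The abstract indicates that HERD goes beyond such counts when selecting candidates.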

I explored logistic regression, PageRank, and feature vectors based on Wikipedia categories to improve name recognition and to select the best candidate for each name.

The system was evaluated with the same method as used in the ERD’14 competition and reached an F1-score of 0.701, which would have placed it 6th out of 17 competitors, 6 percentage points below the highest-scoring participant.
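
For readers unfamiliar with the metric, the following Python sketch shows the standard definition of the F1-score as the harmonic mean of precision and recall over sets of annotations. The exact matching and averaging rules used in ERD’14 are not stated in this record, so this is only the textbook formula; the function name `f1_score` and the sample annotations (document and identifier pairs with placeholder identifiers) are hypothetical and unrelated to the reported 0.701.

```python
def f1_score(predicted, gold):
    """Harmonic mean of precision and recall over two annotation sets."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical annotations: (document id, placeholder entity identifier).
gold = {("doc1", "Q_A"), ("doc1", "Q_B"), ("doc1", "Q_C"), ("doc1", "Q_D")}
predicted = {("doc1", "Q_A"), ("doc1", "Q_B"), ("doc1", "Q_E")}

score = f1_score(predicted, gold)
```

Here two of the three predictions are correct and two of the four gold annotations are found, giving precision 2/3, recall 1/2 and an F1-score of 4/7.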
Popular Abstract (translated from Swedish)
To gather large amounts of information about people, places and organizations, we need to be able to analyze ordinary texts written in natural language. This work contributes to that goal by recognizing names and linking them to the correct Wikipedia article.
author
Södergren, Anton LU
supervisor
organization
alternative title
HERD - Namnigenkänning och identifiering
course
EDA920 20161
year
2016
type
H3 - Professional qualifications (4 Years - )
subject
publication/series
LU-CS-EX 2016-21
report number
LU-CS-EX 2016-21
ISSN
1650-2884
language
English
id
8883859
date added to LUP
2016-06-21 14:00:22
date last changed
2016-06-21 14:00:22
@misc{8883859,
  abstract     = {{This thesis describes the process to build an entity recognizer and disambiguator, named HERD. The goal of the system is to find mentions of entities in text and link those mentions to a unique identifier. This system is designed to be multilingual and has versions in English, French and Swedish. 

I use Wikipedia as a knowledge source of both names and concepts, and Wikidata, a language agnostic, structured knowledge source, for unique identifiers. The system collects the links on Wikipedia articles to count and analyze them. The link is seen as a mention, that consists of a label and an address, that the system uses as a name and an identifier. The address is translated into a Wikidata Q-number. When the system parses a new document, each recognized name is linked to a unique identifier.

I have explored logistic regression, PageRank, and feature vectors based on the Wikipedia categories to improve the name recognition, and select the best candidate for each name.

The system is evaluated with the same method as used in the ERD’14 competition, and reached an F1-score of 0.701, which would have placed it 6th, out of 17 competitors, 6 percentage points lower than the highest scoring participant.}},
  author       = {{Södergren, Anton}},
  issn         = {{1650-2884}},
  language     = {{eng}},
  note         = {{Student Paper}},
  series       = {{LU-CS-EX 2016-21}},
  title        = {{HERD - Hajen Entity Recognition and Disambiguation}},
  year         = {{2016}},
}