Building Knowledge Graphs : Processing Infrastructure and Named Entity Linking

Klang, Marcus

(2019)

Abstract: Things such as organizations, persons, or locations are ubiquitous in all texts circulating on the internet, particularly in the news, forum posts, and social media. Today, there is more written material than any single person can read through during a typical lifespan. Automatic systems can help us amplify our abilities to find relevant information, where, ideally, a system would learn knowledge from our combined written legacy. Ultimately, this would enable us, one day, to build automatic systems that have reasoning capabilities and can answer any question in any human language.

In this work, I explore methods to represent linguistic structures in text, build processing infrastructures, and how they can be combined to process a comprehensive collection of documents. The goal is to extract knowledge from text via things, entities. As text, I focused on encyclopedic resources such as Wikipedia.

As knowledge representation, I chose to use graphs, where the entities correspond to graph nodes. To populate such graphs, I created a named entity linker that can find entities in multiple languages such as English, Spanish, and Chinese, and associate them with unique identifiers. In addition, I describe a published state-of-the-art Swedish named entity recognizer that finds mentions of entities in text, which I evaluated on the four majority classes in the Stockholm-Umeå Corpus (SUC) 3.0.

To collect the text resources needed for the implementation of the algorithms and the training of the machine-learning models, I also describe a document representation, Docria, that consists of multiple layers of annotations: A model capable of representing structures found in Wikipedia and beyond. Finally, I describe how to construct processing pipelines for large-scale processing with Wikipedia using Docria.

Please use this url to cite or link to this publication: https://lup.lub.lu.se/record/1abc3ccb-c90b-4e6d-8d0c-673287fbf2de

author

Klang, Marcus ^LU

supervisor

Pierre Nugues ^LU

opponent

Professor Biemann, Chris, Hamburg University, Germany

organization

publishing date

2019-09-17

type

Thesis

publication status

published

subject

Natural Language Processing

keywords

natural language processing, machine learning, computational linguistics, named entity linking

pages

164 pages

publisher

Department of Computer Science, Lund University

defense location

Lecture hall E:1406, building E, Ole Römers väg 3, Lund University, Faculty of Engineering LTH

defense date

2019-10-11 13:15:00

external identifiers

scopus:85063236270

ISSN

1404-1219

ISBN

978-91-7895-286-1

978-91-7895-287-8

language

English

LU publication?

yes

id

1abc3ccb-c90b-4e6d-8d0c-673287fbf2de

date added to LUP

2019-09-12 14:05:06

date last changed

2026-01-10 07:03:15

@phdthesis{1abc3ccb-c90b-4e6d-8d0c-673287fbf2de,
  abstract     = {{Things such as organizations, persons, or locations are ubiquitous in all texts circulating on the internet, particularly in the news, forum posts, and social media. Today, there is more written material than any single person can read through during a typical lifespan. Automatic systems can help us amplify our abilities to find relevant information, where, ideally, a system would learn knowledge from our combined written legacy. Ultimately, this would enable us, one day, to build automatic systems that have reasoning capabilities and can answer any question in any human language. In this work, I explore methods to represent linguistic structures in text, build processing infrastructures, and how they can be combined to process a comprehensive collection of documents. The goal is to extract knowledge from text via things, entities. As text, I focused on encyclopedic resources such as Wikipedia. As knowledge representation, I chose to use graphs, where the entities correspond to graph nodes. To populate such graphs, I created a named entity linker that can find entities in multiple languages such as English, Spanish, and Chinese, and associate them to unique identifiers. In addition, I describe a published state-of-the-art Swedish named entity recognizer that finds mentions of entities in text that I evaluated on the four majority classes in the Stockholm-Umeå Corpus (SUC) 3.0. To collect the text resources needed for the implementation of the algorithms and the training of the machine-learning models, I also describe a document representation, Docria, that consists of multiple layers of annotations: A model capable of representing structures found in Wikipedia and beyond. Finally, I describe how to construct processing pipelines for large-scale processing with Wikipedia using Docria.}},
  author       = {{Klang, Marcus}},
  isbn         = {{978-91-7895-286-1}},
  issn         = {{1404-1219}},
  keywords     = {{natural language processing; machine learning; computational linguistics; named entity linking}},
  language     = {{eng}},
  month        = {{09}},
  publisher    = {{Department of Computer Science, Lund University}},
  school       = {{Lund University}},
  title        = {{Building Knowledge Graphs : Processing Infrastructure and Named Entity Linking}},
  url          = {{https://lup.lub.lu.se/search/files/69709434/Marcus_Corrected_PhD_Thesis.pdf}},
  year         = {{2019}},
}

Lund University Publications
