Creating a coreference solver for Swedish and German using distant supervision

Wallin, Alexander LU (2017) In LU-CS-EX 2017-03 EDA920 20162
Department of Computer Science
Abstract
It is said that coreference is difficult to explain, but easy to comprehend; everyone knows coreference, they just don’t know that they do. We trained a computer to know it too!

Coreference resolution is the identification of phrases that refer to the same entity in a text. Current techniques to solve coreferences use machine-learning algorithms, which require large annotated data sets. Such annotated resources are not available for most languages today. In this report, we describe a method for solving coreference for Swedish and German without annotated texts using distant supervision. We generate a weakly labelled training set using multilingual corpora, where we solve the coreference for English using CoreNLP and transfer it to Swedish and German using word alignment. Additionally, we identify mentions from dependency graphs in both languages using handwritten rules. Finally, we evaluate the end-to-end results using the evaluation framework from the CoNLL 2012 shared task where we obtain an F-measure of 34.98 for Swedish and 13.16 for German.
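The transfer step mentioned in the abstract (moving English coreference chains onto Swedish or German text through word alignment) can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names `project_span` and `project_chains`, the token-index span representation, the Swedish rendering of the example sentence, and the hand-written alignment are all assumptions made for illustration. In practice the alignment would come from an automatic word aligner.

```python
from typing import Dict, List, Optional, Set, Tuple

Span = Tuple[int, int]  # inclusive token indices (start, end)


def project_span(span: Span, alignment: Dict[int, Set[int]]) -> Optional[Span]:
    """Map a source span to the smallest target span covering its aligned tokens."""
    targets = [t for s in range(span[0], span[1] + 1) for t in alignment.get(s, ())]
    if not targets:
        return None  # no token of the mention is aligned, so it cannot be transferred
    return (min(targets), max(targets))


def project_chains(chains: List[List[Span]],
                   alignment: Dict[int, Set[int]]) -> List[List[Span]]:
    """Project every chain; keep only chains that still have at least two mentions."""
    projected = []
    for chain in chains:
        spans = [p for p in (project_span(m, alignment) for m in chain) if p is not None]
        if len(spans) >= 2:
            projected.append(spans)
    return projected


# English:  John drove to Judy 's house . He made her dinner .
#           0    1     2  3    4  5     6 7  8    9   10     11
# Swedish (illustrative translation, not from the source):
#           John körde till Judys hus . Han lagade middag åt henne .
#           0    1     2    3     4   5 6   7      8      9  10    11
# Hand-written alignment (assumption): English index -> set of Swedish indices.
alignment = {0: {0}, 1: {1}, 2: {2}, 3: {3}, 4: {3}, 5: {4}, 6: {5},
             7: {6}, 8: {7}, 9: {10}, 10: {8}, 11: {11}}

english_chains = [[(0, 0), (7, 7)],   # John ... He
                  [(3, 3), (9, 9)]]   # Judy ... her

swedish_chains = project_chains(english_chains, alignment)
print(swedish_chains)
```

Dropping unaligned mentions and chains that shrink to a single mention is one simple design choice; it trades recall for cleaner weak labels.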
Popular Abstract (Swedish)
Coreference is a relationship between two or more expressions in a text when these expressions refer to the same person or thing. Coreference solving, the identification of sets of coreferring mentions in a text, is a well-studied problem in the field of natural language processing (NLP), the computational analysis of text. As an example, consider this short text:

John drove to Judy’s house. He made her dinner.

which contains four noun phrases: John, Judy, he and her. A reader would intuitively connect John with he and Judy with her and surmise that John cooked dinner for Judy. By using linguistic terms we would say that the reader has solved the text’s coreference and that the links the reader previously surmised were coreferring noun phrases.

In most cases the best coreference solvers are humans, but human labour has a high resource cost and would therefore be unfeasible for most tasks; it is often better to train a computer to do the work instead, even though the results are less impressive.

To train a coreference solver, one would need to gather a large collection of text containing manually annotated coreferences. The identified coreferences are then used to train a coreference solver by comparing coreferring and non-coreferring noun phrases. Some languages are fortunate with large amounts of training data, while some languages such as Swedish have very small or nonexistent data sets for this particular task. A good rule of thumb says that the minimum training size is in the vicinity of a million words. For Swedish, there exists only one data set with 20,000 words. Besides Swedish, many languages lack large training data sets.

In the absence of a large annotated data set, distant supervision offers a possible path forward. Distant supervision in the context of our Master’s thesis means that we identify identical sentences in different languages, solve coreferences for one language, and try to map them to the other language. The initial solution or the transfer may be incorrect, but given sufficiently large texts the errors would hopefully be negligible.

The goal for our Master’s thesis is the creation of coreference solvers for the Swedish and German languages using this method.

Although the methods we describe have been used with some success in other languages, to the best of our knowledge, we are the first to create a coreference solver for Swedish using this technique. We hope our results will pave the way for the creation of coreference solvers competitive with the current state of the art achieved by supervised training techniques.
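The popular abstract says a solver is trained "by comparing coreferring and non-coreferring noun phrases". One common way to set that up is to turn annotated chains into labelled mention pairs. The sketch below illustrates that general idea on the John/Judy example; it is an assumption about how such training instances could be built, not a description of the thesis's actual feature set or learning algorithm, and the function name `mention_pairs` is hypothetical.

```python
from itertools import combinations
from typing import List, Tuple


def mention_pairs(mentions: List[str],
                  chains: List[List[str]]) -> List[Tuple[str, str, bool]]:
    """Label every pair of mentions True if both belong to the same chain."""
    chain_of = {m: i for i, chain in enumerate(chains) for m in chain}
    pairs = []
    for a, b in combinations(mentions, 2):
        same = a in chain_of and chain_of.get(a) == chain_of.get(b)
        pairs.append((a, b, same))
    return pairs


# Mentions in textual order, and the two chains a reader would recover.
mentions = ["John", "Judy", "He", "her"]
chains = [["John", "He"], ["Judy", "her"]]

pairs = mention_pairs(mentions, chains)
positives = [(a, b) for a, b, label in pairs if label]
negatives = [(a, b) for a, b, label in pairs if not label]
print(positives)
print(negatives)
```

With distant supervision the chains would come from the projected English output rather than from manual annotation, so some of these labels would be noisy.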
author
Wallin, Alexander LU
supervisor
organization
alternative title
Lösandet av svensk och tysk koreferens utan handannoterad träningsdata (Solving Swedish and German coreference without hand-annotated training data)
course
EDA920 20162
year
type
H3 - Professional qualifications (4 Years - )
subject
keywords
coreference resolution, distance supervision, machine-learning, multilingual, Swedish, German
publication/series
LU-CS-EX 2017-03
report number
LU-CS-EX 2017-03
ISSN
1650-2884
language
English
id
8904070
date added to LUP
2017-03-03 09:32:42
date last changed
2017-03-03 09:32:42
@misc{8904070,
  abstract     = {It is said that coreference is difficult to explain, but easy to comprehend; everyone knows coreference, they just don’t know that they do. We trained a computer to know it too!

Coreference resolution is the identification of phrases that refer to the same entity in a text. Current techniques to solve coreferences use machine-learning algorithms, which require large annotated data sets. Such annotated resources are not available for most languages today. In this report, we describe a method for solving coreference for Swedish and German without annotated texts using distant supervision. We generate a weakly labelled training set using multilingual corpora, where we solve the coreference for English using CoreNLP and transfer it to Swedish and German using word alignment. Additionally, we identify mentions from dependency graphs in both languages using handwritten rules. Finally, we evaluate the end-to-end results using the evaluation framework from the CoNLL 2012 shared task where we obtain an F-measure of 34.98 for Swedish and 13.16 for German.},
  author       = {Wallin, Alexander},
  issn         = {1650-2884},
  keyword      = {coreference resolution,distance supervision,machine-learning,multilingual,Swedish,German},
  language     = {eng},
  note         = {Student Paper},
  series       = {LU-CS-EX 2017-03},
  title        = {Creating a coreference solver for Swedish and German using distant supervision},
  year         = {2017},
}