Tone restoration in transcribed Kammu: decision-list word sense disambiguation for an unwritten language

Uneson, Marcus

Uneson, Marcus (LU) (2013). Nodalida 2013, vol. 85, p. 399-410

Abstract: The RWAAI (Repository and Workspace for Austroasiatic Intangible heritage) project aims at building a digital archive out of existing legacy data from the Austroasiatic language family. One aspect of the project is the preservation of analogue legacy data. In this context, we have at hand a large number of mostly-phonemic transcriptions of narrative monologues, often with accompanying sound recordings, in the unwritten Kammu language of northern Laos. Some of the transcriptions, however, lack tone marks, which for a tonal language such as Kammu makes them substantially less useful. The problem of restoring tones can be recast as one of word sense disambiguation, or, more generally, lexical ambiguity resolution. We attack it by decision lists, along the lines of Yarowsky (1994), using the tone-marked part of the corpus (120k words) as training data. The performance ceiling of this corpus is uncertain: the stories were all annotated, primarily for human rather than machine consumption, by a single person over almost 40 years, with slowly emerging idiosyncratic conventions. Thus, both inter-annotator and intra-annotator agreement figures are unknown. Nevertheless, with the data from this one annotator as a gold standard, we improve from an already-high baseline accuracy of 95.7% to 97.2% (by 10-fold cross-validation).
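The decision-list approach named in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the feature set (adjacent words plus a small window), the smoothing constant `alpha`, the function names, and the toy tokens (`ka`, `ka1`, `ka2`, `a`, `b`, ...) are all assumptions made for illustration and are not real Kammu data. Each toneless form is treated as an ambiguous word, each tone-marked form as one of its senses, and the most frequent tone-marked form serves as the baseline fallback.

```python
import math
from collections import Counter, defaultdict

def features(tokens, i):
    """Context features for position i: adjacent words and a +/-3-word window."""
    feats = []
    if i > 0:
        feats.append(("prev", tokens[i - 1]))
    if i + 1 < len(tokens):
        feats.append(("next", tokens[i + 1]))
    for j in range(max(0, i - 3), min(len(tokens), i + 4)):
        if j != i:
            feats.append(("win", tokens[j]))
    return feats

def train(examples, alpha=0.1):
    """examples: list of (toneless_tokens, index, tone_marked_form)."""
    sense_counts = defaultdict(Counter)  # toneless form -> counts of toned forms
    feat_counts = defaultdict(Counter)   # (form, feature) -> counts of toned forms
    for tokens, i, sense in examples:
        form = tokens[i]
        sense_counts[form][sense] += 1
        for f in set(features(tokens, i)):
            feat_counts[(form, f)][sense] += 1
    rules = defaultdict(list)
    for (form, f), counts in feat_counts.items():
        total = sum(counts.values())
        for sense, c in counts.items():
            # Smoothed log-likelihood ratio of this sense against all others.
            score = math.log((c + alpha) / (total - c + alpha))
            if score > 0:
                rules[form].append((score, f, sense))
    for form in rules:
        rules[form].sort(key=lambda r: -r[0])  # strongest evidence first
    # Baseline fallback: the most frequent toned form of each toneless form.
    default = {form: c.most_common(1)[0][0] for form, c in sense_counts.items()}
    return rules, default

def restore(tokens, i, rules, default):
    """Return the toned form chosen by the first matching rule, else the default."""
    form = tokens[i]
    present = set(features(tokens, i))
    for _score, f, sense in rules.get(form, []):
        if f in present:
            return sense
    return default.get(form, form)

# Invented toy data: toneless "ka" is either "ka1" or "ka2".
examples = [
    (["a", "ka", "b"], 1, "ka1"),
    (["a", "ka", "c"], 1, "ka1"),
    (["a", "ka", "e"], 1, "ka1"),
    (["d", "ka", "b"], 1, "ka2"),
    (["d", "ka", "c"], 1, "ka2"),
]
rules, default = train(examples)
print(restore(["d", "ka", "z"], 1, rules, default))  # rule on preceding "d" fires
print(restore(["q", "ka", "z"], 1, rules, default))  # no rule matches: baseline
```

Only the single strongest matching rule decides each case, which is the defining property of a decision list; evidence from weaker rules is not combined. In an evaluation such as the 10-fold cross-validation mentioned above, `train` would be rerun on each training split and `restore` scored on the held-out split.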

Please use this url to cite or link to this publication: https://lup.lub.lu.se/record/3800276

author

Uneson, Marcus (LU)

organization

General Linguistics

publishing date

2013

type

Chapter in Book/Report/Conference proceeding

publication status

published

subject

Comparative Language Studies and Linguistics

keywords

word sense disambiguation, Kammu, decision lists, lexical ambiguity resolution, tone restoration, legacy data

host publication

Linköping Electronic Conference Proceedings

editor

Oepen, Stephan; Hagen, Kristin and Bondi Johannessen, Janne

volume

85

pages

399 - 410

conference name

Nodalida 2013

conference dates

2013-05-23

ISSN

1650-3740

1650-3686

language

English

LU publication?

yes

additional info

The information about affiliations in this record was updated in December 2015. The record was previously connected to the following departments: Linguistics and Phonetics (015010003)

id

b3c199bf-ea4d-4891-a6a6-dd6349a54da8 (old id 3800276)

alternative location

http://www.ep.liu.se/ecp_article/index.en.aspx?issue=085;article=036

date added to LUP

2016-04-04 08:33:51

date last changed

2025-04-04 15:04:59

@inproceedings{b3c199bf-ea4d-4891-a6a6-dd6349a54da8,
  abstract     = {{The RWAAI (Repository and Workspace for Austroasiatic Intangible heritage) project aims at building a digital archive out of existing legacy data from the Austroasiatic language family. One aspect of the project is the preservation of analogue legacy data. In this context, we have at our hands a large number of mostly-phonemic transcriptions of narrative monologues, often with accompanying sound recordings, in the unwritten Kammu language of northern Laos. Some of the transcriptions, however, lack tone marks, which for a tonal language such as Kammu makes them substantially less useful. The problem of restoring tones can be recast as one of word sense disambiguation, or, more generally, lexical ambiguity resolution. We attack it by decision lists, along the lines of Yarowsky (1994), using the tone-marked part of the corpus (120kW) as training data. The performance ceiling of this corpus is uncertain: the stories were all annotated, primarily for human rather than machine consumption, by a single person during almost 40 years, with slowly emerging idiosyncratic conventions. Thus, both inter-annotator and intra-annotator agreement figures are unknown. Nevertheless, with the data from this one annotator as a gold standard, we improve from an already-high baseline accuracy of 95.7% to 97.2% (by 10-fold cross-validation).}},
  author       = {{Uneson, Marcus}},
  booktitle    = {{Linköping Electronic Conference Proceedings}},
  editor       = {{Oepen, Stephan and Hagen, Kristin and Bondi Johannessen, Janne}},
  issn         = {{1650-3740}},
  keywords     = {{word sense disambiguation; Kammu; decision lists; lexical ambiguity resolution; tone restoration; legacy data}},
  language     = {{eng}},
  pages        = {{399--410}},
  title        = {{Tone restoration in transcribed Kammu: decision-list word sense disambiguation for an unwritten language}},
  url          = {{https://lup.lub.lu.se/search/files/5185029/3800349.pdf}},
  volume       = {{85}},
  year         = {{2013}},
}

Lund University Publications

LUND UNIVERSITY LIBRARIES