Challenges of releasing audio material for spoken data : The case of the London–Lund Corpus 2

Pöldvere, Nele; Frid, Johan; Johansson, Victoria; Paradis, Carita

(2021) In Research in Corpus Linguistics 9(1). p.35-62

Abstract: This article aims to describe key challenges of preparing and releasing audio material for spoken data and to propose solutions to these challenges. We draw on our experience of compiling the new London-Lund Corpus 2 (LLC-2), where transcripts are released together with the audio files. However, making the audio material publicly available required careful consideration of how to, most effectively, 1) align the transcripts with the audio and 2) anonymise personal information in the recordings. First, audio-to-text alignment was solved through the insertion of timestamps in front of speaker turns in the transcription stage, which, as we show in the article, may later be used as a valuable complement to more robust automatic segmentation. Second, anonymisation was done by means of a Praat script, which replaced all personal information with a sound that made the lexical information incomprehensible but retained the prosodic characteristics. The public release of the LLC-2 audio material is a valuable feature of the corpus that allows users to extend the corpus data relative to their own research interests and, thus, broaden the scope of corpus linguistics. To illustrate this, we present three studies that have successfully used the LLC-2 audio material.

Please use this url to cite or link to this publication: https://lup.lub.lu.se/record/7f479b4e-f51e-40b3-b705-bb7fd5b029be

author

Pöldvere, Nele (LU); Frid, Johan (LU); Johansson, Victoria (LU) and Paradis, Carita (LU)

organization

publishing date

2021

type

Contribution to journal

publication status

published

subject

Comparative Language Studies and Linguistics

keywords

audio-to-text alignment, anonymisation, corpus compilation, spoken corpora, prosody, Praat

in

Research in Corpus Linguistics

volume

9

issue

1

pages

28 pages

publisher

Spanish Association for Corpus Linguistics

external identifiers

scopus:85114329797

ISSN

2243-4712

DOI

10.32714/ricl.09.01.04

project

The London-Lund Corpus 2 of spoken British English (LLC 2)

language

English

LU publication?

yes

id

7f479b4e-f51e-40b3-b705-bb7fd5b029be

date added to LUP

2021-06-07 16:48:51

date last changed

2026-01-13 04:09:06

@article{7f479b4e-f51e-40b3-b705-bb7fd5b029be,
  abstract     = {{This article aims to describe key challenges of preparing and releasing audio material for spoken data and to propose solutions to these challenges. We draw on our experience of compiling the new London-Lund Corpus 2 (LLC-2), where transcripts are released together with the audio files. However, making the audio material publicly available required careful consideration of how to, most effectively, 1) align the transcripts with the audio and 2) anonymise personal information in the recordings. First, audio-to-text alignment was solved through the insertion of timestamps in front of speaker turns in the transcription stage, which, as we show in the article, may later be used as a valuable complement to more robust automatic segmentation. Second, anonymisation was done by means of a Praat script, which replaced all personal information with a sound that made the lexical information incomprehensible but retained the prosodic characteristics. The public release of the LLC-2 audio material is a valuable feature of the corpus that allows users to extend the corpus data relative to their own research interests and, thus, broaden the scope of corpus linguistics. To illustrate this, we present three studies that have successfully used the LLC-2 audio material.}},
  author       = {{Pöldvere, Nele and Frid, Johan and Johansson, Victoria and Paradis, Carita}},
  issn         = {{2243-4712}},
  keywords     = {{audio-to-text alignment; anonymisation; corpus compilation; spoken corpora; prosody; Praat}},
  language     = {{eng}},
  number       = {{1}},
  pages        = {{35--62}},
  publisher    = {{Spanish Association for Corpus Linguistics}},
  series       = {{Research in Corpus Linguistics}},
  title        = {{Challenges of releasing audio material for spoken data : The case of the London–Lund Corpus 2}},
  url          = {{http://dx.doi.org/10.32714/ricl.09.01.04}},
  doi          = {{10.32714/ricl.09.01.04}},
  volume       = {{9}},
  year         = {{2021}},
}

Lund University Publications

LUND UNIVERSITY LIBRARIES