
Perception, Analysis and Synthesis of Speaker Age

Schötz, Susanne LU (2006) In Travaux de l'Institut de Linguistique de Lund 47.
Abstract (Swedish)
Popular Abstract in Swedish


Speaker age is an important paralinguistic property of speech that should be taken into account in the study of phonetic variation. Knowledge of speaker age can be used to improve speech technology applications such as automatic speech recognition and speech synthesis. This doctoral thesis describes six studies that examined several aspects of age-related variation in speech.



As the vocal apparatus changes from early adulthood to old age, speech is affected in several ways. People can judge speaker age fairly well using cues in, among other things, pitch, speech rate and voice quality. It is still unclear, however, which cues matter most. The first study in this thesis examined how fundamental frequency (F0) and speech rate (word duration) affect listeners' perception of speaker age. The results showed that these features appear to be less important than spectral features (e.g. formant frequencies), but also that both nevertheless correlated with chronological as well as perceived age.



The second study compared two stimulus types (words and spontaneous speech) of different lengths. Longer stimulus durations (regardless of type) appeared to improve the estimation of female speaker age, whereas spontaneous speech (regardless of duration) appeared to contain more important cues for the perception of male age.



In the following two studies, several automatic estimators of speaker age were constructed.



These were used to examine a range of acoustic features that may be relevant to machine estimation of age; prosodic features appeared to be more important for estimating female age, and spectral features (e.g. F2) for male age. The automatic age estimators did not, however, reach the performance of human listeners.



Although numerous acoustic correlates of speaker age are known, their relative importance has not yet been established. The next study analysed 161 acoustic features, measured automatically in six words spoken by 527 speakers. Normalised means were used to allow direct comparison of the features. Speech rate (segment duration) and intensity range were identified as the most important acoustic correlates of speaker age. F0 and some spectral features (e.g. F1 and F2) also appear to be usable as age cues, at least in combination with other features.



Synthetic speech could sound more natural if speaker age were included as a parameter. In the final study, a research tool was developed for simulating speaker age using data-driven formant synthesis and age-weighted linear interpolation between the ages of four female reference speakers. An evaluation of the tool showed that synthetic voices with simulated age were judged to be about as old as natural voices of the same age. The tool will be used in further analysis-by-synthesis studies of speaker age.
Abstract
Speaker age is an important paralinguistic feature in speech which has to be considered in the study of phonetic variation. Knowledge about this feature may be used to improve speech technology applications, e.g. automatic speech recognition and speech synthesis. The present thesis describes six studies of several phonetic aspects of age-related variation in speech.



As the speech production mechanism changes from young adulthood to old age, speech is affected in numerous ways. Human perception of speaker age is based on cues such as pitch, speech rate and voice quality, and is fairly accurate. However, it is still unclear which cues are the most important ones. The first study included in this thesis investigated the role of F0 and speech rate (word duration) in age perception. It was found that while these cues may be less important than spectral ones (e.g. formant frequencies), they still correlate with chronological as well as perceived age.



In the second study, two stimulus types of various lengths were compared. Results indicated that while longer stimulus duration (regardless of speech type) seems to improve the age estimation of females, spontaneous speech (regardless of duration) appears to contain more important cues for perception of male speaker age.



In the next two studies, several automatic estimators of speaker age were built, none of which reached the same accuracy as humans. Important features in machine perception of age were also investigated. It was found that prosodic features seem to be more important in the estimation of female age, while spectral features (e.g. F2) appear to be more important for male age.



Although several acoustic correlates of speaker age are known, their relative importance has not yet been established. The next study analysed 161 features, automatically extracted from segments in six words produced by 527 speakers. Normalised means were used to ensure that the features could be compared directly. The most important acoustic correlates of speaker age were identified as speech rate (segment duration) and intensity range. However, F0 and some spectral measures (e.g. F1 and F2) may also be important correlates of age if used in combination with other features.
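
The abstract does not say how the means were normalised. As a minimal sketch of the general idea, assuming z-score normalisation (an assumption, not the thesis's documented procedure), features measured in different units can be put on a common scale and then compared by their correlation with age. The function names and the example values below are hypothetical.

```python
from statistics import mean, pstdev


def z_normalise(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    centre = mean(values)
    spread = pstdev(values)
    if spread == 0:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - centre) / spread for v in values]


def pearson(xs, ys):
    """Pearson correlation, computed as the mean product of z-scores."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    zx, zy = z_normalise(xs), z_normalise(ys)
    return mean(a * b for a, b in zip(zx, zy))


if __name__ == "__main__":
    # Hypothetical illustration data, not taken from the thesis.
    ages = [25, 40, 60, 80]
    features = {
        "segment_duration_ms": [70.0, 75.0, 85.0, 95.0],
        "intensity_range_db": [32.0, 30.0, 27.0, 25.0],
    }
    for name, values in features.items():
        print(name, round(pearson(ages, values), 3))
```

Because both features are normalised before comparison, the strength of their relation to age can be read off the same scale even though one is in milliseconds and the other in decibels.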



Synthetic speech may sound more natural if speaker age is included as a parameter. The final study developed a research tool which used data-driven formant synthesis and age-weighted linear interpolation to simulate an age between the ages of any two of four differently aged female reference speakers. Evaluation of the tool showed that speaker age may in fact be simulated using formant synthesis. The tool will be used in further studies of analysis by synthesis of speaker age.
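
Age-weighted linear interpolation between two reference speakers can be sketched as follows. This is an illustration under stated assumptions, not the tool described in the thesis: the function name, the parameter names and the numeric values are all hypothetical, and the synthesis parameters are assumed to be simple per-speaker numbers.

```python
def interpolate_parameters(target_age, younger, older):
    """Linearly interpolate synthesis parameters between two reference speakers.

    younger and older are (age, parameters) pairs, where parameters maps a
    parameter name to a number. The weight of the older speaker grows
    linearly from 0 at the younger speaker's age to 1 at the older one's.
    """
    age_young, params_young = younger
    age_old, params_old = older
    if not age_young < age_old:
        raise ValueError("younger reference must be younger than older reference")
    if not age_young <= target_age <= age_old:
        raise ValueError("target age must lie between the two reference ages")
    if params_young.keys() != params_old.keys():
        raise ValueError("both speakers must define the same parameters")

    weight_old = (target_age - age_young) / (age_old - age_young)
    return {
        name: (1.0 - weight_old) * params_young[name] + weight_old * params_old[name]
        for name in params_young
    }


if __name__ == "__main__":
    # Hypothetical reference speakers and parameter values.
    young = (20, {"f0_hz": 220.0, "segment_duration_ms": 70.0})
    old = (80, {"f0_hz": 180.0, "segment_duration_ms": 100.0})
    print(interpolate_parameters(50, young, old))
```

With four reference speakers, a target age would be simulated by picking the two references whose ages bracket it and interpolating between those.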
author
Schötz, Susanne
supervisor
opponent
  • Associate Professor Möbius, Bernd, Institut für Maschinelle Sprachverarbeitung, Stuttgart University
organization
publishing date
2006
type
Thesis
publication status
published
subject
keywords
phonology, perceptual cues, speaker age, automatic speaker recognition, acoustic analysis, acoustic correlates, data-driven, Phonetics, formant synthesis, Fonetik, fonologi, Technological sciences, Teknik
in
Travaux de l'Institut de Linguistique de Lund
volume
47
pages
200 pages
publisher
Linguistics and Phonetics
defense location
Hörsalen, Humanisthuset, Språk- och Litteraturcentrum, Helgonabacken 12, Lund
defense date
2006-12-02 13:15
ISSN
0347-2558
ISBN
91-974116-4-7
project
SweDia 2000 - De svenska dialekternas fonetik och fonologi år 2000
language
English
LU publication?
yes
id
6d194c9f-2a71-486d-b560-f61cddb3d8e8 (old id 25959)
date added to LUP
2007-06-05 13:55:31
date last changed
2018-11-21 20:47:32
@phdthesis{6d194c9f-2a71-486d-b560-f61cddb3d8e8,
  abstract     = {Speaker age is an important paralinguistic feature in speech which has to be considered in the study of phonetic variation. Knowledge about this feature may be used to improve speech technology applications, e.g. automatic speech recognition and speech synthesis. The present thesis describes six studies of several phonetic aspects of age-related variation in speech.

As the speech production mechanism changes from young adulthood to old age, speech is affected in numerous ways. Human perception of speaker age is based on cues such as pitch, speech rate and voice quality, and is fairly accurate. However, it is still unclear which cues are the most important ones. The first study included in this thesis investigated the role of F0 and speech rate (word duration) in age perception. It was found that while these cues may be less important than spectral ones (e.g. formant frequencies), they still correlate with chronological as well as perceived age.

In the second study, two stimulus types of various lengths were compared. Results indicated that while longer stimulus duration (regardless of speech type) seems to improve the age estimation of females, spontaneous speech (regardless of duration) appears to contain more important cues for perception of male speaker age.

In the next two studies, several automatic estimators of speaker age were built, none of which reached the same accuracy as humans. Important features in machine perception of age were also investigated. It was found that prosodic features seem to be more important in the estimation of female age, while spectral features (e.g. F2) appear to be more important for male age.

Although several acoustic correlates of speaker age are known, their relative importance has not yet been established. The next study analysed 161 features, automatically extracted from segments in six words produced by 527 speakers. Normalised means were used to ensure that the features could be compared directly. The most important acoustic correlates of speaker age were identified to be speech rate (segment duration) and intensity range. However, F0 and some spectral measures (e.g. F1 and F2) may also, if used in combination with other features, be important correlates of age.

Synthetic speech may sound more natural if speaker age is included as a parameter. The final study developed a research tool which used data-driven formant synthesis and age-weighted linear interpolation to simulate an age between the ages of any two of four female differently aged reference speakers. Evaluation of the tool showed that speaker age may in fact be simulated using formant synthesis. The tool will be used in further studies of analysis by synthesis of speaker age.},
  author       = {Schötz, Susanne},
  isbn         = {91-974116-4-7},
  issn         = {0347-2558},
  keyword      = {phonology,perceptual cues,speaker age,automatic speaker recognition,acoustic analysis,acoustic correlates,data-driven,Phonetics,formant synthesis,Fonetik,fonologi,Technological sciences,Teknik},
  language     = {eng},
  pages        = {200},
  publisher    = {Linguistics and Phonetics},
  school       = {Lund University},
  series       = {Travaux de l'Institut de Linguistique de Lund},
  title        = {Perception, Analysis and Synthesis of Speaker Age},
  volume       = {47},
  year         = {2006},
}