Language-Agnostic Age and Gender Classification of Voice using Self-supervised Pre-Training

Lastow, Fredrik; Ekberg, Edwin; Nugues, Pierre

(2022) In 34th Workshop of the Swedish Artificial Intelligence Society, SAIS 2022

Abstract: Extracting speaker-dependent paralinguistic information out of a person's voice provides an opportunity for adaptive behaviour related to speaker information in speech processing applications. For instance, in audio-based conversational applications, adapting responses to the attributes of the correspondent is an integral part in making the conversations effective. Two speaker attributes that humans can estimate quite well, based solely on hearing a person speak, are the gender and age of that person. However, in the field of speech processing, age and gender classification are relatively unexplored tasks, especially in a multilingual setting. In most cases, hand-crafted features, such as MFCCs, have been used with some success. However, recently large transformer networks, utilizing self-supervised pre-training, have shown promise in creating general speech embeddings for various speech processing tasks. We present a baseline for gender and age detection, in both monolingual and multilingual settings, for multiple state-of-the-art speech processing models, fine-tuned for age classification. We created four different datasets with data extracted from the Common Voice project to compare monolingual and multilingual performances. For gender classification, we could reach a macro average F1 score of 96% in both a monolingual and multilingual setting. For age classification, using classes with a size of 10 years, we obtained a macro average mean absolute class error (MACE) of 0.68 and 0.86 on monolingual and multilingual datasets, respectively. For the English TIMIT dataset, we improve upon the previous state of the art for both age regression and gender classification. Our fine-tuned WavLM model reaches a mean absolute error (MAE) of 4.11 years for males and 4.44 for females in age estimation and our fine-tuned UniSpeech-SAT model reaches an accuracy of 99.8% for gender classification. All the models were deemed fast enough on a GPU to be used in real-time settings, and accurate enough, using only a small amount of speech, to be applicable in multilingual speech processing applications.
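The abstract reports a macro average mean absolute class error (MACE) over 10-year age classes but does not define it. The sketch below shows one plausible reading, offered as an assumption rather than the authors' implementation: ages are bucketed by integer division, the absolute class distance is averaged within each true class, and the per-class means are then averaged. The function names and the sample ages are hypothetical.

```python
from collections import defaultdict


def age_to_class(age_years: int, width: int = 10) -> int:
    """Map an age in years to a class index (assumed: 0-9 -> 0, 10-19 -> 1, ...)."""
    return age_years // width


def macro_mace(true_classes, predicted_classes):
    """Macro-averaged mean absolute class error (assumed definition).

    The absolute class error |true - predicted| is averaged within each
    true class first; the per-class means are then averaged, so every
    class counts equally regardless of how many samples it holds.
    """
    errors_by_class = defaultdict(list)
    for t, p in zip(true_classes, predicted_classes):
        errors_by_class[t].append(abs(t - p))
    per_class_means = [sum(e) / len(e) for e in errors_by_class.values()]
    return sum(per_class_means) / len(per_class_means)


# Hypothetical ages for illustration only, not data from the paper.
true_ages = [23, 27, 35, 61]
predicted_ages = [24, 38, 52, 45]

y_true = [age_to_class(a) for a in true_ages]
y_pred = [age_to_class(a) for a in predicted_ages]
score = macro_mace(y_true, y_pred)
```

Under this reading, a MACE of 0.68 would mean that predictions land, on average across classes, a little under one 10-year class away from the true class.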

Please use this url to cite or link to this publication: https://lup.lub.lu.se/record/ef830ed9-6417-4e26-a15b-b17163e3cd37

author

Lastow, Fredrik; Ekberg, Edwin and Nugues, Pierre (LU)

organization

publishing date

2022

type

Chapter in Book/Report/Conference proceeding

publication status

published

subject

Natural Language Processing

host publication

34th Workshop of the Swedish Artificial Intelligence Society, SAIS 2022

series title

34th Workshop of the Swedish Artificial Intelligence Society, SAIS 2022

publisher

IEEE - Institute of Electrical and Electronics Engineers Inc.

conference name

34th Workshop of the Swedish Artificial Intelligence Society, SAIS 2022

conference location

Stockholm, Sweden

conference dates

2022-06-13 - 2022-06-14

external identifiers

scopus:85136150718

ISBN

9781665471268

DOI

10.1109/SAIS55783.2022.9833071

language

English

LU publication?

yes

id

ef830ed9-6417-4e26-a15b-b17163e3cd37

date added to LUP

2022-09-08 12:26:26

date last changed

2025-10-14 13:10:13

@inproceedings{ef830ed9-6417-4e26-a15b-b17163e3cd37,
  abstract     = {{Extracting speaker-dependent paralinguistic information out of a person's voice provides an opportunity for adaptive behaviour related to speaker information in speech processing applications. For instance, in audio-based conversational applications, adapting responses to the attributes of the correspondent is an integral part in making the conversations effective. Two speaker attributes that humans can estimate quite well, based solely on hearing a person speak, are the gender and age of that person. However, in the field of speech processing, age and gender classification are relatively unexplored tasks, especially in a multilingual setting. In most cases, hand-crafted features, such as MFCCs, have been used with some success. However, recently large transformer networks, utilizing self-supervised pre-training, have shown promise in creating general speech embeddings for various speech processing tasks. We present a baseline for gender and age detection, in both monolingual and multilingual settings, for multiple state-of-the-art speech processing models, fine-tuned for age classification. We created four different datasets with data extracted from the Common Voice project to compare monolingual and multilingual performances. For gender classification, we could reach a macro average F1 score of 96% in both a monolingual and multilingual setting. For age classification, using classes with a size of 10 years, we obtained a macro average mean absolute class error (MACE) of 0.68 and 0.86 on monolingual and multilingual datasets, respectively. For the English TIMIT dataset, we improve upon the previous state of the art for both age regression and gender classification. Our fine-tuned WavLM model reaches a mean absolute error (MAE) of 4.11 years for males and 4.44 for females in age estimation and our fine-tuned UniSpeech-SAT model reaches an accuracy of 99.8% for gender classification. All the models were deemed fast enough on a GPU to be used in real-time settings, and accurate enough, using only a small amount of speech, to be applicable in multilingual speech processing applications.}},
  author       = {{Lastow, Fredrik and Ekberg, Edwin and Nugues, Pierre}},
  booktitle    = {{34th Workshop of the Swedish Artificial Intelligence Society, SAIS 2022}},
  isbn         = {{9781665471268}},
  language     = {{eng}},
  publisher    = {{IEEE - Institute of Electrical and Electronics Engineers Inc.}},
  series       = {{34th Workshop of the Swedish Artificial Intelligence Society, SAIS 2022}},
  title        = {{Language-Agnostic Age and Gender Classification of Voice using Self-supervised Pre-Training}},
  url          = {{http://dx.doi.org/10.1109/SAIS55783.2022.9833071}},
  doi          = {{10.1109/SAIS55783.2022.9833071}},
  year         = {{2022}},
}

Lund University Publications
