LUP Student Papers

LUND UNIVERSITY LIBRARIES

Speaker Recognition using Biology-Inspired Feature Extraction

Andersson, Edvin LU (2021) EITM01 20211
Department of Electrical and Information Technology
Abstract
Distinguishing between people's voices is something the human brain does naturally, using only frequencies picked up by the inner ear. The field of speaker recognition is concerned with making machines do the same thing using digitally sampled speech and data processing. The processing extracts relevant information about the speech from the high-dimensional acoustic data, which can help the machine understand to which speaker a speech sample belongs. Several methods exist to solve this problem, most of which are based on modelling a sample as a sequence of time frames, each representing the current frequency characteristics of the sound input. A common choice of frequency characteristics is Mel-Frequency Cepstral Coefficients (MFCC), which represent the overall shape of the frequency spectrum of the input during each time frame. This thesis presents a different approach, inspired by findings on how the human brain processes tactile sensory input, which lets an unsupervised learning model pick out important combinations of frequencies from the signal. These combinations of frequencies stand out because they exhibit a spatiotemporal relationship across multiple data samples and speakers: their intensities correlate in time. Extracting spatiotemporal patterns between input frequencies as features, instead of the overall spectrum shape, can lead to new, more robust ways of encoding auditory data.
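
To make the contrast concrete, here is a minimal Python sketch of the two feature views described above. It assumes librosa is available and that speech.wav is a hypothetical 16 kHz recording; the correlation step only illustrates the general idea of frequency bands whose intensities co-vary over time, and is not the thesis's actual unsupervised model.

import numpy as np
import librosa

# Load a (hypothetical) speech sample.
y, sr = librosa.load("speech.wav", sr=16000)

# Conventional features: MFCCs summarize the overall spectrum shape
# within each time frame.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)      # shape (13, n_frames)

# Alternative view: examine how individual frequency bands co-vary over
# time. Bands whose intensities correlate across frames form the kind of
# frequency combination the abstract refers to.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=40)
log_mel = librosa.power_to_db(mel)                      # shape (40, n_frames)
band_corr = np.corrcoef(log_mel)                        # shape (40, 40)

# Report strongly co-varying band pairs (the 0.9 threshold is arbitrary).
rows, cols = np.where(np.triu(band_corr, k=1) > 0.9)
for a, b in zip(rows, cols):
    print(f"mel bands {a} and {b} have intensities that correlate in time")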
Popular Abstract
Recognizing the voices of people you know is probably something you take for granted. However, when considering the task in more detail, it is astounding how the brain manages to distinguish between the voices of almost everyone you've met based solely on the frequencies picked up by your inner ear. Based on recent insights into how the brain processes the sense of touch, this thesis presents a new way of approaching the problem of making sense of sound.

When you recognize someone's voice, you are not distinguishing one particular frequency associated with that person; rather, you recognize a combination of many different frequencies in intricate patterns over time that makes the voice sound familiar. These frequency combinations are a result of a person's vocal tract, which creates resonances at certain frequencies. A person's voice is therefore much like an acoustic fingerprint, as the vocal tract encompasses many different aspects of a person's physiology, from the shape of the tongue to the width of the nostrils. A fingerprint can be photographed and relatively easily reproduced; reverse-engineering a person's vocal tract from speech is a much more difficult task. For this reason, your voice is a good biometric that can be used for authenticating yourself when accessing private information.

Speaker recognition is the scientific field concerned with determining who is speaking. Today this is normally done with machine learning techniques, in particular artificial neural networks. To make all of the intricacies of sound easier for these networks to process, only the very general shape of the frequency representation is used. This approach is efficient but far from how our own human hearing functions.
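
As a concrete illustration of this standard pipeline, the following sketch builds a crude speaker recognizer from the general spectrum shape alone: it averages MFCC frames into one embedding per recording and assigns a test sample to the nearest enrolled speaker. Real systems use neural networks; the nearest-centroid rule and the file names here are hypothetical simplifications.

import numpy as np
import librosa

def embed(path):
    # Average the MFCC frames into a single fixed-length voice embedding.
    y, sr = librosa.load(path, sr=16000)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)

# One enrollment recording per known speaker (hypothetical file names).
speakers = {name: embed(f"{name}.wav") for name in ["alice", "bob"]}

# Assign an unknown sample to the closest enrolled speaker.
test = embed("unknown.wav")
closest = min(speakers, key=lambda name: np.linalg.norm(speakers[name] - test))
print(f"closest enrolled speaker: {closest}")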

At the most basic level, the human brain processes sensory data through the activations of vast numbers of neurons. Recent findings suggest that, for the sense of touch, the brain might make sense of all this incoming data by learning to distinguish recurring temporal patterns in these activations. Based on these principles, this thesis presents a new biologically inspired way of dealing with sound data.
Please use this url to cite or link to this publication:
author
Andersson, Edvin LU
supervisor
organization
Department of Electrical and Information Technology
course
EITM01 20211
year
2021
type
H2 - Master's Degree (Two Years)
subject
keywords
Artificial Intelligence, AI, Machine Learning, Speaker Recognition
report number
LU/LTH-EIT 2021-836
language
English
id
9059963
date added to LUP
2021-08-12 10:15:01
date last changed
2021-08-12 10:15:01
@misc{9059963,
  abstract     = {{Distinguishing between people's voices is something the human brain does naturally, using only frequencies picked up by the inner ear. The field of speaker recognition is concerned with making machines do the same thing using digitally sampled speech and data processing. The processing extracts relevant information about the speech from the high-dimensional acoustic data, which can help the machine understand to which speaker a speech sample belongs. Several methods exist to solve this problem, most of which are based on modelling a sample as a sequence of time frames, each representing the current frequency characteristics of the sound input. A common choice of frequency characteristics is Mel-Frequency Cepstral Coefficients (MFCC), which represent the overall shape of the frequency spectrum of the input during each time frame. This thesis presents a different approach, inspired by findings on how the human brain processes tactile sensory input, which lets an unsupervised learning model pick out important combinations of frequencies from the signal. These combinations of frequencies stand out because they exhibit a spatiotemporal relationship across multiple data samples and speakers: their intensities correlate in time. Extracting spatiotemporal patterns between input frequencies as features, instead of the overall spectrum shape, can lead to new, more robust ways of encoding auditory data.}},
  author       = {{Andersson, Edvin}},
  language     = {{eng}},
  note         = {{Student Paper}},
  title        = {{Speaker Recognition using Biology-Inspired Feature Extraction}},
  year         = {{2021}},
}