LUP Student Papers

LUND UNIVERSITY LIBRARIES

Speaker verification: Advantages and limitations of a biologically inspired feature extractor

Gajic, Maja LU (2022) EITM01 20221
Department of Electrical and Information Technology
Abstract
Speaker verification is the process of verifying a person's identity based on their voice. The process usually consists of two steps: a feature extractor maps the speech signal into features, and a post processor then classifies those features. The most common features in speaker verification today are the short-time Fourier transform (STFT), mel filter banks (MFBs), and mel-frequency cepstral coefficients (MFCCs), which are different spectral representations of the speech signal. Recently, a biologically inspired feature extractor called the cuneate nucleus (CN) model, which outputs CN features, was created. The main goal of this Master's thesis is to find an optimal ANN post processor for the CN features. Tests of different models on both conventional features and CN features showed that a CNN model and an LSTM model were the most suitable. The performance results showed that the CN features and STFT performed well on noisy data but worse on clean data than the MFCCs and MFBs. A statistical analysis of the features was conducted using cross correlation, average activity, and entropy. The analysis concluded that the inherent dynamical properties of the CN features and STFT make the training process of an ANN difficult, and that performance on clean data is therefore poor. On the other hand, these dynamical properties are what allow the features to perform well on noisy data. In comparison, the MFCCs and MFBs have the opposite inherent properties, which gives them state-of-the-art performance on clean data but poor performance on noisy data. This in turn means that a conventional ANN post processor can only provide limited performance for CN features, and that other post processor methods need to be developed to reach beyond that limit.
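The abstract names three statistics used to compare the features (cross correlation, average activity, and entropy) without giving their formulas. The following is a minimal sketch of one plausible reading, not the thesis's own implementation: average activity is assumed to be the mean absolute value of a feature channel, entropy is assumed to be the Shannon entropy of a histogram of the channel's values, and cross correlation is assumed to be the zero-lag normalized correlation between two channels. All function names, the bin count, and the toy data are hypothetical.

```python
import math
from collections import Counter


def average_activity(channel):
    """Mean absolute value of one feature channel over time (assumed definition)."""
    return sum(abs(x) for x in channel) / len(channel)


def shannon_entropy(channel, bins=8):
    """Shannon entropy in bits of a histogram of the channel's values (assumed definition)."""
    lo, hi = min(channel), max(channel)
    if hi == lo:
        return 0.0  # a constant channel has a single occupied bin
    width = (hi - lo) / bins
    # The maximum value would land in bin index `bins`, so clamp it into the last bin.
    counts = Counter(min(int((x - lo) / width), bins - 1) for x in channel)
    n = len(channel)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def normalized_cross_correlation(a, b):
    """Zero-lag normalized cross correlation of two equally long channels."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # correlation is undefined for a constant channel; report 0
    return cov / math.sqrt(var_a * var_b)


# Hypothetical toy features: two alternating channels in antiphase and one constant channel.
features = {
    "ch0": [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0],
    "ch1": [1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0],
    "ch2": [0.5] * 8,
}

for name, channel in features.items():
    print(name, average_activity(channel), shannon_entropy(channel))
print("ch0 vs ch1:", normalized_cross_correlation(features["ch0"], features["ch1"]))
```

In this toy case all three channels have the same average activity, while the entropy separates the alternating channels from the constant one and the cross correlation exposes that `ch0` and `ch1` are in antiphase. That is the kind of distinction such statistics can make between feature types, under the assumed definitions.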
Popular Abstract
Speaker verification is the process of identifying the person speaking. It is a widely studied field and usually deep neural networks are used. This work aims at doing speaker verification inspired by biology.

We humans can easily distinguish between different voices, and we can often identify a person just by hearing their voice. Furthermore, our ability to identify a person does not notably diminish in noisy conditions. This is an amazing ability, and it is often taken for granted.

Speaker verification is the process of identifying a person by their voice. Today this is a hard task to accomplish, especially in noisy conditions. An example of speaker verification is the voice command feature on our mobile devices. Suppose two people sitting next to each other both use the hands-free voice command option on their phones. If one of them activates voice command, for example by saying "Hey, Google!", only that person's phone should answer. Otherwise, someone using the activation phrase in a crowded area could have multiple mobile devices answering, which would be inconvenient. To deal with this, phone companies usually require you to say the activation phrase a couple of times before you can start using voice command. When you repeat these phrases, you are actually training a deep neural network to specialise in identifying your voice. This is why the scenario above does not occur in real life.

Deep neural networks are the most common method in speaker verification today. These networks function in the following way: they process frequency information from the raw speech signal and, based on that, try to determine who is speaking. However, these methods are not true to how our ears and brain pick up and process voices. In this work, instead of giving the networks frequency information to process, the output of a biologically inspired model was used. This biologically inspired model picks up frequency patterns for the deep neural network to process.
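As a rough illustration of what "frequency information of the raw speech signal" can mean, the sketch below computes a short-time magnitude spectrum with a plain discrete Fourier transform. It is not the thesis's feature pipeline: the frame length, hop size, Hann window, sample rate, and the synthetic 1000 Hz tone standing in for speech are all assumptions chosen to keep the example small, and the function names are hypothetical.

```python
import cmath
import math


def hann_window(length):
    """Periodic Hann window, used to taper each frame before the transform."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / length) for i in range(length)]


def stft_magnitude(signal, frame_len=64, hop=32):
    """Magnitude spectrum of each overlapping frame (non-negative frequencies only)."""
    window = hann_window(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + i] * window[i] for i in range(frame_len)]
        spectrum = []
        for k in range(frame_len // 2 + 1):
            # Direct DFT sum for frequency bin k; an FFT would be used in practice.
            acc = sum(
                frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len)
            )
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames


# Hypothetical input: a pure 1000 Hz tone sampled at 8000 Hz instead of real speech.
sample_rate = 8000
frame_len = 64
signal = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(256)]

spec = stft_magnitude(signal, frame_len=frame_len, hop=32)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * sample_rate / frame_len
print(len(spec), "frames,", len(spec[0]), "bins per frame, peak at", peak_hz, "Hz")
```

Each row of `spec` describes which frequencies are present during one short stretch of the signal. A table of this kind, or a mel-scaled version of it, is what a conventional neural network post processor would receive as input.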
author
Gajic, Maja LU
supervisor
organization
course
EITM01 20221
year
type
H2 - Master's Degree (Two Years)
subject
keywords
Speaker verification, artificial neural network, ANN, feature extractor
report number
LU/LTH-EIT 2022-889
language
English
id
9094150
date added to LUP
2022-09-01 13:33:46
date last changed
2022-09-01 13:33:46
@misc{9094150,
  abstract     = {{Speaker verification is the process of verifying the identity of a person based on voice. This process usually encompasses the following steps: The speech signal is mapped into features using a feature extractor, these features are then classified using a post processor. The most common features used in speaker verification today are STFT, MFBs, and MFCCs, that are different spectral representations of the speech signal. Recently, a biologically inspired feature extractor called the cuneate nucleus (CN) model, that outputs CN features, was created. The main goal of this Master thesis is to find an optimal ANN post processor for the CN features. Testing different models on both conventional features and CN features concluded that a CNN model and a LSTM model were most suitable. The performance result concluded that the CN features and STFT performed well on noisy data but worse on clean data compared to the MFCCs and MFBs. A statistical analysis of the features was conducted using cross correlation, average activity and entropy. The analysis concluded that the inherent dynamical properties of the CN features and STFT make the training process of an ANN difficult, and therefore performance on clean data is poor. On the other hand these dynamical properties is what allows the features to perform well on noise. In comparison, the MFCCs and MFBs have the opposite inherent properties and this allows them to have state-of-the-art performance on clean data but poor performance on noise data. This in turn means that a conventional ANN post processor can only provide limited performance for CN features, and that other post processor methods need to be developed to reach beyond that limit.}},
  author       = {{Gajic, Maja}},
  language     = {{eng}},
  note         = {{Student Paper}},
  title        = {{Speaker verification: Advantages and limitations of a biologically inspired feature extractor}},
  year         = {{2022}},
}