LUP Student Papers

LUND UNIVERSITY LIBRARIES

Voice Feature Extraction Using Siamese Neural Networks for Detecting Impersonators

Almström, Marlon LU and Trân, Thi Thu Hòa LU (2022) In Master's Thesis in Mathematical Sciences MASM02 20221
Mathematical Statistics
Abstract
Voice impersonation is a technique often used by criminals who want to avoid being identified while committing a crime. There are also cases where the police confront a suspect with an incriminating recording, and the suspect denies being the speaker and claims that the voice belongs to an expert impersonator. In both cases, it would be very helpful for the police to be able to predict with high probability whether a recording belongs to the true speaker or an impersonator.

This thesis aims to use neural networks to extract the features most significant for recognizing a unique voice, and then to use them to classify whether a recording belongs to a true speaker or somebody impersonating them. To achieve this, we first extract raw audio features commonly used in speech recognition, the majority of which are spectral features, and then feed these features to a Siamese neural network to generate an encoding that best represents a recording of a person's voice. The structure of a Siamese neural network is determined by the type of loss function used. In this project, we compare the performance of different network structures as well as of different classifiers used to classify the speech from its encoding.

We present our approach and results on a dataset consisting of recordings of prominent American political figures, their impersonators, and several other individuals.
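
The remark that the structure of a Siamese network follows from its loss function can be illustrated with two losses often paired with such networks: a contrastive loss, which compares two encodings at a time, and a triplet loss, which compares an anchor with one same-speaker and one different-speaker encoding. The sketch below is only an illustration in plain Python. The abstract does not state which losses the thesis uses, and the function names, margin values, and toy encodings are assumptions.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length encodings."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_loss(enc_a, enc_b, same_speaker, margin=1.0):
    """One common form of the pairwise loss (two network branches).

    Same-speaker pairs are pulled together; different-speaker pairs
    are pushed apart until they are at least `margin` away.
    """
    d = euclidean(enc_a, enc_b)
    if same_speaker:
        return d ** 2
    return max(0.0, margin - d) ** 2


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Triplet loss (three network branches).

    Zero once the negative is farther from the anchor than the
    positive by at least `margin`.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Toy 2-D encodings (hypothetical values, not taken from the thesis).
speaker = [0.0, 0.0]
speaker_again = [0.3, 0.4]  # distance 0.5 from `speaker`
impersonator = [3.0, 4.0]   # distance 5.0 from `speaker`

pair_same = contrastive_loss(speaker, speaker_again, same_speaker=True)
pair_diff = contrastive_loss(speaker, impersonator, same_speaker=False)
triplet = triplet_loss(speaker, speaker_again, impersonator)
```

A pairwise loss implies a network that processes two inputs with shared weights, while a triplet loss implies three, which is one way to read the abstract's statement about structure. Any downstream classifier would then operate on the encodings rather than on the raw audio features.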
author
Almström, Marlon LU and Trân, Thi Thu Hòa LU
supervisor
organization
course
MASM02 20221
year
2022
type
H2 - Master's Degree (Two Years)
subject
keywords
deep learning, voice impersonation, audio feature extraction, siamese neural networks
publication/series
Master's Thesis in Mathematical Sciences
report number
LUNFMS-3109-2022
ISSN
1404-6342
other publication id
2022:E35
language
English
id
9091079
date added to LUP
2022-09-14 09:23:08
date last changed
2022-09-14 09:23:08
@misc{9091079,
  abstract     = {{Voice impersonation is a technique often used by criminals who want to avoid being identified while committing a crime. There are also cases where the police confront a suspect with an incriminating recording, and the suspect denies being the speaker and claims that the voice belongs to an expert impersonator. In both cases, it would be very helpful for the police to be able to predict with high probability whether a recording belongs to the true speaker or an impersonator.

This thesis aims to use neural networks to extract the features most significant for recognizing a unique voice, and then to use them to classify whether a recording belongs to a true speaker or somebody impersonating them. To achieve this, we first extract raw audio features commonly used in speech recognition, the majority of which are spectral features, and then feed these features to a Siamese neural network to generate an encoding that best represents a recording of a person's voice. The structure of a Siamese neural network is determined by the type of loss function used. In this project, we compare the performance of different network structures as well as of different classifiers used to classify the speech from its encoding.

We present our approach and results on a dataset consisting of recordings of prominent American political figures, their impersonators, and several other individuals.}},
  author       = {{Almström, Marlon and Trân, Thi Thu Hòa}},
  issn         = {{1404-6342}},
  language     = {{eng}},
  note         = {{Student Paper}},
  series       = {{Master's Thesis in Mathematical Sciences}},
  title        = {{Voice Feature Extraction Using Siamese Neural Networks for Detecting Impersonators}},
  year         = {{2022}},
}