Voice Feature Extraction Using Siamese Neural Networks for Detecting Impersonators

Almström, Marlon; Trân, Thi Thu Hòa

Almström, Marlon and Trân, Thi Thu Hòa (2022). In Master's Thesis in Mathematical Sciences, MASM02 20221
Mathematical Statistics

Abstract: Voice impersonation is a technique often used by criminals who want to avoid being identified while committing a crime. There are also cases where the police confront a suspect with an incriminating recording, and the suspect denies being the speaker, claiming that the recording was made by an expert impersonator. In both cases, it would be very helpful for the police to be able to predict with high probability whether a recording belongs to the true speaker or to an impersonator.

This thesis aims to use neural networks to extract the features that are most significant for recognizing a unique voice, and then to use them to classify whether a recording belongs to the true speaker or to somebody impersonating them. To achieve this, we first extract raw audio features commonly used in speech recognition, most of which are spectral features, and then feed these features to a Siamese neural network to generate an encoding that best represents a recording of a person's voice. The structure of a Siamese neural network is determined by the type of loss function used. In this project, we compare the performance of different network structures, as well as of different classifiers applied to the encoding.

We present our approach and results on data consisting of recordings of prominent American political figures, their impersonators, and several other individuals.
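The abstract's remark that the structure of a Siamese network depends on the loss function can be illustrated with a small sketch. The abstract does not say which losses the thesis compares; the contrastive loss (pairs of recordings) and the triplet loss (anchor, positive, negative) used here are two common choices and are assumptions for illustration only. The function names, the margin value, and the toy embeddings are likewise hypothetical and do not come from the thesis.

```python
import numpy as np


def euclidean(a, b):
    """Row-wise Euclidean distance between two batches of embeddings."""
    return np.linalg.norm(a - b, axis=-1)


def contrastive_loss(emb_a, emb_b, same_speaker, margin=1.0):
    """Pair-based loss: the network sees two recordings at a time.

    same_speaker is 1.0 where both embeddings come from the same voice
    and 0.0 otherwise (assumed labelling convention).
    """
    d = euclidean(emb_a, emb_b)
    pull = same_speaker * d ** 2                                  # pull matching pairs together
    push = (1.0 - same_speaker) * np.maximum(margin - d, 0.0) ** 2  # push others beyond the margin
    return float(np.mean(pull + push))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Triplet-based loss: the network sees three recordings at a time.

    positive shares the anchor's speaker; negative is a different speaker,
    for example an impersonator.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return float(np.mean(np.maximum(d_pos - d_neg + margin, 0.0)))


# Toy 2-D embeddings standing in for the encoder output (hypothetical values).
true_voice = np.array([[0.0, 0.0]])
same_voice = np.array([[0.0, 0.0]])
impersonator = np.array([[3.0, 4.0]])

pair_loss = contrastive_loss(true_voice, impersonator, np.array([0.0]))
trip_loss = triplet_loss(true_voice, same_voice, impersonator)
```

With a pair-based loss the shared encoder is applied to two inputs per training example, whereas a triplet-based loss requires three, which is one way the choice of loss shapes the network structure.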

Please use this url to cite or link to this publication: https://lup.lub.lu.se/student-papers/record/9091079

author: Almström, Marlon (LU) and Trân, Thi Thu Hòa (LU)
supervisor: Andreas Jakobsson (LU)
organization: Mathematical Statistics
course: MASM02 20221
year: 2022
type: H2 - Master's Degree (Two Years)
subject: Mathematics and Statistics
keywords: deep learning, voice impersonation, audio feature extraction, siamese neural networks
publication/series: Master's Thesis in Mathematical Sciences
report number: LUNFMS-3109-2022
ISSN: 1404-6342
other publication id: 2022:E35
language: English
id: 9091079
date added to LUP: 2022-09-14 09:23:08
date last changed: 2022-09-14 09:23:08

@misc{9091079,
  abstract     = {{Voice impersonation is a technique that has often been used by criminals whose goal is to avoid being identified while committing a crime. There are, however, other interesting cases where the police confronts a suspect with an incriminating recording, and the suspect would deny being the true speaker in that recording, and claim that it belonged to an expert impersonator. In both of these cases, it would be very helpful for the police to be able to predict with high probability whether a recording belongs to the true speaker or an impersonator.

This thesis aims to use neural networks to extract the most significant features in recognizing a unique voice, and then use them to classify whether a recording belongs to a true speaker or somebody impersonating them. In order to achieve this, we first extract the raw audio features that are commonly used in speech recognition, the majority of which are spectral features, then feed these features to a Siamese Neural Network to generate an encoding that best represent a recording of a person's voice. The structure of a Siamese neural network is determined by the type of loss function being used. In this project, we compare the performances of different network structures as well as different classifiers used in classifying the speech from the encoding. 

We present our approach and results on the data consisting of recordings of prominent American political figures, their impersonators, and several other individuals.}},
  author       = {{Almström, Marlon and Trân, Thi Thu Hòa}},
  issn         = {{1404-6342}},
  language     = {{eng}},
  note         = {{Student Paper}},
  series       = {{Master's Thesis in Mathematical Sciences}},
  title        = {{Voice Feature Extraction Using Siamese Neural Networks for Detecting Impersonators}},
  year         = {{2022}},
}

LUP Student Papers

LUND UNIVERSITY LIBRARIES
