Voice Feature Extraction Using Siamese Neural Networks for Detecting Impersonators
(2022) In Master's Thesis in Mathematical Sciences MASM02 20221
Mathematical Statistics
- Abstract
- Voice impersonation is a technique often used by criminals who want to avoid being identified while committing a crime. There are, however, other cases in which the police confront a suspect with an incriminating recording, and the suspect denies being the speaker in that recording, claiming instead that it was made by an expert impersonator. In both cases, it would be very helpful for the police to be able to predict with high probability whether a recording belongs to the true speaker or to an impersonator.
This thesis aims to use neural networks to extract the features that are most significant for recognizing a unique voice, and then to use them to classify whether a recording belongs to a true speaker or to somebody impersonating them. To achieve this, we first extract the raw audio features that are commonly used in speech recognition, the majority of which are spectral features, and then feed these features to a Siamese neural network to generate an encoding that best represents a recording of a person's voice. The structure of a Siamese neural network is determined by the type of loss function being used. In this project, we compare the performance of different network structures as well as of different classifiers used to classify the speech from the encoding.
We present our approach and results on data consisting of recordings of prominent American political figures, their impersonators, and several other individuals.
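The abstract notes that the structure of a Siamese network follows from its loss function but does not name the losses compared. As a minimal sketch, assuming the two losses most commonly paired with Siamese networks (a contrastive loss over pairs, which implies two weight-sharing branches, and a triplet loss over anchor/positive/negative triples, which implies three), the difference can be illustrated on toy encodings. The function names, the margin value, and the two-dimensional encodings are illustrative assumptions, not taken from the thesis.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length encodings."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(enc_a, enc_b, same_speaker, margin=1.0):
    """Pairwise loss (two branches): small distance is rewarded for
    same-speaker pairs, distance below `margin` is penalized otherwise."""
    d = euclidean(enc_a, enc_b)
    if same_speaker:
        return d ** 2
    return max(0.0, margin - d) ** 2


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Triplet loss (three branches): the anchor should be closer to the
    positive (same speaker) than to the negative by at least `margin`."""
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Hypothetical 2-D encodings; real encodings would come from the network.
true_voice = [0.0, 0.0]
same_voice = [0.0, 0.0]
impersonator = [3.0, 4.0]

pair_loss = contrastive_loss(true_voice, impersonator, same_speaker=False)
trip_loss = triplet_loss(true_voice, same_voice, impersonator)
```

In this toy case both losses are zero, because the impersonator's encoding already lies farther from the true voice than the margin requires; swapping the positive and negative encodings in `triplet_loss` would produce a positive loss instead.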
Please use this url to cite or link to this publication:
http://lup.lub.lu.se/student-papers/record/9091079
- author
- Almström, Marlon LU and Trân, Thi Thu Hòa LU
- supervisor
- organization
- course
- MASM02 20221
- year
- 2022
- type
- H2 - Master's Degree (Two Years)
- subject
- keywords
- deep learning, voice impersonation, audio feature extraction, siamese neural networks
- publication/series
- Master's Thesis in Mathematical Sciences
- report number
- LUNFMS-3109-2022
- ISSN
- 1404-6342
- other publication id
- 2022:E35
- language
- English
- id
- 9091079
- date added to LUP
- 2022-09-14 09:23:08
- date last changed
- 2022-09-14 09:23:08
@misc{9091079,
  abstract     = {{Voice impersonation is a technique often used by criminals who want to avoid being identified while committing a crime. There are, however, other cases in which the police confront a suspect with an incriminating recording, and the suspect denies being the speaker in that recording, claiming instead that it was made by an expert impersonator. In both cases, it would be very helpful for the police to be able to predict with high probability whether a recording belongs to the true speaker or to an impersonator. This thesis aims to use neural networks to extract the features that are most significant for recognizing a unique voice, and then to use them to classify whether a recording belongs to a true speaker or to somebody impersonating them. To achieve this, we first extract the raw audio features that are commonly used in speech recognition, the majority of which are spectral features, and then feed these features to a Siamese neural network to generate an encoding that best represents a recording of a person's voice. The structure of a Siamese neural network is determined by the type of loss function being used. In this project, we compare the performance of different network structures as well as of different classifiers used to classify the speech from the encoding. We present our approach and results on data consisting of recordings of prominent American political figures, their impersonators, and several other individuals.}},
  author       = {{Almström, Marlon and Trân, Thi Thu Hòa}},
  issn         = {{1404-6342}},
  language     = {{eng}},
  note         = {{Student Paper}},
  series       = {{Master's Thesis in Mathematical Sciences}},
  title        = {{Voice Feature Extraction Using Siamese Neural Networks for Detecting Impersonators}},
  year         = {{2022}},
}