Audio representation for environmental sound classification using convolutional neural networks

Lexfors, Linus LU and Johansson, Malte LU (2018). In Master's Theses in Mathematical Sciences, FMAM05 20182
Mathematics (Faculty of Engineering)
Abstract
A convolutional neural network (CNN) training framework is described and implemented. The framework is used to train and evaluate an audio classification system, with a focus on evaluating differences in audio representation. The dataset used is ESC-50, which contains 50 classes of audio. We used SBCNN, a promising architecture suited for embedded systems because of its relatively small size. Several models are trained and evaluated. Linear spectrograms are compared with mel-scaled spectrograms, and differences in FFT window size and overlap when constructing these spectrograms are evaluated. In addition, models trained on downsampled training data are compared with models using the original sample rate. In our models, mel-scaled spectrograms outperformed linear spectrograms: the top-performing model achieved a top-1 mean accuracy of 74.70% using mel-scaled spectrograms and a 2048-sample FFT window with 75% overlap, compared to the linear spectrogram, which achieved a top-1 mean accuracy of 63.35%. The top model was further subjected to two inference experiments: increasingly noisy data and mixed signals. We show that the model is relatively robust against wind noise: accuracy remains above 60% until the SNR between signal and wind noise approaches 9 dB. No strong conclusions can be drawn from the mixed-signals test.
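The representation the abstract describes — a 2048-sample FFT window with 75% overlap (i.e. a hop of 512 samples) followed by mel scaling — can be sketched in NumPy. The thesis does not publish code, so the Hann window, the 60-band filterbank, and all function names below are illustrative assumptions, not the authors' pipeline.

```python
import numpy as np

def hann(n):
    # Periodic Hann window (an assumption; the thesis does not state its window choice)
    return 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(n) / n)

def stft_mag(x, n_fft=2048, hop=512):
    """Magnitude STFT; n_fft=2048 with hop=512 gives the 75% overlap from the abstract."""
    w = hann(n_fft)
    frames = [x[i:i + n_fft] * w for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=1))  # shape: (n_frames, n_fft//2 + 1)

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels=60):
    """Triangular filters spaced evenly on the mel scale (60 bands is an assumption)."""
    mels = np.linspace(0.0, hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        if c > l:
            fb[i, l:c] = (np.arange(l, c) - l) / (c - l)   # rising slope
        if r > c:
            fb[i, c:r] = (r - np.arange(c, r)) / (r - c)   # falling slope
    return fb

sr = 44100                                  # ESC-50 clips are sampled at 44.1 kHz
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440.0 * t)           # 1 s test tone in place of a real clip
S = stft_mag(x)                             # linear-frequency spectrogram
M = S @ mel_filterbank(sr, 2048).T          # mel-scaled spectrogram
log_mel = np.log(M + 1e-10)                 # log compression, a typical CNN input
```

Feeding the network `S` instead of `M` is the linear-versus-mel comparison the abstract reports.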
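For the noise-robustness experiment (accuracy above 60% down to roughly 9 dB SNR against wind noise), one common way to build test signals at a target SNR is to scale the noise relative to the clean signal's power. The function below is a generic sketch of that idea, not the authors' exact procedure, and the white-noise stand-in replaces real wind recordings.

```python
import numpy as np

def mix_at_snr(signal, noise, snr_db):
    """Scale `noise` so that signal + noise has the requested signal-to-noise ratio in dB."""
    noise = noise[:len(signal)]
    p_sig = np.mean(signal ** 2)
    p_noise = np.mean(noise ** 2)
    # SNR(dB) = 10*log10(p_sig / p_noise_scaled)  =>  solve for the noise gain
    gain = np.sqrt(p_sig / (p_noise * 10.0 ** (snr_db / 10.0)))
    return signal + gain * noise

rng = np.random.default_rng(0)
sig = np.sin(2 * np.pi * 440.0 * np.arange(44100) / 44100.0)
noise = rng.standard_normal(44100)          # stand-in for a wind-noise recording
mixed = mix_at_snr(sig, noise, snr_db=9.0)  # the 9 dB point the abstract mentions
```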
author: Lexfors, Linus LU and Johansson, Malte LU
organization: Mathematics (Faculty of Engineering)
course: FMAM05 20182
year: 2018
type: H2 - Master's Degree (Two Years)
keywords: Sound classification, machine learning, cnn
publication/series: Master's Theses in Mathematical Sciences
report number: LUTFMA-3368-2018
ISSN: 1404-6342
other publication id: 2018:E72
language: English
id: 8964345
date added to LUP: 2018-12-20 14:04:47
date last changed: 2018-12-20 14:04:47
@misc{8964345,
  abstract     = {A convolutional neural network (CNN) training framework is described and implemented. The framework is used to train and evaluate an audio classification system, with a focus on evaluating differences in audio representation. The dataset used is ESC-50, which contains 50 classes of audio. We used SBCNN, a promising architecture suited for embedded systems because of its relatively small size. Several models are trained and evaluated. Linear spectrograms are compared with mel-scaled spectrograms, and differences in FFT window size and overlap when constructing these spectrograms are evaluated. In addition, models trained on downsampled training data are compared with models using the original sample rate. In our models, mel-scaled spectrograms outperformed linear spectrograms: the top-performing model achieved a top-1 mean accuracy of 74.70\% using mel-scaled spectrograms and a 2048-sample FFT window with 75\% overlap, compared to the linear spectrogram, which achieved a top-1 mean accuracy of 63.35\%. The top model was further subjected to two inference experiments: increasingly noisy data and mixed signals. We show that the model is relatively robust against wind noise: accuracy remains above 60\% until the SNR between signal and wind noise approaches 9 dB. No strong conclusions can be drawn from the mixed-signals test.},
  author       = {Lexfors, Linus and Johansson, Malte},
  issn         = {1404-6342},
  keywords     = {Sound classification, machine learning, cnn},
  language     = {eng},
  note         = {Student Paper},
  series       = {Master's Theses in Mathematical Sciences},
  title        = {Audio representation for environmental sound classification using convolutional neural networks},
  year         = {2018},
}