Audio representation for environmental sound classification using convolutional neural networks

Lexfors, Linus; Johansson, Malte

Lexfors, Linus (LU) and Johansson, Malte (LU) (2018) In Master's Theses in Mathematical Sciences FMAM05 20182
Mathematics (Faculty of Engineering)

Abstract: A convolutional neural network (CNN) training framework is described and implemented. The framework is used to train and evaluate an audio classification system, with a focus on evaluating differences in audio representation. The dataset used is ESC-50, which contains 50 different classes of audio. We used SBCNN, a promising architecture suited for embedded systems because of its relatively small size. Several models are trained and evaluated. Linear spectrograms are compared with mel-scaled spectrograms, and differences in FFT window size and overlap when constructing these spectrograms are evaluated. In addition, models trained on downsampled training data are compared to the models using the original sample rate. In our models, mel-scaled spectrograms outperformed linear spectrograms. The top-performing model achieved a top-1 mean accuracy of 74.70%, using mel-scaled spectrograms and a 2048-sample FFT window with 75% overlap, compared with the linear spectrogram, which achieved a top-1 mean accuracy of 63.35%. The top model was further subjected to two different inference experiments: increasingly noisy data and mixed signals. We show that the model is relatively robust against wind noise: the accuracy remains above 60% until the SNR between signal and wind noise approaches 9 dB. It is hard to draw any strong conclusions from the mixed-signals test.
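The quantities named in the abstract (a 2048-sample FFT window with 75% overlap, the mel scale, and a signal-to-noise ratio in dB) can be illustrated with a minimal sketch. This is not the thesis's code: the function names are hypothetical, the mel conversion uses one common formula (the thesis may use another variant), the frame count assumes no padding, and the SNR is assumed to be a ratio of mean powers.

```python
import math


def hop_length(window_size: int, overlap: float) -> int:
    """Samples between the starts of consecutive FFT windows."""
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    return int(round(window_size * (1.0 - overlap)))


def num_frames(n_samples: int, window_size: int, hop: int) -> int:
    """Number of full windows that fit in a clip (assumption: no padding)."""
    if n_samples < window_size:
        return 0
    return 1 + (n_samples - window_size) // hop


def hz_to_mel(freq_hz: float) -> float:
    """Hz to mel using the common 2595 * log10(1 + f / 700) formula (assumed)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mean_power(samples) -> float:
    """Mean squared amplitude of a sequence of samples."""
    return sum(x * x for x in samples) / len(samples)


def mix_at_snr(signal, noise, snr_db: float):
    """Add noise to signal, scaled so the power ratio equals snr_db."""
    target_ratio = 10.0 ** (snr_db / 10.0)
    gain = math.sqrt(mean_power(signal) / (mean_power(noise) * target_ratio))
    return [s + gain * n for s, n in zip(signal, noise)]


if __name__ == "__main__":
    hop = hop_length(2048, 0.75)
    print("hop length:", hop)
    # ESC-50 clips are 5 s long at 44.1 kHz, i.e. 220500 samples.
    print("frames per clip:", num_frames(5 * 44100, 2048, hop))
```

With these assumptions, a lower `snr_db` passed to `mix_at_snr` yields a larger noise gain, which is the direction in which the abstract reports accuracy degrading.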

Please use this url to cite or link to this publication: http://lup.lub.lu.se/student-papers/record/8964345

author

Lexfors, Linus (LU) and Johansson, Malte (LU)

supervisor

Karl Åström (LU)

organization

Mathematics (Faculty of Engineering)

course

FMAM05 20182

year

2018

type

H2 - Master's Degree (Two Years)

subject

Mathematics and Statistics

keywords

Sound classification, machine learning, cnn

publication/series

Master's Theses in Mathematical Sciences

report number

LUTFMA-3368-2018

ISSN

1404-6342

other publication id

2018:E72

language

English

id

8964345

date added to LUP

2018-12-20 14:04:47

date last changed

2018-12-20 14:04:47

@misc{8964345,
  abstract     = {{A convolutional neural network (CNN) training framework is described and implemented. The framework is used to train and evaluate an audio classification system, with a focus on evaluating differences in audio representation. The dataset used is ESC-50, which contains 50 different classes of audio. We used SBCNN, a promising architecture suited for embedded systems because of its relatively small size. Several models are trained and evaluated. Linear spectrograms are compared with mel-scaled spectrograms, and differences in FFT window size and overlap when constructing these spectrograms are evaluated. In addition, models trained on downsampled training data are compared to the models using the original sample rate. In our models, mel-scaled spectrograms outperformed linear spectrograms. The top-performing model achieved a top-1 mean accuracy of 74.70\%, using mel-scaled spectrograms and a 2048-sample FFT window with 75\% overlap, compared with the linear spectrogram, which achieved a top-1 mean accuracy of 63.35\%. The top model was further subjected to two different inference experiments: increasingly noisy data and mixed signals. We show that the model is relatively robust against wind noise: the accuracy remains above 60\% until the SNR between signal and wind noise approaches 9 dB. It is hard to draw any strong conclusions from the mixed-signals test.}},
  author       = {{Lexfors, Linus and Johansson, Malte}},
  issn         = {{1404-6342}},
  language     = {{eng}},
  note         = {{Student Paper}},
  series       = {{Master's Theses in Mathematical Sciences}},
  title        = {{Audio representation for environmental sound classification using convolutional neural networks}},
  year         = {{2018}},
}
