Centre for Mathematical Sciences

Mathematical Statistics

Master’s thesis

Classification of Acoustic Scenes

Using Convolutional Neural Networks

Colin Nordin Persson

Supervised by

Prof. Andreas Jakobsson

August, 2017

Popular Science Summary

Using Deep Learning to Detect Parties

Deep learning is a subfield of artificial intelligence (AI) that has, over the last few years, emerged as something that may eventually automate everything from cars to computer programming. Another field in which deep learning algorithms show promising results is audio recognition; one example is detecting ongoing parties.

The term “deep learning” refers to deep artificial neural networks, a kind of large and complex mathematical function inspired by how the human brain is structured. These deep artificial neural networks are used to find, or learn, complicated patterns in data.

The Malmö-based startup company Minut makes a smart home sensor called Point that is able to measure and detect events in home environments that the homeowner might want to know about.

Some owners of the Point device rent out their homes and use the device to make sure that the rental agreement is not violated. One common part of such an agreement is that no parties are allowed in the rented home. This is why Minut sees a need for a party detection algorithm on their device.

The party detection algorithm needs a couple of seconds of recorded audio to tell whether it was recorded in a party environment or not. The recorded audio snippet is first divided into a number of frames of equal length, and the sound wave frequencies present in each frame are then calculated. This gives a spectrogram of the audio clip: an image that shows how the frequency content of the snippet varies over time. The deep neural network then looks at this image and decides whether it comes from a party or not, based on the thousands of spectrograms it has been trained on.
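
The framing and frequency steps described above can be illustrated with a minimal sketch. It is not the implementation used in the thesis: the non-overlapping frames, the plain discrete Fourier transform without windowing, the function names, and the synthetic 1 kHz test tone are all assumptions made for illustration. A practical implementation would normally use a fast Fourier transform routine such as `numpy.fft.rfft` instead of the direct sum shown here.

```python
import cmath
import math


def frame_signal(samples, frame_len):
    """Split samples into consecutive, non-overlapping frames of equal length.

    Trailing samples that do not fill a whole frame are dropped (an assumption).
    """
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


def magnitude_spectrum(frame):
    """Return the DFT magnitudes for frequency bins 0 .. N/2 of one frame."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame))
        spectrum.append(abs(coeff))
    return spectrum


def spectrogram(samples, frame_len):
    """Rows correspond to frames (time), columns to frequency bins."""
    return [magnitude_spectrum(f) for f in frame_signal(samples, frame_len)]


# Hypothetical input: a 1 kHz sine tone sampled at 8 kHz, 256 samples long.
sample_rate = 8000
frame_len = 64
tone = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(256)]

spec = spectrogram(tone, frame_len)

# Locate the strongest frequency bin in the first frame and convert it to Hz.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * sample_rate / frame_len
```

With these assumed settings each frame holds exactly eight periods of the tone, so the energy of every frame lands in a single frequency bin. A real recording would instead spread energy over many bins, and that pattern over time is what the network looks at.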

Training neural networks like this means showing them thousands of spectrograms and at the same time telling them the correct answer, i.e., whether or not they came from parties. In this way, the network learns to distinguish the “Party” spectrograms from the “No party” spectrograms.
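
The idea of learning from labelled examples can be sketched on a much smaller scale. The example below is not a deep or convolutional network: as a stand-in it trains a single artificial neuron (logistic regression) with stochastic gradient descent. The two-number feature vectors, the labels, the learning rate, and the number of epochs are all made up for illustration.

```python
import math


def sigmoid(z):
    """Squash a real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def train(examples, labels, epochs=200, lr=0.5):
    """Fit weights and a bias by per-example gradient descent on cross-entropy."""
    weights = [0.0] * len(examples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(examples, labels):
            pred = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            error = pred - y  # gradient of the loss w.r.t. the pre-activation
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias


def predict(weights, bias, x):
    """Return True for "Party" and False for "No party"."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) >= 0.5


# Made-up toy data: each example is a pair of hypothetical energy features.
examples = [[0.9, 0.8], [0.8, 0.9], [0.7, 0.9],
            [0.1, 0.2], [0.2, 0.1], [0.1, 0.1]]
labels = [1, 1, 1, 0, 0, 0]  # 1 = "Party", 0 = "No party"

weights, bias = train(examples, labels)
training_predictions = [predict(weights, bias, x) for x in examples]
```

Each pass shows the model an example together with its correct answer and nudges the parameters so that the prediction moves toward that answer. A deep network follows the same principle with far more parameters.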

The deep learning algorithm gives the correct answer 98% of the time, although this has so far only been measured on a very limited number of test recordings. Time will tell whether party-craving people will actually have to start worrying about the AI police.

Abstract

Minut is a startup company that builds a camera-free home monitor called Point. This thesis investigates whether Point can use machine learning techniques to classify acoustic scenes, in particular to detect whether a party is ongoing in the home where Point is located. Machine learning is a mathematical field that uses data to learn models from which one can, if successful, make good predictions about the future. Interest in this field, and in particular in a type of model called artificial neural networks, has become massive over the last few years. The main reason is the recent access to powerful hardware and large amounts of data, which has made these models exceptional at certain tasks. Artificial neural networks are huge mathematical functions with millions of tunable parameters, which makes them very flexible. By showing a network lots of data and specifying the desired output, the learning algorithm of the network is able to learn the mapping between input and output. In this thesis, convolutional neural networks are used to classify acoustic scenes by showing the network a time-frequency representation of audio together with the correct label. One of the networks built, which we call SlimNet, is very small, yet it is able to distinguish parties from other acoustic scenes with 98% accuracy. It is also found that the data representation of an acoustic scene does not have to be very large for a neural network to classify it correctly, which is desirable since Point has hardware limitations.