Page 1 of 68

Centre for Mathematical Sciences

Mathematical Statistics

Master’s thesis

Classification of Acoustic Scenes

Using Convolutional Neural Networks

Colin Nordin Persson

Supervised by

Prof. Andreas Jakobsson

August, 2017


Page 2 of 68

Popular Science Summary

Using Deep Learning to Detect Parties

Deep learning is a subfield of artificial intelligence (AI) that has, in
the last few years, emerged as a technology that could eventually automate
everything from cars to computer programming. Another field in which deep
learning algorithms show promising results is audio recognition; one
example is detecting ongoing parties.

The term “deep learning” refers to deep artificial neural networks, which
are large and complex mathematical functions inspired by the structure of
the human brain. These deep artificial neural networks are used to find, or
learn, complicated patterns in data.

The Malmö-based startup company Minut makes a smart home sensor called
Point that can measure and detect events in home environments that the
homeowner might want to know about.

Some owners of the Point device rent out their homes and use the device to
make sure that the rental agreement is not violated. One common part of
such an agreement is that no parties are allowed in the rented home. This
is why Minut sees a need for a party detection algorithm on their device.

The party detection algorithm needs a couple of seconds of recorded audio
to tell whether it was recorded in a party environment or not. The recorded
audio snippet is first divided into a number of frames of equal length, and
the sound wave frequencies present in each frame are then calculated. This
gives a spectrogram of the audio clip: an image that shows how the
frequency content of the snippet varies over time. The deep neural network
then looks at this image and decides whether it comes from a party or not,
based on the thousands of spectrograms it has been trained on.
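The framing and frequency calculation described above can be sketched in a
few lines of Python with NumPy. This is only a minimal illustration, not
the processing used on Point: the function name, the frame length of 1024
samples, the 16 kHz sample rate, the Hann window and the 1 kHz test tone
are all assumptions made for the example.

```python
import numpy as np

def spectrogram(signal, frame_len=1024):
    """Split a 1-D signal into equal-length frames and return the
    magnitude spectrum of each frame (frequency bins x frames)."""
    n_frames = len(signal) // frame_len           # drop the incomplete last frame
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    window = np.hanning(frame_len)                # assumed window, reduces leakage
    spectra = np.abs(np.fft.rfft(frames * window, axis=1))
    return spectra.T                              # rows: frequency, columns: time

# Hypothetical input: two seconds of a pure 1 kHz tone sampled at 16 kHz.
sample_rate = 16000
t = np.arange(2 * sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000 * t)

spec = spectrogram(tone, frame_len=1024)
bin_width_hz = sample_rate / 1024                 # frequency resolution per row
peak_hz = spec.argmax(axis=0) * bin_width_hz      # strongest frequency per frame
```

For this test tone, every column of `spec` has its largest value in the row
that corresponds to 1 kHz, which is the kind of structure a network can
learn to recognise in an image.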

Training a neural network like this means showing it thousands of
spectrograms and at the same time telling it the correct answer, i.e.,
whether they came from parties or not. In this way, the network learns to
distinguish the “Party” spectrograms from the “No party” spectrograms.
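The idea of learning from labelled examples can be illustrated with a much
simpler model than a deep network. The sketch below trains a logistic
regression classifier with gradient descent on made-up data. It is a
stand-in for illustration only: the data, the feature count, the learning
rate and the number of passes are all assumptions, and the model is not the
network built in this thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: each "spectrogram" is reduced to 20 numbers.
# Label 1 means "Party", label 0 means "No party".
n_per_class, n_features = 200, 20
party = rng.normal(loc=1.0, scale=1.0, size=(n_per_class, n_features))
no_party = rng.normal(loc=-1.0, scale=1.0, size=(n_per_class, n_features))
X = np.vstack([party, no_party])
y = np.concatenate([np.ones(n_per_class), np.zeros(n_per_class)])

weights = np.zeros(n_features)   # tunable parameters, adjusted during training
bias = 0.0
learning_rate = 0.1

def predict_proba(inputs):
    """Estimated probability that each example is a party."""
    return 1.0 / (1.0 + np.exp(-(inputs @ weights + bias)))

# Show the model all examples together with the correct answers, then
# nudge the parameters to reduce the prediction error. Repeat.
for epoch in range(100):
    error = predict_proba(X) - y
    weights -= learning_rate * (X.T @ error) / len(y)
    bias -= learning_rate * error.mean()

# Fraction of training examples classified correctly.
accuracy = np.mean((predict_proba(X) > 0.5) == y)
```

Note that `accuracy` here is measured on the same examples the model was
trained on; a fair evaluation uses separate test recordings.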

The deep learning algorithm gives the correct answer 98% of the time,
although this has only been measured on a very limited number of test
recordings. Time will tell whether party-craving people will actually have
to start worrying about the AI police.


Page 3 of 68

Abstract

Minut is a startup company that builds a camera-free home monitor called Point. This thesis investigates
whether Point can use machine learning techniques to classify acoustic scenes, in particular to detect
whether a party is ongoing in the home where Point is located. Machine learning is a mathematical field
that uses data to learn models from which one can, if successful, make good predictions about the future.
Interest in this field, and in particular in a type of model called artificial neural networks, has become
massive in the last few years, mainly because of recent access to powerful hardware and large amounts of
data, which has made these models exceptional at certain tasks. Artificial neural networks are huge
mathematical functions with millions of tunable parameters, which makes them very flexible. By showing a
network lots of data and specifying which output is desired, the learning algorithm of the network is able
to learn the mapping between input and output. In this thesis, convolutional neural networks are used to
classify acoustic scenes by showing the network a time-frequency representation of audio together with the
correct label. One of the networks built, which we call SlimNet, is very small, yet it is able to
distinguish parties from other acoustic scenes with 98% accuracy. It is also found that the data
representation of an acoustic scene does not have to be very large for a neural network to classify it
correctly, which is desirable since Point has hardware limitations.
