Classification of Acoustic Scenes Using Convolutional Neural Networks

Nordin Persson, Colin (2017) FMS820 20171
Mathematical Statistics
Abstract
Minut is a startup company that builds a camera-free home monitor called Point. This thesis investigates
whether Point can use machine learning techniques to classify acoustic scenes, in particular to detect
whether a party is ongoing in the home where Point is located. Machine learning is a mathematical field
that uses data to learn models from which one can, if successful, make good predictions about the future.
Interest in this field, and in particular in a type of model called artificial neural networks, has become
massive in the last few years, mainly because of recent access to powerful hardware and large amounts of
data, which has made these models exceptional at certain tasks. Artificial neural networks are huge
mathematical functions with millions of tunable parameters, which makes them very flexible. By showing a
network lots of data and specifying the desired output, the learning algorithm of the network is able to
learn the mapping between input and output. In this thesis, convolutional neural networks are used to
classify acoustic scenes by showing the network a time-frequency representation of audio together with the
correct label. One of the networks built, which we call SlimNet, is very small, yet it is able to distinguish
parties from other acoustic scenes with 98% accuracy. It is also found that the data representation of an
acoustic scene does not have to be very large for a neural network to classify it correctly, which is
desirable since Point has hardware limitations.
Popular Abstract
Using Deep Learning to Detect Parties
Deep learning is a subfield of artificial intelligence (AI) that has in the
last few years emerged as something that will eventually automate
everything from cars to computer programming. Another field in
which deep learning algorithms show promising results is audio
recognition of different kinds; one example is detecting ongoing parties.
The words “deep learning” refer to deep artificial neural networks, which
are a kind of large and complex mathematical function inspired by how
the human brain is structured. These deep artificial neural networks are used
to find, or learn, complicated patterns in data.
The Malmö-based startup company Minut makes a smart home sensor called
Point that is able to measure and detect events in home environments that the
homeowner might want to know about.
Some owners of the Point device rent out their homes and use the device
to make sure that the rental agreement is not violated. One
common part of such an agreement is that no parties are allowed in the rented
home. This is why Minut sees a need for a party detection algorithm on
their device.
The party detection algorithm needs a couple of seconds of recorded audio
to tell whether it was recorded in a party environment or not. The recorded
audio snippet is first divided into a number of frames of equal length, and then
the sound wave frequencies present in each frame are calculated.
This gives a spectrogram of the audio clip: an image that
shows how the frequency content of the audio snippet varies over time. The
deep neural network then looks at this image and decides whether it comes
from a party or not, based on the thousands of spectrograms it has been
trained on.
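
The framing and frequency-analysis steps described above can be sketched in a few lines of Python with NumPy. This is only an illustration under stated assumptions, not the thesis's actual preprocessing: the frame length, the 50% overlap between frames, the Hann window, the 8 kHz sample rate, and the synthetic 1 kHz test tone are all hypothetical choices made here for the example.

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Magnitude spectrogram: rows are frequency bins, columns are frames.

    frame_len, hop and the Hann window are illustrative assumptions.
    """
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    # Cut the clip into equal-length (overlapping) frames and taper each one.
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    # The FFT of each frame gives the frequencies present in that frame.
    return np.abs(np.fft.rfft(frames, axis=1)).T

# Hypothetical input: two seconds of a pure 1 kHz tone sampled at 8 kHz.
sample_rate = 8000
t = np.arange(2 * sample_rate) / sample_rate
clip = np.sin(2 * np.pi * 1000 * t)

spec = spectrogram(clip)
# Convert the strongest frequency bin back to hertz.
peak_hz = spec.mean(axis=1).argmax() * sample_rate / 256
```

Here `spec` is the "image" a network would look at: one column per frame and one row per frequency bin. For the pure tone, the energy concentrates in the row corresponding to 1 kHz.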

Training neural networks like this means showing them thousands of
spectrograms and at the same time telling them the correct answer, i.e. whether
they came from parties or not. In this way, the network learns to distinguish the
“Party” spectrograms from the “No party” spectrograms.
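
The idea of learning from labelled examples can be illustrated with a much simpler model than the one in the thesis. The sketch below trains a logistic regression classifier, used here as a stand-in for a deep convolutional network, by gradient descent on randomly generated data. The data, the 8x8 input size, the learning rate, and the number of steps are all hypothetical and chosen only to keep the example small.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 200 flattened 8x8 "spectrograms" of random noise.
# "Party" examples (label 1) get slightly more energy, "No party" (label 0) less.
labels = rng.integers(0, 2, size=200)
inputs = rng.normal(size=(200, 64)) + 0.5 * (labels[:, None] - 0.5)

# Tunable parameters of the model (logistic regression, not a CNN).
weights = np.zeros(64)
bias = 0.0
learning_rate = 0.1

for _ in range(200):
    # Forward pass: predicted probability of "Party" for every example.
    probs = 1.0 / (1.0 + np.exp(-(inputs @ weights + bias)))
    # Compare with the correct answers and nudge the parameters to reduce
    # the mean cross-entropy loss.
    error = probs - labels
    weights -= learning_rate * inputs.T @ error / len(labels)
    bias -= learning_rate * error.mean()

# Fraction of training examples the fitted model now labels correctly.
predictions = (inputs @ weights + bias) > 0
train_accuracy = (predictions == labels).mean()
```

A deep network follows the same loop of predicting, comparing with the correct label, and adjusting parameters, only with far more parameters and a more flexible function. Accuracy measured on the training data, as here, says little about performance on new recordings.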
The deep learning algorithm gives the correct answer 98% of the
time; however, this is based on only a very limited number of test recordings. Time
will tell whether party-craving people will actually have to start worrying about the AI
police.
author: Nordin Persson, Colin
course: FMS820 20171
year: 2017
type: H2 - Master's Degree (Two Years)
language: English
id: 8924630
date added to LUP: 2017-09-04 14:07:03
date last changed: 2017-09-04 14:07:03
@misc{8924630,
  abstract     = {Minut is a startup company that builds a camera-free home monitor called Point. This thesis investigates
whether Point can use machine learning techniques to classify acoustic scenes, in particular to detect
whether a party is ongoing in the home where Point is located. Machine learning is a mathematical field
that uses data to learn models from which one can, if successful, make good predictions about the future.
Interest in this field, and in particular in a type of model called artificial neural networks, has become
massive in the last few years, mainly because of recent access to powerful hardware and large amounts of
data, which has made these models exceptional at certain tasks. Artificial neural networks are huge
mathematical functions with millions of tunable parameters, which makes them very flexible. By showing a
network lots of data and specifying the desired output, the learning algorithm of the network is able to
learn the mapping between input and output. In this thesis, convolutional neural networks are used to
classify acoustic scenes by showing the network a time-frequency representation of audio together with the
correct label. One of the networks built, which we call SlimNet, is very small, yet it is able to distinguish
parties from other acoustic scenes with 98% accuracy. It is also found that the data representation of an
acoustic scene does not have to be very large for a neural network to classify it correctly, which is
desirable since Point has hardware limitations.},
  author       = {Nordin Persson, Colin},
  language     = {eng},
  note         = {Student Paper},
  title        = {Classification of Acoustic Scenes Using Convolutional Neural Networks},
  year         = {2017},
}