
Lund University Publications


Machine Learning for Perception and Localization: Efficient and Invariant Methods

Berg, Axel (2024) In Doctoral Thesis in Mathematical Sciences 2024(5).
Abstract
This thesis covers a set of methods related to machine perception and localization, which are two important building blocks of artificial intelligence. In Paper I, we explore the concept of regression via classification (RvC), which is often used for perception tasks where the target variable is ordinal or where the distance metric of the target space is not well suited as an objective function. However, it is not clear how the discretization of the target variable ought to be done. To this end, we introduce the concept of label diversity and propose a new loss function, based on concepts from ensemble learning, that can be used for both ordinal and continuous targets.
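As a rough illustration of the RvC recipe (the uniform binning and soft decoding below are generic placeholders, not the label-diversity loss of Paper I), a continuous target can be discretized into bins, a classifier trained on the bin labels, and a continuous estimate recovered as the probability-weighted average of the bin centers:

```python
import numpy as np

def make_bins(lo, hi, n_bins):
    """Equally spaced bin edges and centers over [lo, hi]."""
    edges = np.linspace(lo, hi, n_bins + 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    return edges, centers

def to_class_label(y, edges):
    """Discretize a continuous target into a bin index for classification."""
    return np.clip(np.digitize(y, edges) - 1, 0, len(edges) - 2)

def expected_value_decode(probs, centers):
    """Soft decoding: probability-weighted average of bin centers."""
    return probs @ centers

edges, centers = make_bins(0.0, 1.0, 4)    # 4 bins of width 0.25
label = to_class_label(0.3, edges)         # 0.3 falls in bin 1
probs = np.array([0.1, 0.7, 0.1, 0.1])     # hypothetical classifier output
estimate = expected_value_decode(probs, centers)
```

Soft decoding is one common way to recover a continuous value from class probabilities; taking the argmax bin center instead would limit accuracy to the bin width.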

Papers II and III deal with applying the concept of self-attention to different data domains. In Paper II, we focus on point clouds, which are modeled as unordered sets in 3D space. Although applying self-attention to sets is straightforward, we find that this mechanism by itself is not enough to improve feature learning. Instead, we propose a hierarchical approach inspired by graph neural networks, where self-attention is applied both to patches of points and to points within the patches. This results in improved predictive performance and reduced computational cost, while preserving invariance to permutations of points in the set.
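To see why attention over sets can preserve invariance to point order, here is a minimal sketch (plain dot-product self-attention with symmetric max-pooling, not the hierarchical architecture of Paper II): the attention output is permutation-equivariant, so pooling over the set yields an order-independent feature.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Dot-product self-attention over a set of N points (rows of X).
    Permuting the rows of X permutes the output rows the same way."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[1]), axis=-1)
    return A @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))                # 5 points in 3D
Wq, Wk, Wv = (rng.normal(size=(3, 3)) for _ in range(3))
perm = rng.permutation(5)

out = self_attention(X, Wq, Wk, Wv)
out_perm = self_attention(X[perm], Wq, Wk, Wv)

# Equivariance: reordering inputs reorders outputs identically.
assert np.allclose(out[perm], out_perm)
# Symmetric pooling over the set then gives a permutation-invariant feature.
assert np.allclose(out.max(axis=0), out_perm.max(axis=0))
```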

In Paper III, we explore the use of self-attention for auditory perception. Using a simple Transformer architecture, we achieve state-of-the-art performance for speech classification. However, deploying speech recognition models in real-world scenarios often involves making trade-offs between predictive performance and computational costs. In Paper IV, we therefore explore floating-point quantization of neural networks in the context of federated learning and propose a new method that allows training to be performed on low-precision hardware. More specifically, we propose a method for quantization-aware training and server-to-device communication in 8-bit floating point. This allows for a significant reduction in the amount of data that needs to be communicated during the training process. Building upon the results in Paper III, we also show that our Transformer-based model can be quantized and trained in a realistic federated speech recognition setup and still achieve good performance.
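As an illustrative sketch of low-precision simulation (the mantissa width and exponent range below are loosely E4M3-like assumptions, not the scheme of Paper IV), 8-bit floating point can be mimicked on ordinary hardware by rounding each value to a few mantissa bits and clamping its exponent:

```python
import numpy as np

def simulate_fp8(x, mantissa_bits=3, exp_min=-6, exp_max=8):
    """Round float32 values to an 8-bit-like float format: keep
    `mantissa_bits` fractional bits and clamp the exponent range.
    This simulates fp8 rounding; it is not real fp8 arithmetic."""
    x = np.asarray(x, dtype=np.float32)
    sign = np.sign(x)
    mag = np.abs(x)
    # Exponent of each value, clamped to the representable range.
    exp = np.clip(np.floor(np.log2(np.where(mag > 0, mag, 1.0))), exp_min, exp_max)
    scale = 2.0 ** (exp - mantissa_bits)
    q = np.round(mag / scale) * scale
    # Clamp overflow to the largest representable magnitude.
    max_val = (2 - 2.0 ** -mantissa_bits) * 2.0 ** exp_max
    return sign * np.minimum(q, max_val)

w = np.array([1.0, 1.3, -0.1], dtype=np.float32)
wq = simulate_fp8(w)   # each entry snapped to a nearby fp8-representable value
```

In quantization-aware training, such a rounding step is typically applied in the forward pass while gradients flow through unchanged (the straight-through estimator).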

Papers V, VI and VII also deal with auditory perception, but from the localization point of view. This involves processing signals from microphone arrays and extracting spatial cues that enable the system to infer the location of the sound source. One such cue is the time difference of arrival (TDOA), which is estimated by correlating signals from different pairs of microphones. However, measuring TDOA in adverse acoustical conditions is difficult, which motivates the use of machine learning for this task. In Paper V, we propose a learning-based extension of a classical method for TDOA estimation that improves prediction accuracy, while simultaneously preserving some of the properties of the classical method. This is achieved by using a model architecture that is equivariant to time shifts together with an RvC training objective.
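The classical correlation-based estimator referred to above can be sketched as GCC-PHAT, a standard member of this family (textbook material, not the learning-based extension of Paper V): cross-correlate the two signals in the frequency domain, whiten by the magnitude, and pick the lag of the correlation peak.

```python
import numpy as np

def gcc_phat_tdoa(sig, ref, fs):
    """Estimate the TDOA between two microphone signals with GCC-PHAT."""
    n = len(sig) + len(ref)
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    cross = SIG * np.conj(REF)
    cross /= np.maximum(np.abs(cross), 1e-12)        # PHAT weighting
    cc = np.fft.irfft(cross, n=n)
    cc = np.concatenate((cc[-(n // 2):], cc[: n // 2 + 1]))  # center zero lag
    lag = np.argmax(np.abs(cc)) - n // 2
    return lag / fs

fs = 16000
rng = np.random.default_rng(1)
ref = rng.normal(size=2048)
delay = 40                                  # samples; 2.5 ms at 16 kHz
sig = np.roll(ref, delay)                   # delayed copy of the reference
tdoa = gcc_phat_tdoa(sig, ref, fs)          # approximately delay / fs
```

The PHAT weighting discards magnitude information and keeps only phase, which sharpens the correlation peak but degrades in noise and reverberation, which is exactly the regime that motivates learned estimators.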

TDOA estimates are often used as input to sound source localization (SSL) systems. In Paper VI, we extend the method from Paper V to predict TDOAs from multiple overlapping sound sources and show that this is a good pre-training task for extracting correlation features for an SSL system, with improved localization performance compared to popular handcrafted input features.

In Paper VII, we instead focus on a single sound source, but with a variable number of microphones in the array. Most machine learning methods for SSL are trained using a specific microphone array setup and will not work if a microphone is turned off or moved to a different position. We solve this problem by modeling pairs of audio recordings and microphone coordinates as nodes in a multi-modal graph. This enables the use of an attention-based autoencoder model that infers the location of the sound source using both microphone coordinates, i.e., a set of points in 3D space, and audio features, while preserving invariance to permutations of microphones. Furthermore, we address variants of the problem where data is partially missing, such as signals from a microphone at an unknown location.
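The pairing idea can be sketched as follows (a toy model with made-up dimensions, not the autoencoder of Paper VII): each microphone contributes one token that concatenates its 3D coordinates with an audio feature vector, and attention pooling against a learned query makes the predicted source position independent of microphone order.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def locate(coords, feats, Wtok, q, Wout):
    """Pool per-microphone tokens (coords ++ audio features) with a single
    learned attention query; the result is invariant to microphone order."""
    tokens = np.concatenate([coords, feats], axis=1) @ Wtok   # (M, d)
    attn = softmax(tokens @ q)                                # (M,)
    pooled = attn @ tokens                                    # (d,)
    return pooled @ Wout                                      # 3D position estimate

rng = np.random.default_rng(2)
M, d_feat, d = 6, 8, 16
coords = rng.normal(size=(M, 3))            # microphone positions
feats = rng.normal(size=(M, d_feat))        # per-microphone audio features
Wtok = rng.normal(size=(3 + d_feat, d))
q = rng.normal(size=d)
Wout = rng.normal(size=(d, 3))

p1 = locate(coords, feats, Wtok, q, Wout)
perm = rng.permutation(M)
p2 = locate(coords[perm], feats[perm], Wtok, q, Wout)
assert np.allclose(p1, p2)   # shuffling microphones leaves the prediction unchanged
```

Because the pooling weights depend only on each token's content, adding or dropping a microphone changes the token set but not the model's parameters, which is what allows a variable array size.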
author
supervisor
opponent
  • Assoc. Prof. Rindom Jensen, Jesper, Aalborg University, Denmark.
organization
publishing date
type
Thesis
publication status
published
subject
keywords
neural networks, deep learning, machine perception, ordinal regression, shape recognition, transformer, audio classification, quantization, sound source localization
in
Doctoral Thesis in Mathematical Sciences
volume
2024
issue
5
pages
260 pages
publisher
Centre for Mathematical Sciences, Lund University
defense location
Lecture Hall MH:Hörmander, Centre of Mathematical Sciences, Märkesbacken 4, Faculty of Engineering LTH, Lund University, Lund. The dissertation will be live streamed, but part of the premises is to be excluded from the live stream. Zoom: https://lu-se.zoom.us/j/66823653847
defense date
2025-02-07 13:15:00
ISSN
1404-0034
ISBN
978-91-8104-314-3
978-91-8104-315-0
project
Deep Learning for Simultaneous Localization and Mapping
WASP: Wallenberg AI, Autonomous Systems and Software Program at Lund University
language
English
LU publication?
yes
id
5aab1153-8262-437e-8dce-7fef5fa9f606
date added to LUP
2024-11-15 15:23:46
date last changed
2025-01-11 03:13:25
@phdthesis{5aab1153-8262-437e-8dce-7fef5fa9f606,
  abstract     = {{<div>This thesis covers a set of methods related to machine perception and localization, which are two important building blocks of artificial intelligence. In Paper I, we explore the concept of regression via classification (RvC), which is often used for perception tasks where the target variable is ordinal or where the distance metric of the target space is not well suited as an objective function. However, it is not clear how the discretization of the target variable ought to be done. To this end, we introduce the concept of label diversity and propose a new loss function, based on concepts from ensemble learning, that can be used for both ordinal and continuous targets.</div><div><br/></div><div>Papers II and III deal with applying the concept of self-attention to different data domains. In Paper II, we focus on point clouds, which are modeled as unordered sets in 3D space. Although applying self-attention to sets is straightforward, we find that this mechanism by itself is not enough to improve feature learning. Instead, we propose a hierarchical approach inspired by graph neural networks, where self-attention is applied both to patches of points and to points within the patches. This results in improved predictive performance and reduced computational cost, while preserving invariance to permutations of points in the set.</div><div><br/></div><div>In Paper III, we explore the use of self-attention for auditory perception. Using a simple Transformer architecture, we achieve state-of-the-art performance for speech classification. However, deploying speech recognition models in real-world scenarios often involves making trade-offs between predictive performance and computational costs. In Paper IV, we therefore explore floating-point quantization of neural networks in the context of federated learning and propose a new method that allows training to be performed on low-precision hardware.
More specifically, we propose a method for quantization-aware training and server-to-device communication in 8-bit floating point. This allows for a significant reduction in the amount of data that needs to be communicated during the training process. Building upon the results in Paper III, we also show that our Transformer-based model can be quantized and trained in a realistic federated speech recognition setup and still achieve good performance.</div><div><br/></div><div>Papers V, VI and VII also deal with auditory perception, but from the localization point of view. This involves processing signals from microphone arrays and extracting spatial cues that enable the system to infer the location of the sound source. One such cue is the time difference of arrival (TDOA), which is estimated by correlating signals from different pairs of microphones. However, measuring TDOA in adverse acoustical conditions is difficult, which motivates the use of machine learning for this task. In Paper V, we propose a learning-based extension of a classical method for TDOA estimation that improves prediction accuracy, while simultaneously preserving some of the properties of the classical method. This is achieved by using a model architecture that is equivariant to time shifts together with an RvC training objective.</div><div><br/></div><div>TDOA estimates are often used as input to sound source localization (SSL) systems. In Paper VI, we extend the method from Paper V to predict TDOAs from multiple overlapping sound sources and show that this is a good pre-training task for extracting correlation features for an SSL system, with improved localization performance compared to popular handcrafted input features.</div><div><br/></div><div>In Paper VII, we instead focus on a single sound source, but with a variable number of microphones in the array.
Most machine learning methods for SSL are trained using a specific microphone array setup and will not work if a microphone is turned off or moved to a different position. We solve this problem by modeling pairs of audio recordings and microphone coordinates as nodes in a multi-modal graph. This enables the use of an attention-based autoencoder model that infers the location of the sound source using both microphone coordinates, i.e., a set of points in 3D space, and audio features, while preserving invariance to permutations of microphones. Furthermore, we address variants of the problem where data is partially missing, such as signals from a microphone at an unknown location.</div>}},
  author       = {{Berg, Axel}},
  isbn         = {{978-91-8104-314-3}},
  issn         = {{1404-0034}},
  keywords     = {{neural networks; deep learning; machine perception; ordinal regression; shape recognition; transformer; audio classification; quantization; sound source localization}},
  language     = {{eng}},
  number       = {{5}},
  publisher    = {{Centre for Mathematical Sciences, Lund University}},
  school       = {{Lund University}},
  series       = {{Doctoral Thesis in Mathematical Sciences}},
  title        = {{Machine Learning for Perception and Localization: Efficient and Invariant Methods}},
  url          = {{https://lup.lub.lu.se/search/files/202756396/thesis.pdf}},
  volume       = {{2024}},
  year         = {{2024}},
}