
Lund University Publications


Machine Learning for Perception and Localization: Efficient and Invariant Methods

Berg, Axel (2024) In Doctoral Thesis in Mathematical Sciences 2024(5).
Abstract
This thesis covers a set of methods related to machine perception and localization, which are two important building blocks of artificial intelligence. In Paper I, we explore the concept of regression via classification (RvC), which is often used for perception tasks where the target variable is ordinal or where the distance metric of the target space is not well suited as an objective function. However, it is not clear how the discretization of the target variable ought to be done. To this end, we introduce the concept of label diversity and propose a new loss function, based on concepts from ensemble learning, that can be used for both ordinal and continuous targets.
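As a rough illustration of the RvC recipe (the uniform binning and soft decoding below are generic placeholders, not the label-diversity loss of Paper I), a continuous target can be discretized into bins, a classifier trained on the bin labels, and a continuous estimate recovered as the probability-weighted average of the bin centers:

```python
import numpy as np

def make_bins(lo, hi, n_bins):
    """Equally spaced bin edges and centers over [lo, hi]."""
    edges = np.linspace(lo, hi, n_bins + 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    return edges, centers

def to_class_label(y, edges):
    """Discretize a continuous target into a bin index for classification."""
    return np.clip(np.digitize(y, edges) - 1, 0, len(edges) - 2)

def expected_value_decode(probs, centers):
    """Soft decoding: probability-weighted average of bin centers."""
    return probs @ centers

edges, centers = make_bins(0.0, 1.0, 4)    # 4 bins of width 0.25
label = to_class_label(0.3, edges)         # 0.3 falls in bin 1
probs = np.array([0.1, 0.7, 0.1, 0.1])     # hypothetical classifier output
estimate = expected_value_decode(probs, centers)
```

Soft decoding is one common way to recover a continuous value from class probabilities; taking the argmax bin center instead would limit accuracy to the bin width.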

Papers II and III deal with applying the concept of self-attention to different data domains. In Paper II, we focus on point clouds, which are modeled as unordered sets in 3D space. Although applying self-attention to sets is straightforward, we find that this mechanism by itself is not enough to improve feature learning. Instead, we propose a hierarchical approach inspired by graph neural networks, where self-attention is applied both to patches of points and to points within the patches. This results in improved predictive performance and reduced computational cost, while preserving invariance to permutations of points in the set.
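To see why attention over sets can preserve invariance to point order, here is a minimal sketch (plain dot-product self-attention with symmetric max-pooling, not the hierarchical architecture of Paper II): the attention output is permutation-equivariant, so pooling over the set yields an order-independent feature.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Dot-product self-attention over a set of N points (rows of X).
    Permuting the rows of X permutes the output rows the same way."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[1]), axis=-1)
    return A @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))                # 5 points in 3D
Wq, Wk, Wv = (rng.normal(size=(3, 3)) for _ in range(3))
perm = rng.permutation(5)

out = self_attention(X, Wq, Wk, Wv)
out_perm = self_attention(X[perm], Wq, Wk, Wv)

# Equivariance: reordering inputs reorders outputs identically.
assert np.allclose(out[perm], out_perm)
# Symmetric pooling over the set then gives a permutation-invariant feature.
assert np.allclose(out.max(axis=0), out_perm.max(axis=0))
```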

In Paper III, we explore the use of self-attention for auditory perception. Using a simple Transformer architecture, we achieve state-of-the-art performance for speech classification. However, deploying speech recognition models in real-world scenarios often involves making trade-offs between predictive performance and computational costs. In Paper IV, we therefore explore floating-point quantization of neural networks in the context of federated learning and propose a new method that allows training to be performed on low-precision hardware. More specifically, we propose a method for quantization-aware training and server-to-device communication in 8-bit floating point. This allows for a significant reduction in the amount of data that needs to be communicated during the training process. Building upon the results in Paper III, we also show that our Transformer-based model can be quantized and trained in a realistic federated speech recognition setup and still achieve good performance.
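As an illustrative sketch of low-precision simulation (the mantissa width and exponent range below are loosely E4M3-like assumptions, not the scheme of Paper IV), 8-bit floating point can be mimicked on ordinary hardware by rounding each value to a few mantissa bits and clamping its exponent:

```python
import numpy as np

def simulate_fp8(x, mantissa_bits=3, exp_min=-6, exp_max=8):
    """Round float32 values to an 8-bit-like float format: keep
    `mantissa_bits` fractional bits and clamp the exponent range.
    This simulates fp8 rounding; it is not real fp8 arithmetic."""
    x = np.asarray(x, dtype=np.float32)
    sign = np.sign(x)
    mag = np.abs(x)
    # Exponent of each value, clamped to the representable range.
    exp = np.clip(np.floor(np.log2(np.where(mag > 0, mag, 1.0))), exp_min, exp_max)
    scale = 2.0 ** (exp - mantissa_bits)
    q = np.round(mag / scale) * scale
    # Clamp overflow to the largest representable magnitude.
    max_val = (2 - 2.0 ** -mantissa_bits) * 2.0 ** exp_max
    return sign * np.minimum(q, max_val)

w = np.array([1.0, 1.3, -0.1], dtype=np.float32)
wq = simulate_fp8(w)   # each entry snapped to a nearby fp8-representable value
```

In quantization-aware training, such a rounding step is typically applied in the forward pass while gradients flow through unchanged (the straight-through estimator).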

Papers V, VI and VII also deal with auditory perception, but from the localization point of view. This involves processing signals from microphone arrays and extracting spatial cues that enable the system to infer the location of the sound source. One such cue is the time difference of arrival (TDOA), which is estimated by correlating signals from different pairs of microphones. However, measuring TDOA in adverse acoustical conditions is difficult, which motivates the use of machine learning for this task. In Paper V, we propose a learning-based extension of a classical method for TDOA estimation that improves prediction accuracy, while simultaneously preserving some of the properties of the classical method. This is achieved by using a model architecture that is equivariant to time shifts together with an RvC training objective.
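The classical correlation-based estimator referred to above can be sketched as GCC-PHAT, a standard member of this family (textbook material, not the learning-based extension of Paper V): cross-correlate the two signals in the frequency domain, whiten by the magnitude, and pick the lag of the correlation peak.

```python
import numpy as np

def gcc_phat_tdoa(sig, ref, fs):
    """Estimate the TDOA between two microphone signals with GCC-PHAT."""
    n = len(sig) + len(ref)
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    cross = SIG * np.conj(REF)
    cross /= np.maximum(np.abs(cross), 1e-12)        # PHAT weighting
    cc = np.fft.irfft(cross, n=n)
    cc = np.concatenate((cc[-(n // 2):], cc[: n // 2 + 1]))  # center zero lag
    lag = np.argmax(np.abs(cc)) - n // 2
    return lag / fs

fs = 16000
rng = np.random.default_rng(1)
ref = rng.normal(size=2048)
delay = 40                                  # samples; 2.5 ms at 16 kHz
sig = np.roll(ref, delay)                   # delayed copy of the reference
tdoa = gcc_phat_tdoa(sig, ref, fs)          # approximately delay / fs
```

The PHAT weighting discards magnitude information and keeps only phase, which sharpens the correlation peak but degrades in noise and reverberation, which is exactly the regime that motivates learned estimators.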

TDOA estimates are often used as input to sound source localization (SSL) systems. In Paper VI, we extend the method from Paper V to predict TDOAs from multiple overlapping sound sources and show that this is a good pre-training task for extracting correlation features for an SSL system, with improved localization performance compared to popular handcrafted input features.

In Paper VII, we instead focus on a single sound source, but with a variable number of microphones in the array. Most machine learning methods for SSL are trained using a specific microphone array setup and will not work if a microphone is turned off or moved to a different position. We solve this problem by modeling pairs of audio recordings and microphone coordinates as nodes in a multi-modal graph. This enables the use of an attention-based autoencoder model that infers the location of the sound source using both microphone coordinates, i.e., a set of points in 3D space, and audio features, while preserving invariance to permutations of microphones. Furthermore, we address variants of the problem where data is partially missing, such as signals from a microphone at an unknown location.
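The pairing idea can be sketched as follows (a toy model with made-up dimensions, not the autoencoder of Paper VII): each microphone contributes one token that concatenates its 3D coordinates with an audio feature vector, and attention pooling against a learned query makes the predicted source position independent of microphone order.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def locate(coords, feats, Wtok, q, Wout):
    """Pool per-microphone tokens (coords ++ audio features) with a single
    learned attention query; the result is invariant to microphone order."""
    tokens = np.concatenate([coords, feats], axis=1) @ Wtok   # (M, d)
    attn = softmax(tokens @ q)                                # (M,)
    pooled = attn @ tokens                                    # (d,)
    return pooled @ Wout                                      # 3D position estimate

rng = np.random.default_rng(2)
M, d_feat, d = 6, 8, 16
coords = rng.normal(size=(M, 3))            # microphone positions
feats = rng.normal(size=(M, d_feat))        # per-microphone audio features
Wtok = rng.normal(size=(3 + d_feat, d))
q = rng.normal(size=d)
Wout = rng.normal(size=(d, 3))

p1 = locate(coords, feats, Wtok, q, Wout)
perm = rng.permutation(M)
p2 = locate(coords[perm], feats[perm], Wtok, q, Wout)
assert np.allclose(p1, p2)   # shuffling microphones leaves the prediction unchanged
```

Because the pooling weights depend only on each token's content, adding or dropping a microphone changes the token set but not the model's parameters, which is what allows a variable array size.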
author
supervisor
opponent
  • Assoc. Prof. Rindom Jensen, Jesper, Aalborg University, Denmark.
organization
publishing date
type
Thesis
publication status
published
subject
keywords
neural networks, deep learning, machine perception, ordinal regression, shape recognition, transformer, audio classification, quantization, sound source localization
in
Doctoral Thesis in Mathematical Sciences
volume
2024
issue
5
pages
260 pages
publisher
Centre for Mathematical Sciences, Lund University
defense location
Lecture Hall MH:Hörmander, Centre of Mathematical Sciences, Märkesbacken 4, Faculty of Engineering LTH, Lund University, Lund. The dissertation will be live streamed, but part of the premises is to be excluded from the live stream. Zoom: https://lu-se.zoom.us/j/66823653847
defense date
2025-02-07 13:15:00
ISSN
1404-0034
ISBN
978-91-8104-314-3
978-91-8104-315-0
project
Deep Learning for Simultaneous Localization and Mapping
WASP: Wallenberg AI, Autonomous Systems and Software Program at Lund University
language
English
LU publication?
yes
id
5aab1153-8262-437e-8dce-7fef5fa9f606
date added to LUP
2024-11-15 15:23:46
date last changed
2025-01-11 03:13:25
@phdthesis{5aab1153-8262-437e-8dce-7fef5fa9f606,
  abstract     = {{<div>This thesis covers a set of methods related to machine perception and localization, which are two important building blocks of artificial intelligence. In Paper I, we explore the concept of regression via classification (RvC), which is often used for perception tasks where the target variable is ordinal or where the distance metric of the target space is not well suited as an objective function. However, it is not clear how the discretization of the target variable ought to be done. To this end, we introduce the concept of label diversity and propose a new loss function, based on concepts from ensemble learning, that can be used for both ordinal and continuous targets.</div><div><br/></div><div>Papers II and III deal with applying the concept of self-attention to different data domains. In Paper II, we focus on point clouds, which are modeled as unordered sets in 3D space. Although applying self-attention to sets is straightforward, we find that this mechanism by itself is not enough to improve feature learning. Instead, we propose a hierarchical approach inspired by graph neural networks, where self-attention is applied both to patches of points and to points within the patches. This results in improved predictive performance and reduced computational cost, while preserving invariance to permutations of points in the set.</div><div><br/></div><div>In Paper III, we explore the use of self-attention for auditory perception. Using a simple Transformer architecture, we achieve state-of-the-art performance for speech classification. However, deploying speech recognition models in real-world scenarios often involves making trade-offs between predictive performance and computational costs. In Paper IV, we therefore explore floating-point quantization of neural networks in the context of federated learning and propose a new method that allows training to be performed on low-precision hardware.
More specifically, we propose a method for quantization-aware training and server-to-device communication in 8-bit floating point. This allows for a significant reduction in the amount of data that needs to be communicated during the training process. Building upon the results in Paper III, we also show that our Transformer-based model can be quantized and trained in a realistic federated speech recognition setup and still achieve good performance.</div><div><br/></div><div>Papers V, VI and VII also deal with auditory perception, but from the localization point of view. This involves processing signals from microphone arrays and extracting spatial cues that enable the system to infer the location of the sound source. One such cue is the time difference of arrival (TDOA), which is estimated by correlating signals from different pairs of microphones. However, measuring TDOA in adverse acoustical conditions is difficult, which motivates the use of machine learning for this task. In Paper V, we propose a learning-based extension of a classical method for TDOA estimation that improves prediction accuracy, while simultaneously preserving some of the properties of the classical method. This is achieved by using a model architecture that is equivariant to time shifts together with an RvC training objective.</div><div><br/></div><div>TDOA estimates are often used as input to sound source localization (SSL) systems. In Paper VI, we extend the method from Paper V to predict TDOAs from multiple overlapping sound sources and show that this is a good pre-training task for extracting correlation features for an SSL system, with improved localization performance compared to popular handcrafted input features.</div><div><br/></div><div>In Paper VII, we instead focus on a single sound source, but with a variable number of microphones in the array.
Most machine learning methods for SSL are trained using a specific microphone array setup and will not work if a microphone is turned off or moved to a different position. We solve this problem by modeling pairs of audio recordings and microphone coordinates as nodes in a multi-modal graph. This enables the use of an attention-based autoencoder model that infers the location of the sound source using both microphone coordinates, i.e., a set of points in 3D space, and audio features, while preserving invariance to permutations of microphones. Furthermore, we address variants of the problem where data is partially missing, such as signals from a microphone at an unknown location.</div>}},
  author       = {{Berg, Axel}},
  isbn         = {{978-91-8104-314-3}},
  issn         = {{1404-0034}},
  keywords     = {{neural networks; deep learning; machine perception; ordinal regression; shape recognition; transformer; audio classification; quantization; sound source localization}},
  language     = {{eng}},
  number       = {{5}},
  publisher    = {{Centre for Mathematical Sciences, Lund University}},
  school       = {{Lund University}},
  series       = {{Doctoral Thesis in Mathematical Sciences}},
  title        = {{Machine Learning for Perception and Localization: Efficient and Invariant Methods}},
  url          = {{https://lup.lub.lu.se/search/files/202756396/thesis.pdf}},
  volume       = {{2024}},
  year         = {{2024}},
}