wav2pos: Sound Source Localization using Masked Autoencoders

Berg, Axel; Gulin, Jens; O'Connor, Mark; Zhou, Chuteng; Åström, Kalle; Oskarsson, Magnus

(2024) 2024 14th International Conference on Indoor Positioning and Indoor Navigation (IPIN). In International Conference on Indoor Positioning and Indoor Navigation (IPIN), p. 1-8

Abstract: We present a novel approach to the 3D sound source localization task for distributed ad-hoc microphone arrays by formulating it as a set-to-set regression problem. By training a multi-modal masked autoencoder model that operates on audio recordings and microphone coordinates, we show that such a formulation allows for accurate localization of the sound source, by reconstructing coordinates masked in the input. Our approach is flexible in the sense that a single model can be used with an arbitrary number of microphones, even when a subset of audio recordings and microphone coordinates are missing. We test our method on simulated and real-world recordings of music and speech in indoor environments, and demonstrate competitive performance compared to both classical and other learning based localization methods.

Please use this url to cite or link to this publication: https://lup.lub.lu.se/record/4b3af846-795b-4ac7-956f-6aded73bc4e1

author

Berg, Axel (LU); Gulin, Jens (LU); O'Connor, Mark; Zhou, Chuteng; Åström, Kalle (LU) and Oskarsson, Magnus (LU)

publishing date

2024

type

Chapter in Book/Report/Conference proceeding

publication status

published

subject

Signal Processing

keywords

sound source localization, masked autoencoders, transformers

host publication

2024 14th International Conference on Indoor Positioning and Indoor Navigation (IPIN)

series title

International Conference on Indoor Positioning and Indoor Navigation (IPIN)

pages

8 pages

publisher

IEEE - Institute of Electrical and Electronics Engineers Inc.

conference name

2024 14th International Conference on Indoor Positioning and Indoor Navigation (IPIN)

conference location

Hong Kong

conference dates

2024-10-14 - 2024-10-17

external identifiers

scopus:85216392587

ISSN

2471-917X

2162-7347

ISBN

979-8-3503-6641-9

979-8-3503-6640-2

DOI

10.1109/IPIN62893.2024.10786105

project

Deep Learning for Simultaneous Localization and Mapping

language

English

LU publication?

yes

id

4b3af846-795b-4ac7-956f-6aded73bc4e1

alternative location

https://arxiv.org/abs/2408.15771

date added to LUP

2024-11-27 08:51:13

date last changed

2026-02-13 06:58:56

@inproceedings{4b3af846-795b-4ac7-956f-6aded73bc4e1,
  abstract     = {{We present a novel approach to the 3D sound source localization task for distributed ad-hoc microphone arrays by formulating it as a set-to-set regression problem. By training a multi-modal masked autoencoder model that operates on audio recordings and microphone coordinates, we show that such a formulation allows for accurate localization of the sound source, by reconstructing coordinates masked in the input. Our approach is flexible in the sense that a single model can be used with an arbitrary number of microphones, even when a subset of audio recordings and microphone coordinates are missing. We test our method on simulated and real-world recordings of music and speech in indoor environments, and demonstrate competitive performance compared to both classical and other learning based localization methods.}},
  author       = {{Berg, Axel and Gulin, Jens and O'Connor, Mark and Zhou, Chuteng and Åström, Kalle and Oskarsson, Magnus}},
  booktitle    = {{2024 14th International Conference on Indoor Positioning and Indoor Navigation (IPIN)}},
  isbn         = {{979-8-3503-6641-9}},
  issn         = {{2471-917X}},
  keywords     = {{sound source localization; masked autoencoders; transformers}},
  language     = {{eng}},
  pages        = {{1--8}},
  publisher    = {{IEEE - Institute of Electrical and Electronics Engineers Inc.}},
  series       = {{International Conference on Indoor Positioning and Indoor Navigation (IPIN)}},
  title        = {{wav2pos: Sound Source Localization using Masked Autoencoders}},
  url          = {{http://dx.doi.org/10.1109/IPIN62893.2024.10786105}},
  doi          = {{10.1109/IPIN62893.2024.10786105}},
  year         = {{2024}},
}
