wav2pos: Sound Source Localization using Masked Autoencoders
Berg, Axel; Gulin, Jens; O'Connor, Mark; Zhou, Chuteng, et al. (2024). wav2pos: Sound Source Localization using Masked Autoencoders. In 2024 14th International Conference on Indoor Positioning and Indoor Navigation (IPIN), pp. 1-8. Hong Kong: IEEE - Institute of Electrical and Electronics Engineers Inc.
Conference Proceeding/Paper | Published | English
Authors:
Berg, Axel; Gulin, Jens; O'Connor, Mark; Zhou, Chuteng, et al.
Department:
Computer Vision and Machine Learning
Integrated Electronic Systems
LU Profile Area: Natural and Artificial Cognition
ELLIIT: the Linköping-Lund initiative on IT and mobile communication
Project:
Deep Learning for Simultaneous Localization and Mapping
Research Group:
Computer Vision and Machine Learning
Integrated Electronic Systems
Abstract:
We present a novel approach to the 3D sound source localization task for distributed ad-hoc microphone arrays by formulating it as a set-to-set regression problem. By training a multi-modal masked autoencoder model that operates on audio recordings and microphone coordinates, we show that such a formulation allows accurate localization of the sound source by reconstructing coordinates masked in the input. Our approach is flexible in the sense that a single model can be used with an arbitrary number of microphones, even when some of the audio recordings and microphone coordinates are missing. We test our method on simulated and real-world recordings of music and speech in indoor environments, and demonstrate competitive performance compared to both classical and other learning-based localization methods.
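The set-to-set formulation in the abstract can be illustrated with a small sketch of how inputs and regression targets might be organized. It assumes, purely for illustration, that each microphone contributes one audio token and one coordinate token, that the unknown source position enters the set as an always-masked coordinate token, and that error is measured only on masked coordinates. The names `Token`, `SOURCE`, `build_input_set` and `masked_coord_error` are hypothetical, the transformer encoder/decoder is omitted entirely, and this is not the wav2pos implementation.

```python
import math
from dataclasses import dataclass
from typing import AbstractSet, Dict, List, Optional, Sequence, Tuple, Union

Coord = Tuple[float, float, float]
SOURCE = -1  # hypothetical node index reserved for the sound source


@dataclass
class Token:
    """One element of the (hypothetical) unordered input set."""

    kind: str  # "audio" or "coord"
    node: int  # microphone index, or SOURCE for the sound source
    value: Optional[Union[List[float], Coord]]  # None when masked
    masked: bool


def build_input_set(
    recordings: Sequence[Sequence[float]],
    mic_coords: Sequence[Coord],
    missing_audio: AbstractSet[int] = frozenset(),
    missing_coords: AbstractSet[int] = frozenset(),
) -> List[Token]:
    """Turn any number of microphones into one set of tokens.

    Missing recordings or coordinates become masked tokens, and the source
    position is always appended as a masked coordinate token.
    """
    if len(recordings) != len(mic_coords):
        raise ValueError("need exactly one coordinate per recording")
    tokens: List[Token] = []
    for i, (rec, xyz) in enumerate(zip(recordings, mic_coords)):
        audio_masked = i in missing_audio
        coord_masked = i in missing_coords
        tokens.append(Token("audio", i, None if audio_masked else list(rec), audio_masked))
        tokens.append(Token("coord", i, None if coord_masked else xyz, coord_masked))
    # The unknown source position: a coordinate token carrying no value.
    tokens.append(Token("coord", SOURCE, None, True))
    return tokens


def masked_coord_error(
    tokens: Sequence[Token],
    predicted: Dict[int, Coord],
    truth: Dict[int, Coord],
) -> float:
    """Mean Euclidean error over the masked coordinate tokens only.

    Both dicts map a node index (SOURCE for the source) to an (x, y, z)
    triple; unmasked coordinates and audio tokens are ignored.
    """
    targets = [t.node for t in tokens if t.kind == "coord" and t.masked]
    errors = [math.dist(predicted[n], truth[n]) for n in targets]
    return sum(errors) / len(errors)
```

For example, with three microphones where microphone 2's recording and microphone 1's coordinates are missing, `build_input_set` yields seven tokens, three of them masked, and `masked_coord_error` averages over microphone 1's coordinates and the source position. Because the inputs form a set rather than a fixed-size array, the same functions work unchanged for any number of microphones.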
Keywords:
sound source localization; masked autoencoders; transformers