Learning Multi-Target TDOA Features for Sound Event Localization and Detection
(2024) Workshop on Detection and Classification of Acoustic Scenes and Events, DCASE 2024, p. 16-20
- Abstract
- Sound event localization and detection (SELD) systems using audio recordings from a microphone array rely on spatial cues for determining the location of sound events. As a consequence, the localization performance of such systems is to a large extent determined by the quality of the audio features that are used as inputs to the system. We propose a new feature, based on neural generalized cross-correlations with phase-transform (NGCC-PHAT), that learns audio representations suitable for localization. Using permutation invariant training for the time-difference of arrival (TDOA) estimation problem enables NGCC-PHAT to learn TDOA features for multiple overlapping sound events. These features can be used as a drop-in replacement for GCC-PHAT inputs to a SELD-network. We test our method on the STARSS23 dataset and demonstrate improved localization performance compared to using standard GCC-PHAT or SALSA-Lite input features.
Please use this url to cite or link to this publication:
https://lup.lub.lu.se/record/c2719617-5e34-4797-8f0d-61adc5c6108c
- author
- Berg, Axel (LU); Engman, Johanna (LU); Gulin, Jens (LU); Åström, Kalle (LU) and Oskarsson, Magnus (LU)
- publishing date
- 2024
- type
- Chapter in Book/Report/Conference proceeding
- publication status
- published
- keywords
- sound event localization and detection, time difference of arrival, generalized cross-correlation
- host publication
- Proceedings of the Detection and Classification of Acoustic Scenes and Events 2024 Workshop (DCASE2024)
- pages
- 16 - 20
- publisher
- Zenodo
- conference name
- Workshop on Detection and Classification of Acoustic Scenes and Events, DCASE 2024
- conference location
- Tokyo, Japan
- conference dates
- 2024-10-23 - 2024-10-25
- ISBN
- 978-952-03-3171-9
- language
- English
- LU publication?
- yes
- id
- c2719617-5e34-4797-8f0d-61adc5c6108c
- alternative location
- https://dcase.community/documents/workshop2024/proceedings/DCASE2024Workshop_Berg_46.pdf
- date added to LUP
- 2024-11-04 08:28:34
- date last changed
- 2025-09-16 08:51:34
@inproceedings{c2719617-5e34-4797-8f0d-61adc5c6108c,
abstract = {{Sound event localization and detection (SELD) systems using audio recordings from a microphone array rely on spatial cues for determining the location of sound events. As a consequence, the localization performance of such systems is to a large extent determined by the quality of the audio features that are used as inputs to the system. We propose a new feature, based on neural generalized cross-correlations with phase-transform (NGCC-PHAT), that learns audio representations suitable for localization. Using permutation invariant training for the time-difference of arrival (TDOA) estimation problem enables NGCC-PHAT to learn TDOA features for multiple overlapping sound events. These features can be used as a drop-in replacement for GCC-PHAT inputs to a SELD-network. We test our method on the STARSS23 dataset and demonstrate improved localization performance compared to using standard GCC-PHAT or SALSA-Lite input features.}},
author = {{Berg, Axel and Engman, Johanna and Gulin, Jens and Åström, Kalle and Oskarsson, Magnus}},
booktitle = {{Proceedings of the Detection and Classification of Acoustic Scenes and Events 2024 Workshop (DCASE2024)}},
isbn = {{978-952-03-3171-9}},
keywords = {{sound event localization and detection; time difference of arrival; generalized cross-correlation}},
language = {{eng}},
pages = {{16--20}},
publisher = {{Zenodo}},
title = {{Learning Multi-Target TDOA Features for Sound Event Localization and Detection}},
url = {{https://dcase.community/documents/workshop2024/proceedings/DCASE2024Workshop_Berg_46.pdf}},
year = {{2024}},
}