Learning Multi-Target TDOA Features for Sound Event Localization and Detection

Berg, Axel; Engman, Johanna; Gulin, Jens; Åström, Kalle; Oskarsson, Magnus

(2024) Workshop on Detection and Classification of Acoustic Scenes and Events, DCASE 2024 p.16-20

Abstract: Sound event localization and detection (SELD) systems using audio recordings from a microphone array rely on spatial cues for determining the location of sound events. As a consequence, the localization performance of such systems is to a large extent determined by the quality of the audio features that are used as inputs to the system. We propose a new feature, based on neural generalized cross-correlations with phase-transform (NGCC-PHAT), that learns audio representations suitable for localization. Using permutation invariant training for the time-difference of arrival (TDOA) estimation problem enables NGCC-PHAT to learn TDOA features for multiple overlapping sound events. These features can be used as a drop-in replacement for GCC-PHAT inputs to a SELD-network. We test our method on the STARSS23 dataset and demonstrate improved localization performance compared to using standard GCC-PHAT or SALSA-Lite input features.
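
For readers unfamiliar with the baseline feature named in the abstract, the following is a minimal sketch of classical GCC-PHAT for a single microphone pair, written with NumPy. It is not the NGCC-PHAT method of the paper and is not taken from the authors' code; the function name `gcc_phat`, the parameter `max_delay`, the small regularization constant, and the synthetic test signal are all assumptions made for illustration.

```python
import numpy as np


def gcc_phat(x1, x2, max_delay):
    """Classical GCC-PHAT between two signals (illustrative sketch).

    Returns the correlation over lags -max_delay..max_delay (in samples)
    and the lag of the highest peak. A positive lag means x1 arrives
    later than x2. Assumes max_delay >= 1.
    """
    n = len(x1) + len(x2)                 # zero-pad to avoid circular wrap-around
    X1 = np.fft.rfft(x1, n=n)
    X2 = np.fft.rfft(x2, n=n)
    cross = X1 * np.conj(X2)              # cross-power spectrum
    cross /= np.abs(cross) + 1e-12        # phase transform: keep phase, drop magnitude
    cc = np.fft.irfft(cross, n=n)
    # Reorder so that index 0 corresponds to lag -max_delay.
    cc = np.concatenate((cc[-max_delay:], cc[:max_delay + 1]))
    lags = np.arange(-max_delay, max_delay + 1)
    return cc, int(lags[np.argmax(cc)])


# Synthetic example (assumed data): white noise delayed by 5 samples.
rng = np.random.default_rng(0)
source = rng.standard_normal(1024)
true_delay = 5
delayed = np.concatenate((np.zeros(true_delay), source[:-true_delay]))

cc, tdoa = gcc_phat(delayed, source, max_delay=10)
print("estimated TDOA in samples:", tdoa)
```

In a SELD pipeline, correlation vectors such as `cc` are typically computed per frame and per microphone pair and stacked as network input; according to the abstract, the learned NGCC-PHAT features are intended as a drop-in replacement for that input.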

Please use this url to cite or link to this publication: https://lup.lub.lu.se/record/c2719617-5e34-4797-8f0d-61adc5c6108c

author

Berg, Axel; Engman, Johanna; Gulin, Jens; Åström, Kalle; Oskarsson, Magnus

organization

publishing date

2024

type

Chapter in Book/Report/Conference proceeding

publication status

published

subject

Signal Processing

keywords

sound event localization and detection, time difference of arrival, generalized cross-correlation

host publication

Proceedings of the Detection and Classification of Acoustic Scenes and Events 2024 Workshop (DCASE2024)

pages

16 - 20

publisher

Zenodo

conference name

Workshop on Detection and Classification of Acoustic Scenes and Events, DCASE 2024

conference location

Tokyo, Japan

conference dates

2024-10-23 - 2024-10-25

ISBN

978-952-03-3171-9

language

English

LU publication?

yes

id

c2719617-5e34-4797-8f0d-61adc5c6108c

alternative location

https://dcase.community/documents/workshop2024/proceedings/DCASE2024Workshop_Berg_46.pdf

date added to LUP

2024-11-04 08:28:34

date last changed

2025-04-04 14:12:51

@inproceedings{c2719617-5e34-4797-8f0d-61adc5c6108c,
  abstract     = {{Sound event localization and detection (SELD) systems using audio recordings from a microphone array rely on spatial cues for determining the location of sound events. As a consequence, the localization performance of such systems is to a large extent determined by the quality of the audio features that are used as inputs to the system. We propose a new feature, based on neural generalized cross-correlations with phase-transform (NGCC-PHAT), that learns audio representations suitable for localization. Using permutation invariant training for the time-difference of arrival (TDOA) estimation problem enables NGCC-PHAT to learn TDOA features for multiple overlapping sound events. These features can be used as a drop-in replacement for GCC-PHAT inputs to a SELD-network. We test our method on the STARSS23 dataset and demonstrate improved localization performance compared to using standard GCC-PHAT or SALSA-Lite input features.}},
  author       = {{Berg, Axel and Engman, Johanna and Gulin, Jens and Åström, Kalle and Oskarsson, Magnus}},
  booktitle    = {{Proceedings of the Detection and Classification of Acoustic Scenes and Events 2024 Workshop (DCASE2024)}},
  isbn         = {{978-952-03-3171-9}},
  keywords     = {{sound event localization and detection; time difference of arrival; generalized cross-correlation}},
  language     = {{eng}},
  pages        = {{16--20}},
  publisher    = {{Zenodo}},
  title        = {{Learning Multi-Target TDOA Features for Sound Event Localization and Detection}},
  url          = {{https://dcase.community/documents/workshop2024/proceedings/DCASE2024Workshop_Berg_46.pdf}},
  year         = {{2024}},
}
