THE LU SYSTEM FOR DCASE 2024 SOUND EVENT LOCALIZATION AND DETECTION CHALLENGE
(2024)
- Abstract
- This technical report gives an overview of our submission to task 3 of the DCASE 2024 challenge. We present a sound event localization and detection (SELD) system using input features based on trainable neural generalized cross-correlations with phase transform (NGCC-PHAT). With these features together with spectrograms as input to a Transformer-based network, we achieve significant improvements over the baseline method. In addition, we also present an audio-visual version of our system, where distance predictions are updated using depth maps from the panorama video frames.
Please use this URL to cite or link to this publication:
https://lup.lub.lu.se/record/ccb5d1f3-8c87-4398-9b2d-c3260c0f2fd3
- author
- Berg, Axel LU; Engman, Johanna LU; Gulin, Jens LU; Åström, Karl LU and Oskarsson, Magnus LU
- organization
- Computer Vision and Machine Learning (research group)
- LU Profile Area: Natural and Artificial Cognition
- Integrated Electronic Systems
- Mathematical Imaging Group (research group)
- LTH Profile Area: AI and Digitalization
- ELLIIT: the Linköping-Lund initiative on IT and mobile communication
- eSSENCE: The e-Science Collaboration
- Stroke Imaging Research group (research group)
- LTH Profile Area: Engineering Health
- publishing date
- 2024-06-30
- type
- Book/Report
- publication status
- published
- subject
- language
- English
- LU publication?
- yes
- id
- ccb5d1f3-8c87-4398-9b2d-c3260c0f2fd3
- alternative location
- https://dcase.community/documents/challenge2024/technical_reports/DCASE2024_Berg_24_t3.pdf
- date added to LUP
- 2024-07-22 21:57:31
- date last changed
- 2025-04-04 14:21:11
@techreport{ccb5d1f3-8c87-4398-9b2d-c3260c0f2fd3,
  abstract = {{This technical report gives an overview of our submission to task 3 of the DCASE 2024 challenge. We present a sound event localization and detection (SELD) system using input features based on trainable neural generalized cross-correlations with phase transform (NGCC-PHAT). With these features together with spectrograms as input to a Transformer-based network, we achieve significant improvements over the baseline method. In addition, we also present an audio-visual version of our system, where distance predictions are updated using depth maps from the panorama video frames.}},
  author = {{Berg, Axel and Engman, Johanna and Gulin, Jens and Åström, Karl and Oskarsson, Magnus}},
  language = {{eng}},
  month = {{06}},
  title = {{THE LU SYSTEM FOR DCASE 2024 SOUND EVENT LOCALIZATION AND DETECTION CHALLENGE}},
  url = {{https://dcase.community/documents/challenge2024/technical_reports/DCASE2024_Berg_24_t3.pdf}},
  year = {{2024}},
}