DMEL : THE DIFFERENTIABLE LOG-MEL SPECTROGRAM AS A TRAINABLE LAYER IN NEURAL NETWORKS
(2024) 49th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024. In ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, pp. 5005-5009
- Abstract
In this paper we present the differentiable log-Mel spectrogram (DMEL) for audio classification. DMEL uses a Gaussian window, with a window length that can be jointly optimized with the neural network. DMEL is used as the input layer in different neural networks and evaluated on standard audio datasets. We show that DMEL achieves a higher average test accuracy for sub-optimal initial choices of the window length when compared to a baseline with a fixed window length. In addition, we analyse the computational cost of DMEL and compare it to a standard hyperparameter search over different window lengths, showing favorable results for DMEL. Finally, an empirical evaluation on a carefully designed dataset is performed to investigate whether the differentiable spectrogram actually learns the optimal window length. The design of the dataset relies on the theory of spectrogram resolution. We also empirically evaluate the convergence rate to the optimal window length.
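The abstract describes DMEL only at a high level. The sketch below is a rough illustration, not the authors' implementation. It computes a log-Mel spectrogram in which the standard deviation `sigma` (in samples) of a Gaussian window plays the role of the window length, so the output varies smoothly with it. The following are all assumptions made for illustration:

- the parametrization by `sigma`;
- the fixed frame size `n_fft`;
- the HTK-style Mel formula;
- the hypothetical helpers `log_mel`, `toy_loss` and `dloss_dsigma`.

In DMEL the window length is optimized jointly with the network. To keep the sketch NumPy-only, a central finite difference stands in for the automatic differentiation a deep learning framework would provide.

```python
import numpy as np


def gaussian_window(n_fft: int, sigma: float) -> np.ndarray:
    """Gaussian window centred in a frame of n_fft samples.

    sigma (standard deviation, in samples) is a continuous parameter, so the
    window, and everything computed from it, changes smoothly with sigma.
    """
    n = np.arange(n_fft) - (n_fft - 1) / 2.0
    return np.exp(-0.5 * (n / sigma) ** 2)


def hz_to_mel(f):
    # HTK-style Mel scale (an assumption; other Mel formulas exist).
    return 2595.0 * np.log10(1.0 + np.asarray(f) / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m) / 2595.0) - 1.0)


def mel_filterbank(n_mels: int, n_fft: int, sr: int) -> np.ndarray:
    """Triangular filters mapping the n_fft // 2 + 1 rfft bins to n_mels bands."""
    freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fb = np.zeros((n_mels, freqs.size))
    for i in range(n_mels):
        left, centre, right = edges[i], edges[i + 1], edges[i + 2]
        rising = (freqs - left) / (centre - left)
        falling = (right - freqs) / (right - centre)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb


def log_mel(x, sigma, sr=16000, n_fft=512, hop=128, n_mels=40, eps=1e-6):
    """Log-Mel spectrogram of a 1-D signal x, shape (n_frames, n_mels)."""
    win = gaussian_window(n_fft, sigma)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[t * hop : t * hop + n_fft] for t in range(n_frames)])
    power = np.abs(np.fft.rfft(frames * win, axis=1)) ** 2  # (n_frames, n_fft // 2 + 1)
    mel = power @ mel_filterbank(n_mels, n_fft, sr).T       # (n_frames, n_mels)
    return np.log(mel + eps)                                 # eps avoids log(0)


def toy_loss(x, sigma):
    # Hypothetical scalar objective standing in for a network's training loss.
    return float(np.mean(log_mel(x, sigma)))


def dloss_dsigma(x, sigma, h=1e-3):
    # Central finite difference; autograd would supply this in PyTorch or JAX.
    return (toy_loss(x, sigma + h) - toy_loss(x, sigma - h)) / (2.0 * h)


# Usage: a 440 Hz tone in light noise, one second at 16 kHz.
sr = 16000
rng = np.random.default_rng(0)
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440.0 * t) + 0.1 * rng.standard_normal(sr)

S = log_mel(x, sigma=64.0)
print(S.shape)  # (122, 40)
print(dloss_dsigma(x, 64.0))
```

With `n_fft = 512` and `sigma = 64`, the frame holds four standard deviations on each side of the centre, so truncating the Gaussian has little effect. A larger `sigma` would need a larger `n_fft`. Written with the array operations of a framework such as PyTorch or JAX, the same computation would let `sigma` be trained alongside the network weights by backpropagation.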
- author
- Martinsson, John and Sandsten, Maria
- organization
- Mathematical Statistics
- LU Profile Area: Light and Materials
- LU Profile Area: Natural and Artificial Cognition
- LTH Profile Area: Nanoscience and Semiconductor Technology
- LTH Profile Area: AI and Digitalization
- LTH Profile Area: Engineering Health
- NanoLund: Centre for Nanoscience
- eSSENCE: The e-Science Collaboration
- Statistical Signal Processing Group (research group)
- publishing date
- 2024
- type
- Chapter in Book/Report/Conference proceeding
- publication status
- published
- subject
- keywords
- adaptive transforms, audio classification, Deep learning, learnable Mel spectrogram, STFT
- host publication
- 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 - Proceedings
- series title
- ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
- pages
- 5 pages
- publisher
- IEEE - Institute of Electrical and Electronics Engineers Inc.
- conference name
- 49th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024
- conference location
- Seoul, Republic of Korea
- conference dates
- 2024-04-14 - 2024-04-19
- external identifiers
- scopus:85195408870
- ISSN
- 1520-6149
- ISBN
- 9798350344851
- DOI
- 10.1109/ICASSP48485.2024.10446816
- language
- English
- LU publication?
- yes
- id
- 32c15a29-82d8-4558-9537-0ece70f446fc
- date added to LUP
- 2024-09-12 14:27:35
- date last changed
- 2024-09-12 16:20:13
@inproceedings{32c15a29-82d8-4558-9537-0ece70f446fc,
  abstract = {{In this paper we present the differentiable log-Mel spectrogram (DMEL) for audio classification. DMEL uses a Gaussian window, with a window length that can be jointly optimized with the neural network. DMEL is used as the input layer in different neural networks and evaluated on standard audio datasets. We show that DMEL achieves a higher average test accuracy for sub-optimal initial choices of the window length when compared to a baseline with a fixed window length. In addition, we analyse the computational cost of DMEL and compare to a standard hyperparameter search over different window lengths, showing favorable results for DMEL. Finally, an empirical evaluation on a carefully designed dataset is performed to investigate if the differentiable spectrogram actually learns the optimal window length. The design of the dataset relies on the theory of spectrogram resolution. We also empirically evaluate the convergence rate to the optimal window length.}},
  author = {{Martinsson, John and Sandsten, Maria}},
  booktitle = {{2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 - Proceedings}},
  isbn = {{9798350344851}},
  issn = {{1520-6149}},
  keywords = {{adaptive transforms; audio classification; Deep learning; learnable Mel spectrogram; STFT}},
  language = {{eng}},
  pages = {{5005--5009}},
  publisher = {{IEEE - Institute of Electrical and Electronics Engineers Inc.}},
  series = {{ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings}},
  title = {{DMEL : THE DIFFERENTIABLE LOG-MEL SPECTROGRAM AS A TRAINABLE LAYER IN NEURAL NETWORKS}},
  url = {{http://dx.doi.org/10.1109/ICASSP48485.2024.10446816}},
  doi = {{10.1109/ICASSP48485.2024.10446816}},
  year = {{2024}},
}