Linguistic and Syllabic Embeddings as Predictors in Auditory Attention Decoding
(2025) Department of Automatic Control
- Abstract (English)
- The human brain has a built-in capacity for selective hearing, allowing us to focus on a single speaker in noisy environments while subconsciously filtering out background chatter and other noise. This ability, colloquially known as the cocktail party effect, is often impaired in individuals with reduced hearing or other neurological conditions, which lowers their overall quality of life; current hearing aids offer limited support in addressing this issue. Auditory Attention Decoding (AAD) seeks to meet this challenge by using neural signals recorded with electroencephalography (EEG) to determine where a user’s auditory attention lies.
In this thesis, the potential of contextualized speech representations to improve AAD performance is investigated. These representations are obtained from Wav2Vec2 and Sylber, two transformer-based, self-supervised speech-processing models used for Automatic Speech Recognition (ASR). The contextualized embeddings generated by these models are used as inputs in a forward-modelling scenario, where Temporal Response Functions (TRFs) are fitted to predict neural responses to speech. A hybrid approach is also investigated, in which the embeddings of both the attended and ignored audio streams, along with the EEG signals, are processed through Deep Canonical Correlation Analysis (DCCA), which projects them into a shared latent space that maximizes their correlation.
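To make the forward-modelling step concrete, the sketch below fits a ridge-regularized TRF from Wav2Vec2 embeddings to EEG and scores it by per-channel correlation. This is a minimal illustration, not the thesis's actual pipeline: the audio and EEG arrays are random placeholders, and the checkpoint, lag count, and regularization strength are assumed values.

```python
import numpy as np
import torch
from sklearn.linear_model import Ridge
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# 1) Contextualized embeddings from a pretrained Wav2Vec2 model
#    (this checkpoint is an assumption, not necessarily the one used in the thesis).
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")

audio = np.random.randn(16000 * 10).astype(np.float32)  # placeholder: 10 s at 16 kHz
inputs = extractor(audio, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    emb = model(**inputs).last_hidden_state[0].numpy()  # (frames, 768)

# 2) Time-lagged design matrix: the TRF integrates over a window of past frames.
def lagged(X, n_lags):
    T, d = X.shape
    out = np.zeros((T, d * n_lags))
    for k in range(n_lags):
        out[k:, k * d:(k + 1) * d] = X[:T - k]
    return out

eeg = np.random.randn(emb.shape[0], 64)  # placeholder EEG, resampled to the embedding frame rate
X = lagged(emb, n_lags=10)

# 3) Ridge-regularized TRF mapping embeddings -> EEG (multi-output regression).
trf = Ridge(alpha=1e3).fit(X, eeg)
pred = trf.predict(X)

# 4) Per-channel Pearson correlation between predicted and measured EEG.
r = [np.corrcoef(pred[:, c], eeg[:, c])[0, 1] for c in range(eeg.shape[1])]
print(f"mean prediction correlation: {np.mean(r):.3f}")
```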
Consistent with previous research, we find that the linguistic embeddings generated by Wav2Vec2 can reliably serve as predictors of neural responses in forward modelling, and that their performance remains moderately strong even in noisy conditions. Further, we find that the syllabic embeddings generated by Sylber can serve a similar role on their own, but that their predictive accuracy is severely hindered in even moderately noisy conditions. Additionally, the DCCA-based approach demonstrates promising results when the mean correlation across latent dimensions is used as a feature, showing a clear sensitivity to increasing noise. Unlike previous studies that applied DCCA in single-speaker AAD settings, this work extends the approach to the more challenging two-speaker scenario, which is more representative of real-world auditory environments and highlights its potential relevance for practical applications.
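As a concrete reading of the DCCA feature described above, the sketch below computes the mean correlation across latent dimensions between the EEG projection and each speech stream's projection, and attributes attention to the stream with the higher value. The latent arrays here are hypothetical placeholders standing in for the outputs of a trained DCCA network.

```python
import numpy as np

def mean_latent_correlation(z_eeg, z_speech):
    """Mean Pearson correlation over matched latent dimensions."""
    rs = [np.corrcoef(z_eeg[:, k], z_speech[:, k])[0, 1]
          for k in range(z_eeg.shape[1])]
    return float(np.mean(rs))

# Placeholder projections: in the real setting these come from the trained DCCA.
T, K = 640, 8                                      # time steps, latent dimensions
z_eeg = np.random.randn(T, K)                      # latent projection of the EEG
z_stream_a = z_eeg + 0.5 * np.random.randn(T, K)   # stream correlated with the EEG
z_stream_b = np.random.randn(T, K)                 # stream uncorrelated with the EEG

r_a = mean_latent_correlation(z_eeg, z_stream_a)
r_b = mean_latent_correlation(z_eeg, z_stream_b)
print("decoded attention:", "stream A" if r_a > r_b else "stream B",
      round(r_a, 3), round(r_b, 3))
```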
Please use this URL to cite or link to this publication:
http://lup.lub.lu.se/student-papers/record/9206033
- author
- Löwgren, Simon and Nabage, Ruqayyah
- supervisor
- organization
- year
- 2025
- type
- H3 - Professional qualifications (4 Years - )
- subject
- report number
- TFRT-6276
- other publication id
- 0280-5316
- language
- English
- id
- 9206033
- date added to LUP
- 2025-08-08 15:11:53
- date last changed
- 2025-08-08 15:11:53
@misc{9206033,
  abstract = {{The human brain has a built-in capacity for selective hearing, allowing us to focus on a single speaker in noisy environments while subconsciously filtering out background chatter and other noise. This ability, colloquially known as the cocktail party effect, is often impaired in individuals with reduced hearing or other neurological conditions, which lowers their overall quality of life; current hearing aids offer limited support in addressing this issue. Auditory Attention Decoding (AAD) seeks to meet this challenge by using neural signals recorded with electroencephalography (EEG) to determine where a user’s auditory attention lies. In this thesis, the potential of contextualized speech representations to improve AAD performance is investigated. These representations are obtained from Wav2Vec2 and Sylber, two transformer-based, self-supervised speech-processing models used for Automatic Speech Recognition (ASR). The contextualized embeddings generated by these models are used as inputs in a forward-modelling scenario, where Temporal Response Functions (TRFs) are fitted to predict neural responses to speech. A hybrid approach is also investigated, in which the embeddings of both the attended and ignored audio streams, along with the EEG signals, are processed through Deep Canonical Correlation Analysis (DCCA), which projects them into a shared latent space that maximizes their correlation. Consistent with previous research, we find that the linguistic embeddings generated by Wav2Vec2 can reliably serve as predictors of neural responses in forward modelling, and that their performance remains moderately strong even in noisy conditions. Further, we find that the syllabic embeddings generated by Sylber can serve a similar role on their own, but that their predictive accuracy is severely hindered in even moderately noisy conditions. Additionally, the DCCA-based approach demonstrates promising results when the mean correlation across latent dimensions is used as a feature, showing a clear sensitivity to increasing noise. Unlike previous studies that applied DCCA in single-speaker AAD settings, this work extends the approach to the more challenging two-speaker scenario, which is more representative of real-world auditory environments and highlights its potential relevance for practical applications.}},
  author = {{Löwgren, Simon and Nabage, Ruqayyah}},
  language = {{eng}},
  note = {{Student Paper}},
  title = {{Linguistic and Syllabic Embeddings as Predictors in Auditory Attention Decoding}},
  year = {{2025}},
}