Linguistic and Syllabic Embeddings as Predictors in Auditory Attention Decoding
(2025) Department of Automatic Control
- Abstract (English)
- The human brain has a built-in capacity for selective hearing, allowing us to focus on a single speaker in noisy environments while subconsciously filtering out background chatter and other noise. This ability, colloquially known as the cocktail party effect, is often impaired in individuals with reduced hearing or other neurological conditions, which lowers their overall quality of life; current hearing aids offer limited support in addressing this issue. Auditory Attention Decoding (AAD) seeks to meet this challenge by using neural signals recorded with electroencephalography (EEG) to determine where a user’s auditory attention lies.
In this thesis, the potential of contextualized speech representations to improve AAD performance is investigated. These representations are obtained from Wav2Vec2 and Sylber, two transformer-based, self-supervised speech-processing models used for Automatic Speech Recognition (ASR). The contextualized embeddings generated by these models are used as inputs in a forward-modelling scenario, where Temporal Response Functions (TRFs) are fitted to predict neural responses to speech. A hybrid approach is also investigated, in which the embeddings of both the attended and ignored audio streams, along with the EEG signals, are processed through Deep Canonical Correlation Analysis (DCCA), which projects them into a shared latent space that maximizes their correlation.
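To make the forward-modelling step concrete, the sketch below fits a ridge-regularized TRF from Wav2Vec2 embeddings to EEG and scores it by per-channel correlation. This is a minimal illustration, not the thesis's actual pipeline: the audio and EEG arrays are random placeholders, and the checkpoint, lag count, and regularization strength are assumed values.

```python
import numpy as np
import torch
from sklearn.linear_model import Ridge
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# 1) Contextualized embeddings from a pretrained Wav2Vec2 model
#    (this checkpoint is an assumption, not necessarily the one used in the thesis).
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")

audio = np.random.randn(16000 * 10).astype(np.float32)  # placeholder: 10 s at 16 kHz
inputs = extractor(audio, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    emb = model(**inputs).last_hidden_state[0].numpy()  # (frames, 768)

# 2) Time-lagged design matrix: the TRF integrates over a window of past frames.
def lagged(X, n_lags):
    T, d = X.shape
    out = np.zeros((T, d * n_lags))
    for k in range(n_lags):
        out[k:, k * d:(k + 1) * d] = X[:T - k]
    return out

eeg = np.random.randn(emb.shape[0], 64)  # placeholder EEG, resampled to the embedding frame rate
X = lagged(emb, n_lags=10)

# 3) Ridge-regularized TRF mapping embeddings -> EEG (multi-output regression).
trf = Ridge(alpha=1e3).fit(X, eeg)
pred = trf.predict(X)

# 4) Per-channel Pearson correlation between predicted and measured EEG.
r = [np.corrcoef(pred[:, c], eeg[:, c])[0, 1] for c in range(eeg.shape[1])]
print(f"mean prediction correlation: {np.mean(r):.3f}")
```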
Consistent with previous research, we find that the linguistic embeddings generated by Wav2Vec2 can reliably serve as predictors of neural responses in forward modelling, and that their performance remains moderately strong even in noisy conditions. Further, we find that the syllabic embeddings generated by Sylber can serve a similar role on their own, but that their predictive accuracy is severely hindered in even moderately noisy conditions. Additionally, the DCCA-based approach demonstrates promising results when the mean correlation across latent dimensions is used as a feature, showing a clear sensitivity to increasing noise. Unlike previous studies that applied DCCA in single-speaker AAD settings, this work extends the approach to the more challenging two-speaker scenario, which is more representative of real-world auditory environments and highlights its potential relevance for practical applications.
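As a concrete reading of the DCCA feature described above, the sketch below computes the mean correlation across latent dimensions between the EEG projection and each speech stream's projection, and attributes attention to the stream with the higher value. The latent arrays here are hypothetical placeholders standing in for the outputs of a trained DCCA network.

```python
import numpy as np

def mean_latent_correlation(z_eeg, z_speech):
    """Mean Pearson correlation over matched latent dimensions."""
    rs = [np.corrcoef(z_eeg[:, k], z_speech[:, k])[0, 1]
          for k in range(z_eeg.shape[1])]
    return float(np.mean(rs))

# Placeholder projections: in the real setting these come from the trained DCCA.
T, K = 640, 8                                      # time steps, latent dimensions
z_eeg = np.random.randn(T, K)                      # latent projection of the EEG
z_stream_a = z_eeg + 0.5 * np.random.randn(T, K)   # stream correlated with the EEG
z_stream_b = np.random.randn(T, K)                 # stream uncorrelated with the EEG

r_a = mean_latent_correlation(z_eeg, z_stream_a)
r_b = mean_latent_correlation(z_eeg, z_stream_b)
print("decoded attention:", "stream A" if r_a > r_b else "stream B",
      round(r_a, 3), round(r_b, 3))
```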
Please use this URL to cite or link to this publication:
http://lup.lub.lu.se/student-papers/record/9206033
- author
- Löwgren, Simon and Nabage, Ruqayyah
- supervisor
- organization
- year
- 2025
- type
- H3 - Professional qualifications (4 Years - )
- subject
- report number
- TFRT-6276
- other publication id
- 0280-5316
- language
- English
- id
- 9206033
- date added to LUP
- 2025-08-08 15:11:53
- date last changed
- 2025-08-08 15:11:53
@misc{9206033,
  abstract = {{The human brain has a built-in capacity for selective hearing, allowing us to focus on a single speaker in noisy environments while subconsciously filtering out background chatter and other noise. This ability, colloquially known as the cocktail party effect, is often impaired in individuals with reduced hearing or other neurological conditions, which lowers their overall quality of life; current hearing aids offer limited support in addressing this issue. Auditory Attention Decoding (AAD) seeks to meet this challenge by using neural signals recorded with electroencephalography (EEG) to determine where a user’s auditory attention lies. In this thesis, the potential of contextualized speech representations to improve AAD performance is investigated. These representations are obtained from Wav2Vec2 and Sylber, two transformer-based, self-supervised speech-processing models used for Automatic Speech Recognition (ASR). The contextualized embeddings generated by these models are used as inputs in a forward-modelling scenario, where Temporal Response Functions (TRFs) are fitted to predict neural responses to speech. A hybrid approach is also investigated, in which the embeddings of both the attended and ignored audio streams, along with the EEG signals, are processed through Deep Canonical Correlation Analysis (DCCA), which projects them into a shared latent space that maximizes their correlation. Consistent with previous research, we find that the linguistic embeddings generated by Wav2Vec2 can reliably serve as predictors of neural responses in forward modelling, and that their performance remains moderately strong even in noisy conditions. Further, we find that the syllabic embeddings generated by Sylber can serve a similar role on their own, but that their predictive accuracy is severely hindered in even moderately noisy conditions. Additionally, the DCCA-based approach demonstrates promising results when the mean correlation across latent dimensions is used as a feature, showing a clear sensitivity to increasing noise. Unlike previous studies that applied DCCA in single-speaker AAD settings, this work extends the approach to the more challenging two-speaker scenario, which is more representative of real-world auditory environments and highlights its potential relevance for practical applications.}},
  author = {{Löwgren, Simon and Nabage, Ruqayyah}},
  language = {{eng}},
  note = {{Student Paper}},
  title = {{Linguistic and Syllabic Embeddings as Predictors in Auditory Attention Decoding}},
  year = {{2025}},
}