Low-Variance Multitaper MFCC Features: A Case Study in Robust Speaker Verification
(2012) In IEEE Transactions on Audio, Speech, and Language Processing 20(7), pp. 1990-2001
- Abstract
- In speech and audio applications, the short-term signal spectrum is often represented using mel-frequency cepstral coefficients (MFCCs) computed from a windowed discrete Fourier transform (DFT). Windowing reduces spectral leakage, but the variance of the spectrum estimate remains high. An elegant extension of the windowed DFT is the so-called multitaper method, which uses multiple time-domain windows (tapers) with frequency-domain averaging. Multitapers have received little attention in speech processing even though they produce low-variance features. In this paper, we propose the multitaper method for MFCC extraction with a practical focus. We first provide a detailed statistical analysis of MFCC bias and variance using autoregressive process simulations on the TIMIT corpus. For speaker verification experiments on the NIST 2002 and 2008 SRE corpora, we consider three Gaussian mixture model based classifiers: with universal background model (GMM-UBM), support vector machine (GMM-SVM), and joint factor analysis (GMM-JFA). Multitapers improve MinDCF over the baseline windowed DFT by a relative 20.4% (GMM-SVM) and 13.7% (GMM-JFA) on the interview-interview condition in NIST 2008. The GMM-JFA system further reduces MinDCF by 18.7% on the telephone data. With these improvements and generally noncritical parameter selection, multitaper MFCCs are a viable candidate for replacing conventional MFCCs.
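The core idea the abstract describes, averaging periodograms computed with several orthogonal tapers to reduce the variance of the spectrum estimate, can be sketched in a few lines of NumPy/SciPy. This is a minimal illustration using Slepian (DPSS) tapers, not the paper's exact taper families or parameter settings; the function name and the `n_tapers`/`nw` choices are illustrative assumptions.

```python
import numpy as np
from scipy.signal.windows import dpss  # Slepian (DPSS) tapers

def multitaper_spectrum(frame, n_tapers=5, nw=3.0):
    """Low-variance spectrum estimate: one periodogram per taper,
    then a uniform average across tapers (the simplest multitaper
    weighting). Illustrative sketch, not the paper's configuration."""
    n = len(frame)
    tapers = dpss(n, nw, Kmax=n_tapers)              # shape (n_tapers, n)
    spectra = np.abs(np.fft.rfft(tapers * frame, axis=1)) ** 2
    return spectra.mean(axis=0)

# On white noise, the multitaper estimate fluctuates far less than a
# single Hamming-windowed periodogram of the same frame.
rng = np.random.default_rng(0)
x = rng.standard_normal(400)
mt = multitaper_spectrum(x)
single = np.abs(np.fft.rfft(np.hamming(400) * x)) ** 2
```

The variance reduction comes from the tapers' orthogonality: each taper yields a roughly independent spectrum estimate, so averaging `K` of them cuts the variance by about a factor of `K`, which is what makes the downstream MFCCs lower-variance as well.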
Please use this url to cite or link to this publication:
https://lup.lub.lu.se/record/2826390
- author
- Kinnunen, Tomi; Saeidi, Rahim; Sedlak, Filip; Lee, Kong Aik; Sandberg, Johan (LU); Sandsten, Maria (LU) and Li, Haizhou
- publishing date
- 2012
- type
- Contribution to journal
- publication status
- published
- keywords
- Mel-frequency cepstral coefficient (MFCC), multitaper, small-variance, estimation, speaker verification
- in
- IEEE Transactions on Audio, Speech, and Language Processing
- volume
- 20
- issue
- 7
- pages
- 1990 - 2001
- publisher
- IEEE - Institute of Electrical and Electronics Engineers Inc.
- external identifiers
- wos:000303893000003
- scopus:84860850285
- ISSN
- 1558-7924
- DOI
- 10.1109/TASL.2012.2191960
- language
- English
- LU publication?
- yes
- id
- e055865e-5dbc-4a3f-8a32-1629664eca7f (old id 2826390)
- date added to LUP
- 2016-04-01 10:00:55
- date last changed
- 2022-04-12 01:09:22
@article{e055865e-5dbc-4a3f-8a32-1629664eca7f,
  abstract  = {{In speech and audio applications, short-term signal spectrum is often represented using mel-frequency cepstral coefficients (MFCCs) computed from a windowed discrete Fourier transform (DFT). Windowing reduces spectral leakage but variance of the spectrum estimate remains high. An elegant extension to windowed DFT is the so-called multitaper method which uses multiple time-domain windows (tapers) with frequency-domain averaging. Multitapers have received little attention in speech processing even though they produce low-variance features. In this paper, we propose the multitaper method for MFCC extraction with a practical focus. We provide, first, detailed statistical analysis of MFCC bias and variance using autoregressive process simulations on the TIMIT corpus. For speaker verification experiments on the NIST 2002 and 2008 SRE corpora, we consider three Gaussian mixture model based classifiers with universal background model (GMM-UBM), support vector machine (GMM-SVM) and joint factor analysis (GMM-JFA). Multitapers improve MinDCF over the baseline windowed DFT by relative 20.4% (GMM-SVM) and 13.7% (GMM-JFA) on the interview-interview condition in NIST 2008. The GMM-JFA system further reduces MinDCF by 18.7% on the telephone data. With these improvements and generally noncritical parameter selection, multitaper MFCCs are a viable candidate for replacing the conventional MFCCs.}},
  author    = {{Kinnunen, Tomi and Saeidi, Rahim and Sedlak, Filip and Lee, Kong Aik and Sandberg, Johan and Sandsten, Maria and Li, Haizhou}},
  issn      = {{1558-7924}},
  keywords  = {{Mel-frequency cepstral coefficient (MFCC); multitaper; small-variance; estimation; speaker verification}},
  language  = {{eng}},
  number    = {{7}},
  pages     = {{1990--2001}},
  publisher = {{IEEE - Institute of Electrical and Electronics Engineers Inc.}},
  series    = {{IEEE Transactions on Audio, Speech, and Language Processing}},
  title     = {{Low-Variance Multitaper MFCC Features: A Case Study in Robust Speaker Verification}},
  url       = {{http://dx.doi.org/10.1109/TASL.2012.2191960}},
  doi       = {{10.1109/TASL.2012.2191960}},
  volume    = {{20}},
  year      = {{2012}},
}