
LUP Student Papers

LUND UNIVERSITY LIBRARIES

Speech activity detection in videos

Andersson, Viktor and Ostréus, Nelly (2022)
Department of Automatic Control
Abstract
Speech is an important means of communication all over the world. Speech information is encoded both aurally and visually. More than 1.5 billion people have hearing loss, and for them the visual information is even more important than it is for people with normal hearing. Lip reading is therefore an important research topic.
In this master's thesis, machine learning algorithms were used to identify speech activity in realistic videos with monologues and dialogues. Each video contained three persons speaking: one performing a monologue and two performing a dialogue. Support vector machines with linear, radial basis function, sigmoid and polynomial kernels were used to classify the audio as either speech or non-speech based on faces from realistic videos. A speech envelope was calculated and resampled to 4 Hz. Based on a threshold of the envelope, the ground truth was created and each audio data point was labelled as either speech or non-speech. Convolutional neural networks using max-margin object detection were used to extract facial landmarks from the videos. Six different video features were calculated and used: the mouth opening distances, the variance of the mouth opening distances, the difference of mouth opening distances between several frames, the mouth area, the variance of the area, and the difference of area between several frames.
The mean accuracy for speech activity in the monologues was low, probably due to the unbalanced data in the monologues, since most data points in the ground truth were classified as speech. For the dialogues, the accuracy was slightly higher than that of classifying everything as the most frequent class. The variance of the mouth area was the best performing feature. The performance varied between the videos, and combining the best mouth opening distances feature with the best mouth area feature for the two best kernels increased the accuracy for the best performing videos.
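The pipeline described in the abstract (envelope thresholding for ground truth, mouth-based features, SVM classification) can be sketched as follows. This is a minimal illustration with synthetic data, not the thesis code: the 0.5 threshold, the simulated audio, and the simulated mouth features are all assumptions made here for demonstration.

```python
import numpy as np
from scipy.signal import hilbert, resample
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# --- Synthetic stand-in for an audio track (the thesis used real videos) ---
fs = 8000                     # sample rate in Hz (assumed)
t = np.arange(0, 10, 1 / fs)  # 10 s of audio
# Alternate 1 s of "speech" (a tone) and 1 s of near-silence (low noise).
speech_mask = (t.astype(int) % 2 == 0)
audio = np.where(speech_mask,
                 np.sin(2 * np.pi * 200 * t),
                 0.01 * rng.standard_normal(t.size))

# --- Speech envelope, resampled to 4 Hz and thresholded into ground truth ---
envelope = np.abs(hilbert(audio))           # amplitude envelope
env_4hz = resample(envelope, 10 * 4)        # 4 samples per second
threshold = 0.5 * env_4hz.max()             # threshold choice is an assumption
labels = (env_4hz > threshold).astype(int)  # 1 = speech, 0 = non-speech

# --- Hypothetical mouth features aligned to the 4 Hz label rate ---
# In the thesis these came from CNN facial landmarks; here they are simulated
# so that mouth-opening variance and mouth area are larger while speaking.
n = labels.size
mouth_var = np.where(labels == 1, rng.normal(2.0, 0.3, n), rng.normal(0.5, 0.3, n))
mouth_area = np.where(labels == 1, rng.normal(5.0, 0.5, n), rng.normal(3.0, 0.5, n))
X = np.column_stack([mouth_var, mouth_area])

# --- SVM with an RBF kernel (the thesis also tried linear, sigmoid, polynomial) ---
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.3, random_state=0)
clf = SVC(kernel="rbf").fit(X_tr, y_tr)
print(f"test accuracy: {clf.score(X_te, y_te):.2f}")
```

Swapping `kernel="rbf"` for `"linear"`, `"sigmoid"`, or `"poly"` reproduces the kernel comparison the thesis performed.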
author: Andersson, Viktor and Ostréus, Nelly
year: 2022
type: H3 - Professional qualifications (4 Years - )
report number: TFRT-6171
ISSN: 0280-5316
language: English
id: 9094781
date added to LUP: 2022-08-12 09:55:36
date last changed: 2022-08-12 09:55:36
@misc{9094781,
  abstract     = {{Speech is an important means of communication all over the world. Speech information is encoded both aurally and visually. More than 1.5 billion people have hearing loss, and for them the visual information is even more important than it is for people with normal hearing. Lip reading is therefore an important research topic.
 In this master's thesis, machine learning algorithms were used to identify speech activity in realistic videos with monologues and dialogues. Each video contained three persons speaking: one performing a monologue and two performing a dialogue. Support vector machines with linear, radial basis function, sigmoid and polynomial kernels were used to classify the audio as either speech or non-speech based on faces from realistic videos. A speech envelope was calculated and resampled to 4 Hz. Based on a threshold of the envelope, the ground truth was created and each audio data point was labelled as either speech or non-speech. Convolutional neural networks using max-margin object detection were used to extract facial landmarks from the videos. Six different video features were calculated and used: the mouth opening distances, the variance of the mouth opening distances, the difference of mouth opening distances between several frames, the mouth area, the variance of the area, and the difference of area between several frames.
 The mean accuracy for speech activity in the monologues was low, probably due to the unbalanced data in the monologues, since most data points in the ground truth were classified as speech. For the dialogues, the accuracy was slightly higher than that of classifying everything as the most frequent class. The variance of the mouth area was the best performing feature. The performance varied between the videos, and combining the best mouth opening distances feature with the best mouth area feature for the two best kernels increased the accuracy for the best performing videos.}},
  author       = {{Andersson, Viktor and Ostréus, Nelly}},
  issn         = {{0280-5316}},
  language     = {{eng}},
  note         = {{Student Paper}},
  title        = {{Speech activity detection in videos}},
  year         = {{2022}},
}