
LUP Student Papers

LUND UNIVERSITY LIBRARIES

Speech activity detection in videos

Andersson, Viktor and Ostréus, Nelly (2022)
Department of Automatic Control
Abstract
Speech is an important means of communication all over the world. Speech information is encoded both aurally and visually. More than 1.5 billion people have hearing loss, and for them the visual information is even more important than it is for people with normal hearing. Lip reading is therefore an important research topic.
In this master's thesis, machine learning algorithms were used to identify speech activity in realistic videos with monologues and dialogues. Each video contained three persons speaking: one performing a monologue and two performing a dialogue. Support vector machines with linear, radial basis function, sigmoid and polynomial kernels were used to classify the audio as either speech or non-speech based on faces from realistic videos. A speech envelope was calculated and resampled to 4 Hz. Based on a threshold of the envelope, the ground truth was created and each audio data point was labelled as either speech or non-speech. Convolutional neural networks using max-margin object detection were used to extract facial landmarks from the videos. Six different video features were calculated and used: the mouth opening distances, the variance of the mouth opening distances, the difference of mouth opening distances between several frames, the mouth area, the variance of the area, and the difference of area between several frames.
The mean accuracy for speech activity in the monologues was low, probably due to the unbalanced data in the monologues, since most data points in the ground truth were classified as speech. For the dialogues, the accuracy was slightly higher than that of classifying everything as the most frequent class. The variance of the mouth area was the best performing feature. The performance varied between the videos, and combining the best mouth opening distances feature with the best mouth area feature for the two best kernels increased the accuracy for the best performing videos.
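The pipeline described in the abstract (envelope thresholding for ground truth, mouth-based features, SVM classification) can be sketched as follows. This is a minimal illustration with synthetic data, not the thesis code: the 0.5 threshold, the simulated audio, and the simulated mouth features are all assumptions made here for demonstration.

```python
import numpy as np
from scipy.signal import hilbert, resample
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# --- Synthetic stand-in for an audio track (the thesis used real videos) ---
fs = 8000                     # sample rate in Hz (assumed)
t = np.arange(0, 10, 1 / fs)  # 10 s of audio
# Alternate 1 s of "speech" (a tone) and 1 s of near-silence (low noise).
speech_mask = (t.astype(int) % 2 == 0)
audio = np.where(speech_mask,
                 np.sin(2 * np.pi * 200 * t),
                 0.01 * rng.standard_normal(t.size))

# --- Speech envelope, resampled to 4 Hz and thresholded into ground truth ---
envelope = np.abs(hilbert(audio))           # amplitude envelope
env_4hz = resample(envelope, 10 * 4)        # 4 samples per second
threshold = 0.5 * env_4hz.max()             # threshold choice is an assumption
labels = (env_4hz > threshold).astype(int)  # 1 = speech, 0 = non-speech

# --- Hypothetical mouth features aligned to the 4 Hz label rate ---
# In the thesis these came from CNN facial landmarks; here they are simulated
# so that mouth-opening variance and mouth area are larger while speaking.
n = labels.size
mouth_var = np.where(labels == 1, rng.normal(2.0, 0.3, n), rng.normal(0.5, 0.3, n))
mouth_area = np.where(labels == 1, rng.normal(5.0, 0.5, n), rng.normal(3.0, 0.5, n))
X = np.column_stack([mouth_var, mouth_area])

# --- SVM with an RBF kernel (the thesis also tried linear, sigmoid, polynomial) ---
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.3, random_state=0)
clf = SVC(kernel="rbf").fit(X_tr, y_tr)
print(f"test accuracy: {clf.score(X_te, y_te):.2f}")
```

Swapping `kernel="rbf"` for `"linear"`, `"sigmoid"`, or `"poly"` reproduces the kernel comparison the thesis performed.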
author: Andersson, Viktor and Ostréus, Nelly
year: 2022
type: H3 - Professional qualifications (4 Years - )
report number: TFRT-6171
ISSN: 0280-5316
language: English
id: 9094781
date added to LUP: 2022-08-12 09:55:36
date last changed: 2022-08-12 09:55:36
@misc{9094781,
  abstract     = {{Speech is an important means of communication all over the world. Speech information is encoded both aurally and visually. More than 1.5 billion people have hearing loss, and for them the visual information is even more important than it is for people with normal hearing. Lip reading is therefore an important research topic.
 In this master's thesis, machine learning algorithms were used to identify speech activity in realistic videos with monologues and dialogues. Each video contained three persons speaking: one performing a monologue and two performing a dialogue. Support vector machines with linear, radial basis function, sigmoid and polynomial kernels were used to classify the audio as either speech or non-speech based on faces from realistic videos. A speech envelope was calculated and resampled to 4 Hz. Based on a threshold of the envelope, the ground truth was created and each audio data point was labelled as either speech or non-speech. Convolutional neural networks using max-margin object detection were used to extract facial landmarks from the videos. Six different video features were calculated and used: the mouth opening distances, the variance of the mouth opening distances, the difference of mouth opening distances between several frames, the mouth area, the variance of the area, and the difference of area between several frames.
 The mean accuracy for speech activity in the monologues was low, probably due to the unbalanced data in the monologues, since most data points in the ground truth were classified as speech. For the dialogues, the accuracy was slightly higher than that of classifying everything as the most frequent class. The variance of the mouth area was the best performing feature. The performance varied between the videos, and combining the best mouth opening distances feature with the best mouth area feature for the two best kernels increased the accuracy for the best performing videos.}},
  author       = {{Andersson, Viktor and Ostréus, Nelly}},
  issn         = {{0280-5316}},
  language     = {{eng}},
  note         = {{Student Paper}},
  title        = {{Speech activity detection in videos}},
  year         = {{2022}},
}