LUP Student Papers

LUND UNIVERSITY LIBRARIES

Automatic Virtual Tracking in a Multi-Participant Scenario for a Mobile Device

Skoog, Emelie and Freberg, Elin (2022)
Department of Automatic Control
Abstract
The aim of this thesis is to identify the speaker in a video stream using only the camera of a smartphone. This enables a mobile application to track the speaking participant during a conference call. Conference systems that include this feature use a microphone array to localize sound and thereby identify the speaker. Mobile phones, however, are generally not equipped with a microphone array and therefore cannot locate the source of the sound. Instead, this thesis investigates how the speaker can be determined without using audio input. One method suggests that the speaker can be identified by detecting upper-body motion, based on the hypothesis that people often gesture while speaking. This method was implemented as a mobile application and its accuracy was tested. The application uses the camera of a mobile device for input frames and the built-in face detector to retrieve the positions of the participants. The upper-body bounds of each participant were estimated from the face bounds returned by the face detector. The application then detects and compares motion by calculating the optical flow within each of these bounds. The participant with the most movement is taken to be the speaker; if movement is below a certain threshold, the application treats this as if no one is speaking. Finally, this approach was evaluated against a transcribed data set of videos with two participants, and some variations were investigated with the aim of increasing the accuracy. The application determined the speaker correctly about 50% of the testing time. However, some alterations slightly improved the results, which indicates that a larger parameter search could improve the results further.
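The selection logic the abstract describes — expand each detected face box to an upper-body region, score motion in each region, pick the participant with the most motion, and report no speaker when all scores fall below a threshold — can be sketched as follows. This is a minimal NumPy illustration, not the thesis's implementation: it uses mean absolute frame differencing as a lightweight stand-in for the thesis's optical-flow computation, and all function names, the box-expansion factor, and the threshold value are illustrative assumptions.

```python
import numpy as np

def upper_body_box(face_box, frame_shape, scale=2.0):
    """Expand a face bounding box (x, y, w, h) downward and sideways
    to approximate the upper-body region, clipped to the frame.
    The scale factor of 2.0 is an illustrative assumption."""
    x, y, w, h = face_box
    H, W = frame_shape[:2]
    nx = max(0, int(x - w * (scale - 1) / 2))
    nw = min(W - nx, int(w * scale))
    nh = min(H - y, int(h * scale))
    return nx, y, nw, nh

def pick_speaker(prev_gray, curr_gray, face_boxes, threshold=2.0):
    """Return the index of the participant with the most upper-body
    motion between two grayscale frames, or None if every motion
    score is below the threshold (i.e. no one appears to be speaking)."""
    scores = []
    for box in face_boxes:
        x, y, w, h = upper_body_box(box, curr_gray.shape)
        prev_roi = prev_gray[y:y + h, x:x + w].astype(np.float32)
        curr_roi = curr_gray[y:y + h, x:x + w].astype(np.float32)
        # Mean absolute frame difference as a cheap motion-energy proxy
        # for the per-region optical flow described in the abstract.
        scores.append(float(np.abs(curr_roi - prev_roi).mean()))
    best = int(np.argmax(scores))
    return best if scores[best] >= threshold else None
```

In the thesis's setting the face boxes would come from the phone's built-in face detector and the motion score from optical flow over consecutive camera frames; the thresholded argmax structure is the part this sketch is meant to show.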
author: Skoog, Emelie and Freberg, Elin
supervisor:
organization: Department of Automatic Control
year: 2022
type: H3 - Professional qualifications (4 Years - )
subject:
report number: TFRT-6179
ISSN: 0280-5316
language: English
id: 9097048
date added to LUP: 2022-08-12 09:46:23
date last changed: 2022-08-12 09:46:23
@misc{9097048,
  abstract     = {{The aim of this thesis is to identify the speaker in a video stream using only the camera of a smartphone. This enables a mobile application to track the speaking participant during a conference call. Conference systems that include this feature use a microphone array to localize sound and thereby identify the speaker. Mobile phones, however, are generally not equipped with a microphone array and therefore cannot locate the source of the sound. Instead, this thesis investigates how the speaker can be determined without using audio input. One method suggests that the speaker can be identified by detecting upper-body motion, based on the hypothesis that people often gesture while speaking. This method was implemented as a mobile application and its accuracy was tested. The application uses the camera of a mobile device for input frames and the built-in face detector to retrieve the positions of the participants. The upper-body bounds of each participant were estimated from the face bounds returned by the face detector. The application then detects and compares motion by calculating the optical flow within each of these bounds. The participant with the most movement is taken to be the speaker; if movement is below a certain threshold, the application treats this as if no one is speaking. Finally, this approach was evaluated against a transcribed data set of videos with two participants, and some variations were investigated with the aim of increasing the accuracy. The application determined the speaker correctly about 50% of the testing time. However, some alterations slightly improved the results, which indicates that a larger parameter search could improve the results further.}},
  author       = {{Skoog, Emelie and Freberg, Elin}},
  issn         = {{0280-5316}},
  language     = {{eng}},
  note         = {{Student Paper}},
  title        = {{Automatic Virtual Tracking in a Multi-Participant Scenario for a Mobile Device}},
  year         = {{2022}},
}