LUP Student Papers

LUND UNIVERSITY LIBRARIES

Automatic Virtual Tracking in a Multi-Participant Scenario for a Mobile Device

Skoog, Emelie and Freberg, Elin (2022)
Department of Automatic Control
Abstract
The aim of this thesis is to identify the speaker in a video stream using only the camera of a smartphone. This enables a mobile application to track the speaking participant during a conference call. Conference systems that include this feature use a microphone array to localize sound and thereby identify the speaker. Mobile phones, however, are generally not equipped with a microphone array and therefore cannot locate the source of the sound. Instead, this thesis investigates how the speaker can be determined without using audio input. One method suggests that the speaker can be identified by detecting upper-body motion, based on the hypothesis that people often gesture while speaking. This method was implemented as a mobile application and its accuracy was tested. The application uses the camera of a mobile device for input frames and the built-in face detector to retrieve the positions of the participants. The upper-body bounds of each participant were estimated from the face bounds returned by the face detector. The application then detects and compares motion by calculating the optical flow within each of these bounds. The participant with the most movement is taken to be the speaker; if movement is below a certain threshold, the application treats this as if no one is speaking. Finally, this approach was evaluated against a transcribed data set of videos with two participants, and some variations were investigated with the aim of increasing the accuracy. The application determined the speaker correctly about 50% of the testing time. However, some alterations slightly improved the results, which indicates that a larger parameter search could improve the results further.
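The selection logic the abstract describes — expand each detected face box to an upper-body region, score motion in each region, pick the participant with the most motion, and report no speaker when all scores fall below a threshold — can be sketched as follows. This is a minimal NumPy illustration, not the thesis's implementation: it uses mean absolute frame differencing as a lightweight stand-in for the thesis's optical-flow computation, and all function names, the box-expansion factor, and the threshold value are illustrative assumptions.

```python
import numpy as np

def upper_body_box(face_box, frame_shape, scale=2.0):
    """Expand a face bounding box (x, y, w, h) downward and sideways
    to approximate the upper-body region, clipped to the frame.
    The scale factor of 2.0 is an illustrative assumption."""
    x, y, w, h = face_box
    H, W = frame_shape[:2]
    nx = max(0, int(x - w * (scale - 1) / 2))
    nw = min(W - nx, int(w * scale))
    nh = min(H - y, int(h * scale))
    return nx, y, nw, nh

def pick_speaker(prev_gray, curr_gray, face_boxes, threshold=2.0):
    """Return the index of the participant with the most upper-body
    motion between two grayscale frames, or None if every motion
    score is below the threshold (i.e. no one appears to be speaking)."""
    scores = []
    for box in face_boxes:
        x, y, w, h = upper_body_box(box, curr_gray.shape)
        prev_roi = prev_gray[y:y + h, x:x + w].astype(np.float32)
        curr_roi = curr_gray[y:y + h, x:x + w].astype(np.float32)
        # Mean absolute frame difference as a cheap motion-energy proxy
        # for the per-region optical flow described in the abstract.
        scores.append(float(np.abs(curr_roi - prev_roi).mean()))
    best = int(np.argmax(scores))
    return best if scores[best] >= threshold else None
```

In the thesis's setting the face boxes would come from the phone's built-in face detector and the motion score from optical flow over consecutive camera frames; the thresholded argmax structure is the part this sketch is meant to show.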
author: Skoog, Emelie and Freberg, Elin
supervisor:
organization: Department of Automatic Control
year: 2022
type: H3 - Professional qualifications (4 Years - )
subject:
report number: TFRT-6179
ISSN: 0280-5316
language: English
id: 9097048
date added to LUP: 2022-08-12 09:46:23
date last changed: 2022-08-12 09:46:23
@misc{9097048,
  abstract     = {{The aim of this thesis is to identify the speaker in a video stream using only the camera of a smartphone. This enables a mobile application to track the speaking participant during a conference call. Conference systems that include this feature use a microphone array to localize sound and thereby identify the speaker. Mobile phones, however, are generally not equipped with a microphone array and therefore cannot locate the source of the sound. Instead, this thesis investigates how the speaker can be determined without using audio input. One method suggests that the speaker can be identified by detecting upper-body motion, based on the hypothesis that people often gesture while speaking. This method was implemented as a mobile application and its accuracy was tested. The application uses the camera of a mobile device for input frames and the built-in face detector to retrieve the positions of the participants. The upper-body bounds of each participant were estimated from the face bounds returned by the face detector. The application then detects and compares motion by calculating the optical flow within each of these bounds. The participant with the most movement is taken to be the speaker; if movement is below a certain threshold, the application treats this as if no one is speaking. Finally, this approach was evaluated against a transcribed data set of videos with two participants, and some variations were investigated with the aim of increasing the accuracy. The application determined the speaker correctly about 50% of the testing time. However, some alterations slightly improved the results, which indicates that a larger parameter search could improve the results further.}},
  author       = {{Skoog, Emelie and Freberg, Elin}},
  issn         = {{0280-5316}},
  language     = {{eng}},
  note         = {{Student Paper}},
  title        = {{Automatic Virtual Tracking in a Multi-Participant Scenario for a Mobile Device}},
  year         = {{2022}},
}