Automatic Virtual Tracking in a Multi-Participant Scenario for a Mobile Device
(2022) Department of Automatic Control
- Abstract
- The aim of this thesis is to identify the speaker in a video stream using only the camera of a smartphone. This creates the opportunity for a mobile application to track the speaking participant during a conference call. Conference systems that include this feature use a microphone array to localize sound and determine who the speaker is. Mobile phones, however, are generally not equipped with a microphone array and therefore cannot recognize the source of the sound. Instead, this thesis investigates how the speaker can be determined without using audio input. One method suggests that the speaker can be identified by detecting upper-body motion, based on the hypothesis that people often gesture while speaking. This method was implemented as a mobile application and its accuracy was tested. The application uses the camera of a mobile device for input frames and the built-in face detector to retrieve the positions of the participants. The bounds of the upper body for each participant were estimated from the face bounds provided by the face detector. The application then detects and compares motion by calculating the optical flow within each of these bounds. The participant with the most movement is taken to be the speaker; if movement is below a certain threshold, the application treats this as if no one is speaking. Lastly, this approach was evaluated against a transcribed dataset of videos with two participants. Some variations were also investigated with the aim of increasing the accuracy. The application determined the speaker correctly about 50% of the testing time. However, some alterations did improve the results slightly, which indicates that a larger parameter search could increase the result further.
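The selection logic described in the abstract can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the scale factors for estimating the upper-body box from a face box are illustrative guesses, and mean absolute frame differencing stands in for the optical-flow magnitude the thesis computes.

```python
import numpy as np

def upper_body_bounds(face, frame_shape, width_scale=2.0, height_scale=3.0):
    """Estimate an upper-body box from a face box (x, y, w, h).
    The scale factors are hypothetical, not values from the thesis."""
    x, y, w, h = face
    bw, bh = int(w * width_scale), int(h * height_scale)
    bx = max(0, x - (bw - w) // 2)   # widen symmetrically around the face
    by = y                           # start at the face top, extend downward
    bx2 = min(frame_shape[1], bx + bw)
    by2 = min(frame_shape[0], by + bh)
    return bx, by, bx2, by2

def pick_speaker(prev_frame, frame, faces, threshold=5.0):
    """Return the index of the participant with the most motion in their
    upper-body region, or None if every score is below the threshold.
    Mean absolute frame difference stands in for optical-flow magnitude."""
    scores = []
    for face in faces:
        x1, y1, x2, y2 = upper_body_bounds(face, frame.shape)
        diff = np.abs(frame[y1:y2, x1:x2].astype(float) -
                      prev_frame[y1:y2, x1:x2].astype(float))
        scores.append(diff.mean())
    best = int(np.argmax(scores))
    return best if scores[best] >= threshold else None
```

In use, `faces` would come from the device's built-in face detector for each frame pair; returning `None` corresponds to the abstract's "no one is speaking" case.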
Please use this url to cite or link to this publication:
http://lup.lub.lu.se/student-papers/record/9097048
- author
- Skoog, Emelie and Freberg, Elin
- supervisor
- organization
- year
- 2022
- type
- H3 - Professional qualifications (4 Years - )
- subject
- report number
- TFRT-6179
- ISSN
- 0280-5316
- language
- English
- id
- 9097048
- date added to LUP
- 2022-08-12 09:46:23
- date last changed
- 2022-08-12 09:46:23
@misc{9097048,
  abstract  = {{The aim of this thesis is to identify the speaker in a video stream using only the camera of a smartphone. This creates the opportunity for a mobile application to track the speaking participant during a conference call. Conference systems that include this feature use a microphone array to localize sound and determine who the speaker is. Mobile phones, however, are generally not equipped with a microphone array and therefore cannot recognize the source of the sound. Instead, this thesis investigates how the speaker can be determined without using audio input. One method suggests that the speaker can be identified by detecting upper-body motion, based on the hypothesis that people often gesture while speaking. This method was implemented as a mobile application and its accuracy was tested. The application uses the camera of a mobile device for input frames and the built-in face detector to retrieve the positions of the participants. The bounds of the upper body for each participant were estimated from the face bounds provided by the face detector. The application then detects and compares motion by calculating the optical flow within each of these bounds. The participant with the most movement is taken to be the speaker; if movement is below a certain threshold, the application treats this as if no one is speaking. Lastly, this approach was evaluated against a transcribed dataset of videos with two participants. Some variations were also investigated with the aim of increasing the accuracy. The application determined the speaker correctly about 50% of the testing time. However, some alterations did improve the results slightly, which indicates that a larger parameter search could increase the result further.}},
  author    = {{Skoog, Emelie and Freberg, Elin}},
  issn      = {{0280-5316}},
  language  = {{eng}},
  note      = {{Student Paper}},
  title     = {{Automatic Virtual Tracking in a Multi-Participant Scenario for a Mobile Device}},
  year      = {{2022}},
}