Deep Neural Networks for Dynamic Visual Data

Priisalu, Maria LU (2016) In Master's Thesis in Mathematical Sciences FMA820 20161
Mathematics (Faculty of Engineering)
Abstract
Given monocular video of people performing daily tasks, our objective is to estimate the 3D positions of 32 given joints associated with the human skeleton. Motivated by the success of deep convolutional networks in image classification, image segmentation and activity recognition, we propose to estimate 3D joint positions from video using deep convolutional networks. The modeling is carried out within the framework of convolutional neural networks and is based on the Caffe deep learning framework. We use the architecture and the pre-trained weights of the convolutional layers of VGG-16, a network developed by the Oxford Visual Geometry Group. The effect of different feature extraction architectures on the model’s accuracy was studied by varying the number of pooling layers. A decreased number of pooling layers did not improve the accuracy of the model. We also studied the effect of varying the output dimension by varying the number of joints estimated simultaneously. Our findings indicate that increasing the number of estimated joint positions does not change model accuracy. Finally, the effect of incorporating temporal dependencies by means of Long Short-Term Memory (LSTM) units in the model was studied.
Popular Abstract
Gaming platforms such as Kinect have mastered the positioning of the human skeleton by using specialized sensors. If the need for specialized hardware were removed, pose estimation could be used on any device with a video camera. This thesis discusses a possible method for pose estimation from video.
Estimating the distance to objects from a single image can be a hard task. It is, however, a task mastered by those with partially reduced vision in one eye, so it is known that the human mind can learn to estimate the distance to objects from a single image. The question is whether computers are capable of doing so as well.
To answer the question, we turned to a mathematical method called Artificial Neural Networks (ANN). As the name suggests, the model resembles the human nervous system. The model is capable of learning the relationship between the input and the output. The model’s learning capabilities rely on the model architecture and a number of parameters that are tuned during the engineering phase. We attempted to find a model architecture that enables the model to learn the 3D positions of human joints from video.
ANNs are made up of small units called neurons. The neurons are stacked on top of each other, creating layers. A network with around 5 or more layers is called deep.
In recent years, deep ANNs have been shown to have good learning capabilities in object recognition from images and activity recognition from videos. In the thesis, we adjusted the Oxford Visual Geometry Group’s 16-layer network for image recognition in a number of ways for pose estimation. The model was trained and tested on the Human3.6M dataset. The dataset contains videos of 11 actors performing daily tasks, together with the exact 3D poses of the actors throughout the videos. In total, the dataset contains 3.6 million frames.
In the small-scale tests, the number of layers in the model was decreased. It was noted that decreasing the number of layers led to lower accuracy. Three models were built, each estimating a different number of joints. The model estimating the positions of all of the joints achieved similar accuracy to a model estimating the positions of the joints in the arms. The model predicting a single joint’s position outperformed the other models, but since it takes more than 15 weeks to build the model, it is not usable in practice.
Finally, the model predicting the pose of the full skeleton was built on the large-scale dataset. The model achieved a test error of around 30 cm per joint. The best accuracy achieved on the given dataset is 13 cm, so the proposed model did not achieve outstanding accuracy. It did, however, show that deep ANNs can be applied to pose estimation, and with further work they may outperform other methods.
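The record does not say how the test error per joint is computed. A common choice for 3D pose estimation is the mean Euclidean distance between predicted and true joint positions, and the following minimal sketch assumes that definition. The function name and the two-joint example pose are hypothetical illustrations, not code or data from the thesis.

```python
import math


def mean_per_joint_error(predicted, actual):
    """Average Euclidean distance between matching 3D joints.

    Assumed metric: the result is in the same unit as the input coordinates.
    """
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty poses with the same number of joints")
    distances = [math.dist(p, a) for p, a in zip(predicted, actual)]
    return sum(distances) / len(distances)


# Hypothetical two-joint pose in centimetres (made-up values for illustration).
actual = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
predicted = [(3.0, 4.0, 0.0), (10.0, 0.0, 12.0)]

# The joints are off by 5 cm and 12 cm, so the mean error is 8.5 cm.
error_cm = mean_per_joint_error(predicted, actual)
print(f"mean per-joint error: {error_cm:.1f} cm")
```

Under this assumed metric, a full skeleton would simply pass 32 joint positions per pose, and the per-frame errors would be averaged over the test set.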
Please use this url to cite or link to this publication:
author
Priisalu, Maria LU
supervisor
organization
alternative title
Deep Model for Human Pose Estimation from Video
course
FMA820 20161
year
type
H2 - Master's Degree (Two Years)
subject
keywords
Deep Learning, Human pose estimation, Human motion estimation, 3D reconstruction, Long Short-Term Memory (LSTM)
publication/series
Master's Thesis in Mathematical Sciences
report number
LUFTMA-3299-2016
ISSN
1404-6342
other publication id
2016:E26
language
English
id
8883262
date added to LUP
2016-08-25 13:19:03
date last changed
2016-08-25 13:19:03
@misc{8883262,
  abstract     = {Given monocular video of people performing daily tasks, our objective is to estimate the 3D positions of 32 given joints associated with the human skeleton. Motivated by the success of deep convolutional networks in image classification, image segmentation and activity recognition, we propose to estimate 3D joint positions from video using deep convolutional networks. The modeling is carried out within the framework of convolutional neural networks and is based on the Caffe deep learning framework. We use the architecture and the pre-trained weights of the convolutional layers of VGG-16, a network developed by the Oxford Visual Geometry Group. The effect of different feature extraction architectures on the model’s accuracy was studied by varying the number of pooling layers. A decreased number of pooling layers did not improve the accuracy of the model. We also studied the effect of varying the output dimension by varying the number of joints estimated simultaneously. Our findings indicate that increasing the number of estimated joint positions does not change model accuracy. Finally, the effect of incorporating temporal dependencies by means of Long Short-Term Memory (LSTM) units in the model was studied.},
  author       = {Priisalu, Maria},
  issn         = {1404-6342},
  keyword      = {Deep Learning,Human pose estimation,Human motion estimation,3D reconstruction,Long Short-Term Memory (LSTM)},
  language     = {eng},
  note         = {Student Paper},
  series       = {Master's Thesis in Mathematical Sciences},
  title        = {Deep Neural Networks for Dynamic Visual Data},
  year         = {2016},
}