Deep Neural Networks for Dynamic Visual Data
(2016) In Master's Thesis in Mathematical Sciences FMA820 20161
Mathematics (Faculty of Engineering)
 Abstract
 Given monocular video of people performing daily tasks, our objective is to estimate the 3D positions of 32 given joints of the human skeleton. Motivated by the success of deep convolutional networks in image classification, image segmentation and activity recognition, we propose to estimate 3D joint positions from video using deep convolutional networks. The modeling is carried out within the framework of convolutional neural networks and is based on the Caffe deep learning framework. We use the architecture and the pretrained weights of the convolutional layers of VGG16, a network developed by the Oxford Visual Geometry Group. The effect of different feature extraction architectures on the model’s accuracy was studied by varying the number of pooling layers. A decreased number of pooling layers did not improve the accuracy of the model. We also studied the effect of varying the output dimension by varying the number of joints estimated simultaneously. Our findings indicate that increasing the number of estimated joint positions does not change model accuracy. Finally, the effect of incorporating temporal dependencies by means of Long Short-Term Memory (LSTM) units in the model was studied.
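 As a rough illustration of the two quantities varied in the abstract, the sketch below computes how the spatial size of a feature map shrinks with the number of pooling layers, and how the regression output dimension grows with the number of joints. The function names, the 2x2 stride-2 pooling, and the 224-pixel input side are assumptions made for illustration (224 is the standard VGG16 input size, but the abstract does not state the input size used in the thesis). This is not the thesis's Caffe model.

```python
def feature_map_side(input_side: int, num_pool_layers: int) -> int:
    """Side length of the feature map after the given number of pooling layers.

    Assumes each pooling layer is 2x2 with stride 2 and that the
    convolutions in between preserve spatial size, so only pooling
    shrinks the map.
    """
    side = input_side
    for _ in range(num_pool_layers):
        side //= 2  # each pooling layer halves the side length
    return side


def output_dimension(num_joints: int, coords_per_joint: int = 3) -> int:
    """Number of regression outputs when every joint has (x, y, z)."""
    return num_joints * coords_per_joint


if __name__ == "__main__":
    # Hypothetical 224-pixel input; fewer pooling layers keep a larger map.
    for pools in (3, 4, 5):
        print(pools, "pooling layers:", feature_map_side(224, pools), "pixels per side")
    # Estimating all 32 joints at once versus a single joint.
    print("32 joints:", output_dimension(32), "outputs")
    print("1 joint:", output_dimension(1), "outputs")
```

 Fewer pooling layers leave a larger feature map for the later layers to process, and more joints mean a wider output layer; the abstract reports how each of these choices affected accuracy.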
 Popular Abstract
 Gaming platforms such as Kinect have mastered the positioning of the human skeleton by using specialized sensors. Removing the need for specialized hardware would make pose estimation usable on any device with a video camera. This thesis discusses a possible method for pose estimation from video.
Estimating the distance to objects from a single image can be a hard task, yet it is one mastered by people with partially reduced vision in one eye. The human mind can thus learn to estimate the distance to objects from a single image. The question is whether computers can do the same.
To answer the question we turned to a mathematical method called Artificial Neural Networks (ANNs). As the name suggests, the model resembles the human nervous system. The model is capable of learning the relationship between the input and the output. The model’s learning capabilities rely on the model architecture and on a number of parameters that are tuned during the engineering phase. We attempted to find a model architecture that enables the model to learn the 3D positions of human joints from video.
ANNs are made up of small units called neurons. The neurons are stacked on top of each other, creating layers. A network with around 5 or more layers is called deep.
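The idea of neurons grouped into stacked layers can be sketched in a few lines of plain Python. This is only a toy illustration under stated assumptions: the function names, the random weights, the layer sizes and the ReLU activation are all hypothetical choices, and the network is untrained. It is not the model used in the thesis.

```python
import random


def relu(x: float) -> float:
    """Activation function: pass positive values, clamp negatives to zero."""
    return x if x > 0.0 else 0.0


def make_layer(n_inputs: int, n_neurons: int, rng: random.Random):
    """A layer is a list of neurons; each neuron is (weights, bias)."""
    return [
        ([rng.uniform(-1.0, 1.0) for _ in range(n_inputs)], 0.0)
        for _ in range(n_neurons)
    ]


def build_network(layer_sizes, seed: int = 0):
    """Stack layers so that each layer's output feeds the next layer."""
    rng = random.Random(seed)
    return [
        make_layer(n_in, n_out, rng)
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    ]


def forward(layers, inputs):
    """Pass the inputs through every layer in turn."""
    activations = list(inputs)
    for layer in layers:
        activations = [
            relu(sum(w * a for w, a in zip(weights, activations)) + bias)
            for weights, bias in layer
        ]
    return activations


if __name__ == "__main__":
    # 4 inputs, four hidden layers of 8 neurons, 3 outputs: 5 layers in total.
    network = build_network([4, 8, 8, 8, 8, 3])
    print("number of layers:", len(network))
    print("output:", forward(network, [0.1, 0.2, 0.3, 0.4]))
```

For simplicity the sketch applies the same activation to the output layer; a network that regresses coordinates would normally leave its final layer linear.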
In recent years deep ANNs have been shown to have good learning capabilities in object recognition from images and activity recognition from videos. In the thesis we adjusted the Oxford Visual Geometry Group’s 16-layer network for image recognition in a number of ways for pose estimation. The model was trained and tested on the Human3.6M dataset. The dataset contains videos of 11 actors performing daily tasks, together with the exact 3D poses of the actors throughout the videos. In total the dataset contains 3.6 million frames.
In the small-scale tests the number of layers in the model was decreased. Decreasing the number of layers led to lower accuracy. Three models were built, each estimating a different number of joints. The model estimating the positions of all joints achieved accuracy similar to that of a model estimating the positions of the joints in the arms. The model predicting a single joint’s position outperformed the other models, but since it takes more than 15 weeks to build the model it is not usable in practice.
Finally, the model predicting the pose of the full skeleton was built on the large-scale dataset. The model achieved a test error of around 30 cm per joint. The best accuracy achieved on the given dataset is 13 cm, so the proposed model did not achieve outstanding accuracy. It did, however, show that deep ANNs can be applied to pose estimation, and with further work they may outperform other methods.
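A per-joint error of this kind can be computed as sketched below. The text does not say exactly how the error was measured, so the definition here (the mean Euclidean distance between predicted and true joint positions, with coordinates in centimetres) is an assumption, and the function name and sample coordinates are hypothetical.

```python
import math


def mean_joint_error(predicted, actual):
    """Mean Euclidean distance between matching predicted and true joints.

    Both arguments are sequences of (x, y, z) positions in the same unit;
    the result is in that unit as well.
    """
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty pose lists of equal length")
    total = 0.0
    for p, a in zip(predicted, actual):
        total += math.dist(p, a)  # straight-line distance for one joint
    return total / len(predicted)


if __name__ == "__main__":
    # Made-up two-joint pose, coordinates in centimetres.
    predicted = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
    actual = [(3.0, 4.0, 0.0), (10.0, 0.0, 12.0)]
    # The joints are off by 5 cm and 12 cm, so the mean error is 8.5 cm.
    print(mean_joint_error(predicted, actual))
```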
Please use this url to cite or link to this publication:
http://lup.lub.lu.se/studentpapers/record/8883262
 author
 Priisalu, Maria ^{LU}
 supervisor

 Cristian Sminchisescu ^{LU}
 organization
 alternative title
 Deep Model for Human Pose Estimation from Video
 course
 FMA820 20161
 year
 2016
 type
 H2  Master's Degree (Two Years)
 subject
 keywords
 Deep Learning, Human pose estimation, Human motion estimation, 3D reconstruction, Long Short-Term Memory (LSTM)
 publication/series
 Master's Thesis in Mathematical Sciences
 report number
 LUFTMA32992016
 ISSN
 1404-6342
 other publication id
 2016:E26
 language
 English
 id
 8883262
 date added to LUP
 2016-08-25 13:19:03
 date last changed
 2016-08-25 13:19:03
@misc{8883262,
  abstract = {Given monocular video of people performing daily tasks, our objective is to estimate the 3D positions of 32 given joints of the human skeleton. Motivated by the success of deep convolutional networks in image classification, image segmentation and activity recognition, we propose to estimate 3D joint positions from video using deep convolutional networks. The modeling is carried out within the framework of convolutional neural networks and is based on the Caffe deep learning framework. We use the architecture and the pretrained weights of the convolutional layers of VGG16, a network developed by the Oxford Visual Geometry Group. The effect of different feature extraction architectures on the model’s accuracy was studied by varying the number of pooling layers. A decreased number of pooling layers did not improve the accuracy of the model. We also studied the effect of varying the output dimension by varying the number of joints estimated simultaneously. Our findings indicate that increasing the number of estimated joint positions does not change model accuracy. Finally, the effect of incorporating temporal dependencies by means of Long Short-Term Memory (LSTM) units in the model was studied.},
  author = {Priisalu, Maria},
  issn = {1404-6342},
  keyword = {Deep Learning,Human pose estimation,Human motion estimation,3D reconstruction,Long Short-Term Memory (LSTM)},
  language = {eng},
  note = {Student Paper},
  series = {Master's Thesis in Mathematical Sciences},
  title = {Deep Neural Networks for Dynamic Visual Data},
  year = {2016},
}