# LUP Student Papers

## LUND UNIVERSITY LIBRARIES

### Self-supervised monocular depth estimation for dynamic scenes

(2021) In Master's Theses in Mathematical Sciences FMAM05 20211
Mathematics (Faculty of Engineering)
#### Abstract
Estimating depth from an image is an ill-posed problem, since we project a three-dimensional scene onto a two-dimensional image. However, humans can estimate plausible depth with only a single eye, and we often rely on this for quick estimation of moving objects. Although machine learning has proven powerful for computer vision tasks, building deep learning models typically requires collecting large amounts of meaningful data. Acquiring per-pixel depth data for different scenes is expensive and tedious, and the resulting ground truth is sparse, so a naive regression or supervised approach does not scale in data-driven development.

Self-supervised depth estimation from monocular data instead uses only the frames of a camera stream: through view synthesis, the regression problem is recast as an image reconstruction task. The reconstruction objective then indirectly yields a per-pixel depth estimate from a single image. This approach is attractive for several reasons but relies on a number of assumptions. One of them is the static-scene assumption, which often fails on real-life data collected from a video camera mounted on a moving car in dynamic scenes. The goal of this thesis is to investigate self-supervised monocular depth approaches in dynamic scenes. Different methods that address the depth-estimation inaccuracies caused by moving objects are applied and evaluated.
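The view-synthesis idea the abstract describes can be sketched as follows. This is a minimal NumPy illustration, not code from the thesis: the function names (`warp_source_to_target`, `photometric_loss`) are assumptions, nearest-neighbour sampling stands in for the bilinear sampling used in practice, and the depth and relative pose are taken as given (in the self-supervised setting both are network predictions).

```python
import numpy as np

def warp_source_to_target(src_img, depth, K, R, t):
    """Synthesise the target view by sampling the source image at the
    pixel locations implied by the target depth map and relative pose.
    src_img: (H, W) grayscale source frame
    depth:   (H, W) per-pixel depth of the target frame
    K:       (3, 3) camera intrinsics
    R, t:    rotation (3, 3) and translation (3,) from target to source
    """
    H, W = depth.shape
    # Pixel grid in homogeneous coordinates.
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=0).reshape(3, -1).astype(float)
    # Back-project to 3-D using the depth, move into the source camera
    # frame, then project back to pixel coordinates.
    cam = np.linalg.inv(K) @ pix * depth.reshape(1, -1)
    cam_src = R @ cam + t[:, None]
    proj = K @ cam_src
    us = (proj[0] / proj[2]).reshape(H, W)
    vs = (proj[1] / proj[2]).reshape(H, W)
    # Nearest-neighbour sampling (bilinear in practice); coordinates
    # outside the image are clamped to the border.
    ui = np.clip(np.round(us).astype(int), 0, W - 1)
    vi = np.clip(np.round(vs).astype(int), 0, H - 1)
    return src_img[vi, ui]

def photometric_loss(target_img, synthesised):
    # L1 reconstruction error: this is the self-supervision signal that
    # indirectly trains the depth network.
    return np.abs(target_img - synthesised).mean()
```

With an identity pose and any depth map, the warp reproduces the source frame exactly, which is the sanity check the reconstruction objective rests on: correct depth and pose yield a low photometric error.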

The state-of-the-art self-supervised monocular depth approaches studied in this thesis all handle depth estimation through view synthesis. This creates a reliance on correct pixel correspondences from one frame to the next for the network to estimate depth accurately. Object motion in dynamic scenes makes the pixel mapping incorrect, which degrades the accuracy of the depth estimates in those areas. The best way to handle this was found to be removing the pixels where motion has occurred from the network's weight-update scheme, so that the network is not penalised for accurate depth estimates even where the pixel mapping was incorrect. The most efficient way to find these problematic areas was a forward-backward optical-flow consistency check.
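The forward-backward consistency check mentioned above can be sketched as a mask over the photometric loss. This is a hedged NumPy sketch under assumed names (`fb_consistency_mask`, `masked_loss` are illustrative, and the tolerance `tol` is a hypothetical parameter): for a static, unoccluded pixel the backward flow sampled at the forward-warped location should cancel the forward flow, so pixels where it does not are excluded from the weight update.

```python
import numpy as np

def fb_consistency_mask(flow_fwd, flow_bwd, tol=1.0):
    """Mask out pixels whose forward flow (t -> t+1) is not undone by
    the backward flow (t+1 -> t), a symptom of occlusion or motion.
    flow_fwd, flow_bwd: (H, W, 2) flow fields in (dx, dy) order.
    Returns a boolean (H, W) mask; True = consistent (keep in the loss).
    """
    H, W, _ = flow_fwd.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    # Where each pixel lands in frame t+1 under the forward flow
    # (nearest-neighbour lookup, clamped to the image border).
    xs = np.clip(np.round(u + flow_fwd[..., 0]).astype(int), 0, W - 1)
    ys = np.clip(np.round(v + flow_fwd[..., 1]).astype(int), 0, H - 1)
    # Backward flow sampled at the forward-warped location.
    bwd_at_fwd = flow_bwd[ys, xs]
    # For a consistent pixel the two flows cancel out.
    residual = np.linalg.norm(flow_fwd + bwd_at_fwd, axis=-1)
    return residual < tol

def masked_loss(per_pixel_error, mask):
    # Inconsistent (likely moving or occluded) pixels contribute
    # nothing to the averaged loss, and hence to the weight update.
    return (per_pixel_error * mask).sum() / max(mask.sum(), 1)
```

Multiplying the per-pixel photometric error by this mask is one way to realise the idea of removing motion-affected pixels from the update scheme; the thesis may combine it with other masking strategies.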
- author: Lindberg, Jonathan
- organization: Mathematics (Faculty of Engineering)
- course: FMAM05 20211
- year: 2021
- type: H2 - Master's Degree (Two Years)
- publication/series: Master's Theses in Mathematical Sciences
- report number: LUTFMA-3452-2021
- ISSN: 1404-6342
- other publication id: 2021:E35
- language: English
- id: 9060905
- date last changed: 2021-08-27 15:07:04
```bibtex
@misc{9060905,
abstract     = {{Estimating depth from an image is an ill-posed problem, since we project a three-dimensional scene onto a two-dimensional image. However, humans can estimate plausible depth with only a single eye, and we often rely on this for quick estimation of moving objects. Although machine learning has proven powerful for computer vision tasks, building deep learning models typically requires collecting large amounts of meaningful data. Acquiring per-pixel depth data for different scenes is expensive and tedious, and the resulting ground truth is sparse, so a naive regression or supervised approach does not scale in data-driven development.

Self-supervised depth estimation from monocular data instead uses only the frames of a camera stream: through view synthesis, the regression problem is recast as an image reconstruction task. The reconstruction objective then indirectly yields a per-pixel depth estimate from a single image. This approach is attractive for several reasons but relies on a number of assumptions. One of them is the static-scene assumption, which often fails on real-life data collected from a video camera mounted on a moving car in dynamic scenes. The goal of this thesis is to investigate self-supervised monocular depth approaches in dynamic scenes. Different methods that address the depth-estimation inaccuracies caused by moving objects are applied and evaluated.

The state-of-the-art self-supervised monocular depth approaches studied in this thesis all handle depth estimation through view synthesis. This creates a reliance on correct pixel correspondences from one frame to the next for the network to estimate depth accurately. Object motion in dynamic scenes makes the pixel mapping incorrect, which degrades the accuracy of the depth estimates in those areas. The best way to handle this was found to be removing the pixels where motion has occurred from the network's weight-update scheme, so that the network is not penalised for accurate depth estimates even where the pixel mapping was incorrect. The most efficient way to find these problematic areas was a forward-backward optical-flow consistency check.}},
author       = {{Lindberg, Jonathan}},
issn         = {{1404-6342}},
language     = {{eng}},
note         = {{Student Paper}},
series       = {{Master's Theses in Mathematical Sciences}},
title        = {{Self-supervised monocular depth estimation for dynamic scenes}},
year         = {{2021}},
}

```