Self-supervised monocular depth estimation for dynamic scenes

Lindberg, Jonathan

Lindberg, Jonathan (LU) (2021) In Master's Theses in Mathematical Sciences, FMAM05 20211
Mathematics (Faculty of Engineering)

Abstract: Estimating depth from an image is an ill-posed problem, since a three-dimensional space is projected onto a two-dimensional image. However, humans can estimate a plausible depth using only a single eye, and we often rely on this for quick estimation of moving objects. Although machine learning has proven powerful in computer vision tasks, constructing deep learning models often requires collecting large amounts of meaningful data. Acquiring per-pixel depth data for different scenes is expensive and tedious, and the resulting data is sparse, so a naive approach based on regression or supervised models does not scale in data-driven development.

Self-supervised depth estimation from monocular data instead uses only data from a camera stream, estimating depth through view synthesis and thereby turning the regression problem into an image reconstruction task. The image reconstruction then leads indirectly to a per-pixel depth estimate from a single image. This approach avoids the need for ground-truth depth data but relies on several assumptions. One of these is the static scene assumption, which often fails on real-life data collected from a video camera mounted on a moving car in dynamic scenes. The goal of this thesis is to investigate self-supervised monocular depth approaches in dynamic scenes. Different methods that address the inaccuracies in depth estimation caused by moving objects are applied and evaluated.
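The view-synthesis step described above can be illustrated with a minimal sketch. It assumes a pinhole camera with known intrinsics `K` and a known rigid transform between the two camera poses; this is a common formulation in the literature, not necessarily the one used in the thesis, and the function name and parameters are hypothetical. Each target pixel is back-projected with its predicted depth, moved into the source camera frame, and projected again, which gives the source-image location to sample when reconstructing the target image.

```python
import numpy as np

def reproject_pixels(depth, K, T_target_to_source):
    """Map every target pixel to its location in the source view.

    depth: (H, W) predicted depth of the target frame
    K: (3, 3) camera intrinsics (assumed known, shared by both frames)
    T_target_to_source: (4, 4) rigid transform between the camera poses
    Returns an (H, W, 2) array of (x, y) source-image coordinates.
    """
    H, W = depth.shape
    xs, ys = np.meshgrid(np.arange(W), np.arange(H))
    # Homogeneous pixel coordinates, shape (3, H*W)
    pix = np.stack([xs.ravel(), ys.ravel(), np.ones(H * W)])
    # Back-project to 3D points in the target camera frame
    points = (np.linalg.inv(K) @ pix) * depth.ravel()
    # Move the points into the source camera frame
    points_h = np.vstack([points, np.ones((1, H * W))])
    points_src = (T_target_to_source @ points_h)[:3]
    # Project into the source image plane
    proj = K @ points_src
    coords = proj[:2] / proj[2:3]
    return coords.T.reshape(H, W, 2)
```

Under the static scene assumption, these coordinates point at the same scene point in both frames. A moving object violates that, so the sampled source pixel no longer matches the target pixel even when the depth is correct.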

The state-of-the-art self-supervised monocular depth approaches studied in this thesis handle depth estimation through view synthesis. This creates a reliance on correct pixel correspondence from one frame to another for the network to estimate depth accurately. Motion of objects in dynamic scenes makes the pixel mapping incorrect, which degrades the accuracy of depth estimation in those areas. The best way to handle this was found to be excluding the pixels where motion has occurred from the network's weight updates. This keeps the network from penalizing accurate depth estimates where the pixel mapping was incorrect. The most efficient way to find the problematic areas was found to be a forward-backward optical flow consistency check.
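As a rough illustration of the masking idea, the sketch below implements one simple form of a forward-backward optical flow consistency check and a masked reconstruction loss. It is a sketch under stated assumptions, not the thesis implementation: the flows are taken as given arrays, the lookup uses nearest-neighbour rounding rather than interpolation, and the function names and the `threshold` value are hypothetical.

```python
import numpy as np

def forward_backward_mask(flow_fw, flow_bw, threshold=1.0):
    """Return a boolean (H, W) mask that is True where the flows agree.

    flow_fw: (H, W, 2) flow from frame t to t+1, as (dx, dy) in pixels
    flow_bw: (H, W, 2) flow from frame t+1 back to frame t
    threshold: hypothetical tolerance in pixels
    """
    H, W, _ = flow_fw.shape
    xs, ys = np.meshgrid(np.arange(W), np.arange(H))
    # Where each pixel of frame t lands in frame t+1 (nearest neighbour)
    x2 = np.rint(xs + flow_fw[..., 0]).astype(int)
    y2 = np.rint(ys + flow_fw[..., 1]).astype(int)
    inside = (x2 >= 0) & (x2 < W) & (y2 >= 0) & (y2 < H)
    x2c = np.clip(x2, 0, W - 1)
    y2c = np.clip(y2, 0, H - 1)
    # Backward flow looked up at the landing position
    bw_at_target = flow_bw[y2c, x2c]
    # For consistent pixels the round trip returns to the start
    round_trip = np.linalg.norm(flow_fw + bw_at_target, axis=-1)
    return inside & (round_trip < threshold)

def masked_photometric_loss(target, reconstruction, mask):
    """Mean absolute error over the pixels kept by the mask."""
    error = np.abs(target - reconstruction)
    if error.ndim == 3:
        # Average over colour channels
        error = error.mean(axis=-1)
    kept = mask.sum()
    return float((error * mask).sum() / kept) if kept > 0 else 0.0
```

Pixels that fail the check contribute nothing to the loss, so they produce no gradient and cannot push the network away from a depth estimate that is correct but looks wrong because the object moved.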

- Open Access
- PDF

Related Materials

Related object is popular science:
Popular science summary

Please use this url to cite or link to this publication: http://lup.lub.lu.se/student-papers/record/9060905

author: Lindberg, Jonathan (LU)
supervisor: Amer Mustajbasic; Anders Heyden (LU)
organization: Mathematics (Faculty of Engineering)
course: FMAM05 20211
year: 2021
type: H2 - Master's Degree (Two Years)
subject: Mathematics and Statistics
publication/series: Master's Theses in Mathematical Sciences
report number: LUTFMA-3452-2021
ISSN: 1404-6342
other publication id: 2021:E35
language: English
id: 9060905
date added to LUP: 2021-08-27 15:07:04
date last changed: 2021-08-27 15:07:04

@misc{9060905,
  abstract     = {{Estimating depth from an image is an ill-posed problem, since a three-dimensional space is projected onto a two-dimensional image. However, humans can estimate a plausible depth using only a single eye, and we often rely on this for quick estimation of moving objects. Although machine learning has proven powerful in computer vision tasks, constructing deep learning models often requires collecting large amounts of meaningful data. Acquiring per-pixel depth data for different scenes is expensive and tedious, and the resulting data is sparse, so a naive approach based on regression or supervised models does not scale in data-driven development.

Self-supervised depth estimation from monocular data instead uses only data from a camera stream, estimating depth through view synthesis and thereby turning the regression problem into an image reconstruction task. The image reconstruction then leads indirectly to a per-pixel depth estimate from a single image. This approach avoids the need for ground-truth depth data but relies on several assumptions. One of these is the static scene assumption, which often fails on real-life data collected from a video camera mounted on a moving car in dynamic scenes. The goal of this thesis is to investigate self-supervised monocular depth approaches in dynamic scenes. Different methods that address the inaccuracies in depth estimation caused by moving objects are applied and evaluated.

The state-of-the-art self-supervised monocular depth approaches studied in this thesis handle depth estimation through view synthesis. This creates a reliance on correct pixel correspondence from one frame to another for the network to estimate depth accurately. Motion of objects in dynamic scenes makes the pixel mapping incorrect, which degrades the accuracy of depth estimation in those areas. The best way to handle this was found to be excluding the pixels where motion has occurred from the network's weight updates. This keeps the network from penalizing accurate depth estimates where the pixel mapping was incorrect. The most efficient way to find the problematic areas was found to be a forward-backward optical flow consistency check.}},
  author       = {{Lindberg, Jonathan}},
  issn         = {{1404-6342}},
  language     = {{eng}},
  note         = {{Student Paper}},
  series       = {{Master's Theses in Mathematical Sciences}},
  title        = {{Self-supervised monocular depth estimation for dynamic scenes}},
  year         = {{2021}},
}

LUP Student Papers

LUND UNIVERSITY LIBRARIES
