Reinforcement Learning for Active Visual Perception
(2021)
- abstract
- Visual perception refers to automatically recognizing, detecting, or otherwise sensing the content of an image, video or scene. The most common contemporary approach to tackle a visual perception task is to train a deep neural network on a pre-existing dataset which provides examples of task success and failure. Despite remarkable recent progress across a wide range of vision tasks, many standard methodologies are static in that they lack mechanisms for adapting to the particular settings or constraints of the task at hand. The ability to adapt is desirable in many practical scenarios, since the operating regime often differs from the training setup. For example, a robot which has learnt to recognize a static set of training images may perform poorly in real-world settings, where it may view objects from unusual angles or explore poorly illuminated environments. The robot should then ideally be able to actively position itself to observe the scene from viewpoints where it is more confident, or refine its perception with only a limited amount of training data for its present operating conditions.
In this thesis we demonstrate how reinforcement learning (RL) can be integrated with three fundamental visual perception tasks -- object detection, human pose estimation, and semantic segmentation -- in order to make the resulting pipelines more adaptive, accurate and/or faster. In the first part we provide object detectors with the capacity to actively select which parts of a given image to analyze and when to terminate the detection process. Several ideas are proposed and empirically evaluated, such as explicitly including the speed-accuracy trade-off in the training process, which makes it possible to specify this trade-off during inference. In the second part we consider active multi-view 3d human pose estimation in complex scenarios with multiple people. We explore this in two different contexts: i) active triangulation, which requires carefully observing each body joint from multiple viewpoints, and ii) active viewpoint selection for monocular 3d estimators, which requires considering which viewpoints yield accurate fused estimates when combined. In both settings the viewpoint selection systems face several challenges, such as partial observability resulting e.g. from occlusions. We show that RL-based methods outperform heuristic ones in accuracy, with negligible computational overhead. Finally, the thesis concludes by establishing a framework for embodied visual active learning in the context of semantic segmentation, where an agent should explore a 3d environment and actively query annotations to refine its visual perception. Our empirical results suggest that reinforcement learning can be successfully applied within this framework as well.
Please use this url to cite or link to this publication:
https://lup.lub.lu.se/record/6065e35e-b97b-44b8-97b0-a04fe3862a13
- author
- Pirinen, Aleksis
- supervisor
- opponent
- Prof. Kjellström, Hedvig, KTH Royal Institute of Technology, Sweden.
- organization
- alternative title
- Aktiv visuell perception via förstärkningsinlärning
- publishing date
- 2021
- type
- Thesis
- publication status
- published
- subject
- keywords
- computer vision, reinforcement learning, deep learning, active vision, object detection, human pose estimation, semantic segmentation
- pages
- 219 pages
- publisher
- Lund University / Centre for Mathematical Sciences / LTH
- defense location
- Lecture hall MH:Hörmander, Centre of Mathematical Sciences, Sölvegatan 18, Faculty of Engineering LTH, Lund University, Lund. Zoom: https://lu-se.zoom.us/j/67213391794?pwd=WE1ZOE9KNlZIbTZvYnFhSlVqWU1tZz09
- defense date
- 2021-06-10 13:15:00
- ISBN
- 978-91-7895-796-5
- 978-91-7895-795-8
- language
- English
- LU publication?
- yes
- id
- 6065e35e-b97b-44b8-97b0-a04fe3862a13
- date added to LUP
- 2021-05-12 16:54:06
- date last changed
- 2022-04-07 09:39:20
@phdthesis{6065e35e-b97b-44b8-97b0-a04fe3862a13,
  author    = {{Pirinen, Aleksis}},
  title     = {{Reinforcement Learning for Active Visual Perception}},
  school    = {{Lund University}},
  publisher = {{Lund University / Centre for Mathematical Sciences / LTH}},
  year      = {{2021}},
  isbn      = {{978-91-7895-796-5}},
  language  = {{eng}},
  keywords  = {{computer vision; reinforcement learning; deep learning; active vision; object detection; human pose estimation; semantic segmentation}},
  url       = {{https://lup.lub.lu.se/search/files/97743536/aleksis_phd_thesis.pdf}},
}