
Lund University Publications

LUND UNIVERSITY LIBRARIES

Reinforcement Learning for Active Visual Perception

Pirinen, Aleksis LU (2021)
Abstract
Visual perception refers to automatically recognizing, detecting, or otherwise sensing the content of an image, video or scene. The most common contemporary approach to a visual perception task is to train a deep neural network on a pre-existing dataset that provides examples of task success and failure. Despite remarkable recent progress across a wide range of vision tasks, many standard methodologies are static in that they lack mechanisms for adapting to the particular settings or constraints of the task at hand. The ability to adapt is desirable in many practical scenarios, since the operating regime often differs from the training setup. For example, a robot which has learnt to recognize a static set of training images may perform poorly in real-world settings, where it may view objects from unusual angles or explore poorly illuminated environments. Ideally, the robot should then be able to actively position itself to observe the scene from viewpoints where it is more confident, or refine its perception with only a limited amount of training data for its present operating conditions.

In this thesis we demonstrate how reinforcement learning (RL) can be integrated with three fundamental visual perception tasks -- object detection, human pose estimation, and semantic segmentation -- to make the resulting pipelines more adaptive, more accurate, and/or faster. In the first part we equip object detectors with the capacity to actively select which parts of a given image to analyze and when to terminate the detection process. Several ideas are proposed and empirically evaluated, such as explicitly including the speed-accuracy trade-off in the training process, which makes it possible to specify this trade-off during inference. In the second part we consider active multi-view 3D human pose estimation in complex scenarios with multiple people. We explore this in two contexts: i) active triangulation, which requires carefully observing each body joint from multiple viewpoints, and ii) active viewpoint selection for monocular 3D estimators, which requires considering which viewpoints yield accurate estimates when fused. In both settings the viewpoint selection systems face several challenges, such as partial observability resulting, e.g., from occlusions. We show that RL-based methods outperform heuristic ones in accuracy, with negligible computational overhead. Finally, the thesis concludes by establishing a framework for embodied visual active learning in the context of semantic segmentation, where an agent explores a 3D environment and actively queries annotations to refine its visual perception. Our empirical results suggest that reinforcement learning can be successfully applied within this framework as well.
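The abstract's first part mentions making the speed-accuracy trade-off explicit during training so it can be chosen at inference. The thesis's actual reward functions are not reproduced on this page; the following is a minimal hypothetical sketch (all names, numbers, and the greedy stopping rule are illustrative assumptions, not the thesis's method) of how a single per-region cost parameter can shift an active detector's operating point between thorough and fast search:

```python
def tradeoff_reward(accuracy: float, num_regions: int, region_cost: float) -> float:
    """Terminal reward balancing detection quality against computation.

    accuracy    -- quality of the final detections (e.g. mAP in [0, 1])
    num_regions -- how many image regions the agent chose to analyze
    region_cost -- per-region penalty; larger values favor faster search
    """
    return accuracy - region_cost * num_regions


def greedy_search(expected_gains: list[float], region_cost: float) -> tuple[float, int]:
    """Toy stand-in for a learned policy: analyze regions in order of
    expected accuracy gain, terminating once the next region's gain no
    longer exceeds its cost under the reward above."""
    accuracy, analyzed = 0.0, 0
    for gain in sorted(expected_gains, reverse=True):
        if gain <= region_cost:
            break  # stopping action: further analysis is not worth the time
        accuracy += gain
        analyzed += 1
    return accuracy, analyzed
```

With hypothetical per-region gains [0.4, 0.2, 0.1, 0.05], a low cost of 0.08 leads the sketch to analyze three regions, while raising the cost to 0.15 stops it after two -- the same mechanism that would let a cost-conditioned policy realize different speed-accuracy operating points at inference without retraining.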
Please use this URL to cite or link to this publication:
author
supervisor
opponent
  • Prof. Kjellström, Hedvig, KTH Royal Institute of Technology, Sweden.
organization
alternative title
Aktiv visuell perception via förstärkningsinlärning
publishing date
type
Thesis
publication status
published
subject
keywords
computer vision, reinforcement learning, deep learning, active vision, object detection, human pose estimation, semantic segmentation
pages
219 pages
publisher
Lund University / Centre for Mathematical Sciences /LTH
defense location
Lecture hall MH:Hörmander, Centre of Mathematical Sciences, Sölvegatan 18, Faculty of Engineering LTH, Lund University, Lund. Zoom: https://lu-se.zoom.us/j/67213391794?pwd=WE1ZOE9KNlZIbTZvYnFhSlVqWU1tZz09
defense date
2021-06-10 13:15:00
ISBN
978-91-7895-795-8
978-91-7895-796-5
language
English
LU publication?
yes
id
6065e35e-b97b-44b8-97b0-a04fe3862a13
date added to LUP
2021-05-12 16:54:06
date last changed
2022-04-07 09:39:20
@phdthesis{6065e35e-b97b-44b8-97b0-a04fe3862a13,
  abstract     = {{Visual perception refers to automatically recognizing, detecting, or otherwise sensing the content of an image, video or scene. The most common contemporary approach to a visual perception task is to train a deep neural network on a pre-existing dataset that provides examples of task success and failure. Despite remarkable recent progress across a wide range of vision tasks, many standard methodologies are static in that they lack mechanisms for adapting to the particular settings or constraints of the task at hand. The ability to adapt is desirable in many practical scenarios, since the operating regime often differs from the training setup. For example, a robot which has learnt to recognize a static set of training images may perform poorly in real-world settings, where it may view objects from unusual angles or explore poorly illuminated environments. Ideally, the robot should then be able to actively position itself to observe the scene from viewpoints where it is more confident, or refine its perception with only a limited amount of training data for its present operating conditions.<br/><br/>In this thesis we demonstrate how reinforcement learning (RL) can be integrated with three fundamental visual perception tasks -- object detection, human pose estimation, and semantic segmentation -- to make the resulting pipelines more adaptive, more accurate, and/or faster. In the first part we equip object detectors with the capacity to actively select which parts of a given image to analyze and when to terminate the detection process. Several ideas are proposed and empirically evaluated, such as explicitly including the speed-accuracy trade-off in the training process, which makes it possible to specify this trade-off during inference. In the second part we consider active multi-view 3D human pose estimation in complex scenarios with multiple people. We explore this in two contexts: i) active triangulation, which requires carefully observing each body joint from multiple viewpoints, and ii) active viewpoint selection for monocular 3D estimators, which requires considering which viewpoints yield accurate estimates when fused. In both settings the viewpoint selection systems face several challenges, such as partial observability resulting, e.g., from occlusions. We show that RL-based methods outperform heuristic ones in accuracy, with negligible computational overhead. Finally, the thesis concludes by establishing a framework for embodied visual active learning in the context of semantic segmentation, where an agent explores a 3D environment and actively queries annotations to refine its visual perception. Our empirical results suggest that reinforcement learning can be successfully applied within this framework as well.}},
  author       = {{Pirinen, Aleksis}},
  isbn         = {{978-91-7895-795-8}},
  keywords     = {{computer vision; reinforcement learning; deep learning; active vision; object detection; human pose estimation; semantic segmentation}},
  language     = {{eng}},
  publisher    = {{Lund University / Centre for Mathematical Sciences /LTH}},
  school       = {{Lund University}},
  title        = {{Reinforcement Learning for Active Visual Perception}},
  url          = {{https://lup.lub.lu.se/search/files/97743536/aleksis_phd_thesis.pdf}},
  year         = {{2021}},
}