Reinforcement Learning for Active Visual Perception
(2021)
- abstract
- Visual perception refers to automatically recognizing, detecting, or otherwise sensing the content of an image, video or scene. The most common contemporary approach to tackle a visual perception task is to train a deep neural network on a pre-existing dataset which provides examples of task success and failure. Despite remarkable recent progress across a wide range of vision tasks, many standard methodologies are static in that they lack mechanisms for adapting to the particular settings or constraints of the task at hand. The ability to adapt is desirable in many practical scenarios, since the operating regime often differs from the training setup. For example, a robot which has learnt to recognize a static set of training images may perform poorly in real-world settings, where it may view objects from unusual angles or explore poorly illuminated environments. The robot should then ideally be able to actively position itself to observe the scene from viewpoints where it is more confident, or refine its perception with only a limited amount of training data for its present operating conditions.
In this thesis we demonstrate how reinforcement learning (RL) can be integrated with three fundamental visual perception tasks -- object detection, human pose estimation, and semantic segmentation -- in order to make the resulting pipelines more adaptive, accurate and/or faster. In the first part we provide object detectors with the capacity to actively select which parts of a given image to analyze and when to terminate the detection process. Several ideas are proposed and empirically evaluated, such as explicitly including the speed-accuracy trade-off in the training process, which makes it possible to specify this trade-off during inference. In the second part we consider active multi-view 3d human pose estimation in complex scenarios with multiple people. We explore this in two different contexts: i) active triangulation, which requires carefully observing each body joint from multiple viewpoints, and ii) active viewpoint selection for monocular 3d estimators, which requires considering which viewpoints yield accurate fused estimates when combined. In both settings the viewpoint selection systems face several challenges, such as partial observability resulting e.g. from occlusions. We show that RL-based methods outperform heuristic ones in accuracy, with negligible computational overhead. Finally, the thesis concludes by establishing a framework for embodied visual active learning in the context of semantic segmentation, where an agent should explore a 3d environment and actively query annotations to refine its visual perception. Our empirical results suggest that reinforcement learning can be successfully applied within this framework as well.
Please use this url to cite or link to this publication:
https://lup.lub.lu.se/record/6065e35e-b97b-44b8-97b0-a04fe3862a13
- author
- Pirinen, Aleksis
- supervisor
- opponent
- Prof. Kjellström, Hedvig, KTH Royal Institute of Technology, Sweden.
- organization
- alternative title
- Aktiv visuell perception via förstärkningsinlärning
- publishing date
- 2021
- type
- Thesis
- publication status
- published
- subject
- keywords
- computer vision, reinforcement learning, deep learning, active vision, object detection, human pose estimation, semantic segmentation
- pages
- 219 pages
- publisher
- Lund University / Centre for Mathematical Sciences / LTH
- defense location
- Lecture hall MH:Hörmander, Centre of Mathematical Sciences, Sölvegatan 18, Faculty of Engineering LTH, Lund University, Lund. Zoom: https://lu-se.zoom.us/j/67213391794?pwd=WE1ZOE9KNlZIbTZvYnFhSlVqWU1tZz09
- defense date
- 2021-06-10 13:15:00
- ISBN
- 978-91-7895-796-5
- 978-91-7895-795-8
- language
- English
- LU publication?
- yes
- id
- 6065e35e-b97b-44b8-97b0-a04fe3862a13
- date added to LUP
- 2021-05-12 16:54:06
- date last changed
- 2022-04-07 09:39:20
@phdthesis{6065e35e-b97b-44b8-97b0-a04fe3862a13,
  author    = {{Pirinen, Aleksis}},
  title     = {{Reinforcement Learning for Active Visual Perception}},
  school    = {{Lund University}},
  publisher = {{Lund University / Centre for Mathematical Sciences / LTH}},
  year      = {{2021}},
  isbn      = {{978-91-7895-796-5}},
  language  = {{eng}},
  keywords  = {{computer vision; reinforcement learning; deep learning; active vision; object detection; human pose estimation; semantic segmentation}},
  url       = {{https://lup.lub.lu.se/search/files/97743536/aleksis_phd_thesis.pdf}},
}