Sparse Multi-View Computer Vision for 3D Human and Scene Understanding
(2025)
- Abstract
- Perceiving and understanding human motion is a fundamental problem in computer vision, with diverse applications encompassing sports analytics, healthcare monitoring, entertainment, and intelligent interactive systems. Multi-camera systems, by capturing multiple viewpoints simultaneously, enable robust tracking and reconstruction of human poses in 3D, overcoming limitations of single-view approaches. This thesis addresses key bottlenecks encountered when designing and deploying multi-camera systems for 3D human and scene understanding beyond controlled laboratory settings.
Paper I introduces a human-pose-based approach to extrinsic camera calibration that leverages naturally occurring human motion in the scene. By incorporating a 3D pose likelihood model in kinematic chain space and a distance-aware confidence-weighted reprojection loss, we enable accurate wide-baseline calibration without calibration equipment. This allows for rapid deployment and reconfiguration of multi-camera systems without requiring technical expertise.
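A distance-aware, confidence-weighted reprojection loss of the kind described can be illustrated with a minimal sketch. All function names and the specific `1 / (1 + alpha * depth)` weighting below are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def project(points_3d, K, R, t):
    """Project Nx3 world points through a pinhole camera (R, t, K).
    Returns pixel coordinates and per-point depths in the camera frame."""
    cam = points_3d @ R.T + t            # world -> camera coordinates
    uv = cam @ K.T                       # apply intrinsics
    return uv[:, :2] / uv[:, 2:3], cam[:, 2]

def weighted_reprojection_loss(points_3d, det_2d, conf, K, R, t, alpha=1.0):
    """Reprojection error weighted by 2D detection confidence and
    down-weighted with distance (a stand-in for distance-aware weighting):
    residuals of far-away, low-confidence joints contribute less."""
    proj, depth = project(points_3d, K, R, t)
    residual = np.linalg.norm(proj - det_2d, axis=1)
    weights = conf / (1.0 + alpha * depth)
    return float(np.sum(weights * residual) / np.sum(weights))
```

In a calibration setting, such a loss would be minimized over the extrinsic parameters (R, t) of each camera, with triangulated human joints serving as the 3D points.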
The reliance on large labeled datasets presents a significant obstacle to the widespread adoption of action recognition systems. In Paper II we propose a self-supervised learning framework for skeleton-based action recognition. We adapt Bootstrap Your Own Latent (BYOL) to learn representations of 3D human pose sequences. Our contributions include multi-viewpoint sampling that leverages existing multi-camera data, and asymmetric augmentation pipelines that bridge the domain gap when fine-tuning the network for downstream tasks. This self-supervised method reduces the need for labeled data, shortening development time for new applications.
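At its core, the BYOL objective reduces to a symmetrised negative cosine similarity between the online network's prediction of one augmented view and the target network's (stop-gradient) projection of another. A minimal sketch, with illustrative function names:

```python
import numpy as np

def byol_loss(online_pred, target_proj):
    """BYOL regression objective: 2 - 2 * cosine similarity between the
    online predictor's output and the target network's projection.
    The target branch receives no gradients; its weights are instead an
    exponential moving average of the online weights."""
    p = online_pred / np.linalg.norm(online_pred, axis=-1, keepdims=True)
    z = target_proj / np.linalg.norm(target_proj, axis=-1, keepdims=True)
    return float(np.mean(2.0 - 2.0 * np.sum(p * z, axis=-1)))

def ema_update(target_w, online_w, tau=0.99):
    """Slow-moving target update, the other half of the BYOL recipe."""
    return [tau * tw + (1.0 - tau) * ow for tw, ow in zip(target_w, online_w)]
```

In the multi-viewpoint variant described above, the two augmented inputs could come from different cameras observing the same motion, rather than from synthetic augmentations alone.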
Paper III focuses on robust 3D human pose reconstruction, particularly in challenging real-world scenarios. Triangulation-based methods struggle in occluded or sparsely-covered scenes. We designed an encoder-decoder Transformer model that regresses 3D human poses from multi-view 2D pose sequences, and introduced a biased attention mechanism that leverages geometric relationships between views and detection confidence scores. Our approach enables robust reconstruction of 3D human poses under heavy occlusion and when few input views are available.
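A biased attention mechanism of this kind can be sketched as scaled dot-product attention with an additive term on the attention scores; here, detection confidence enters as `log(conf)` so that unreliable views are softly suppressed (an illustrative choice, not necessarily the paper's exact bias):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def biased_attention(Q, K, V, conf):
    """Scaled dot-product attention with an additive score bias.
    Keys from low-confidence 2D detections receive a large negative
    bias (log of confidence), so the model attends to reliable views.
    A geometric bias between views could be added the same way."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d) + np.log(conf + 1e-8)[None, :]
    return softmax(scores) @ V
```

Because the bias is additive before the softmax, a confidence near zero effectively removes that view from the weighted sum without any hard gating.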
In Paper IV, we tackle open-vocabulary 3D object detection from sparse multi-view RGB data. Our approach builds on pre-trained, off-the-shelf 2D networks and does not require retraining. We lift 2D detections into 3D via monocular depth estimation, followed by multi-view feature consistency optimization and 3D fusion of sparse proposals. Our experiments show that this approach produces results comparable to state-of-the-art methods in the densely sampled setting, while significantly outperforming the state of the art in the sparse-view setting.
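The 2D-to-3D lifting step via monocular depth is, at its core, a back-projection through the camera intrinsics. A minimal sketch under a pinhole model (function name and interface are illustrative):

```python
import numpy as np

def lift_detection(u, v, depth, K, R, t):
    """Back-project a 2D detection centre (u, v) with an estimated depth
    into a 3D world point:
        X_world = R^T (depth * K^{-1} [u, v, 1]^T - t)
    where (R, t) map world to camera coordinates."""
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])  # viewing ray in camera frame
    X_cam = depth * ray                             # scale ray by estimated depth
    return R.T @ (X_cam - t)                        # camera -> world coordinates
```

Proposals lifted this way from different views can then be compared and fused in a common world frame, which is where the multi-view consistency optimization described above would operate.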
Please use this url to cite or link to this publication:
https://lup.lub.lu.se/record/df5a72c8-4033-461f-846e-d9c721067da3
- author
- Moliner, Olivier
LU
- supervisor
- opponent
- Prof. Rhodin, Helge, Bielefeld University, Germany.
- organization
- publishing date
- 2025
- type
- Thesis
- publication status
- published
- subject
- keywords
- Multi-view Geometry, Extrinsic Camera Calibration, Multi-camera System, 3D Human Pose Estimation, Skeleton-based Action Recognition, Self-supervised Learning, 3D Object Detection, 3D Scene Understanding
- publisher
- Lund University / Centre for Mathematical Sciences /LTH
- defense location
- Lecture Hall MH:Hörmander, Centre of Mathematical Sciences, Märkesbacken 4, Faculty of Engineering LTH, Lund University, Lund. The dissertation will be live streamed, but part of the premises is to be excluded from the live stream. Zoom: https://lu-se.zoom.us/j/69406882444?pwd=PQGCrAosqGNabtGs5pAxec2bQraJaO.1
- defense date
- 2025-10-10 13:15:00
- ISBN
- 978-91-8104-604-5
- 978-91-8104-605-2
- language
- English
- LU publication?
- yes
- id
- df5a72c8-4033-461f-846e-d9c721067da3
- date added to LUP
- 2025-09-14 15:18:40
- date last changed
- 2025-09-18 03:24:48
@phdthesis{df5a72c8-4033-461f-846e-d9c721067da3,
  author    = {{Moliner, Olivier}},
  title     = {{Sparse Multi-View Computer Vision for 3D Human and Scene Understanding}},
  school    = {{Lund University}},
  publisher = {{Lund University / Centre for Mathematical Sciences /LTH}},
  isbn      = {{978-91-8104-604-5}},
  keywords  = {{Multi-view Geometry; Extrinsic Camera Calibration; Multi-camera System; 3D Human Pose Estimation; Skeleton-based Action Recognition; Self-supervised Learning; 3D Object Detection; 3D Scene Understanding}},
  language  = {{eng}},
  url       = {{https://lup.lub.lu.se/search/files/227712297/Thesis_Olivier_Moliner.pdf}},
  year      = {{2025}},
}