
Lund University Publications


Geometry-Biased Transformer for Robust Multi-View 3D Human Pose Reconstruction

Moliner, Olivier; Huang, Sangxia and Åström, Kalle (2024) 18th IEEE International Conference on Automatic Face and Gesture Recognition, FG 2024
Abstract


We address the challenges in estimating 3D human poses from multiple views under occlusion and with limited overlapping views. We approach multi-view, single-person 3D human pose reconstruction as a regression problem and propose a novel encoder-decoder Transformer architecture to estimate 3D poses from multi-view 2D pose sequences. The encoder refines 2D skeleton joints detected across different views and times, fusing multi-view and temporal information through global self-attention. We enhance the encoder by incorporating a geometry-biased attention mechanism, effectively leveraging geometric relationships between views. Additionally, we use detection scores provided by the 2D pose detector to further guide the encoder's attention based on the reliability of the 2D detections. The decoder subsequently regresses the 3D pose sequence from these refined tokens, using pre-defined queries for each joint. To enhance the generalization of our method to unseen scenes and improve resilience to missing joints, we implement strategies including scene centering, synthetic views, and token dropout. We conduct extensive experiments on three benchmark public datasets, Human3.6M, CMU Panoptic and Occlusion-Persons. Our results demonstrate the efficacy of our approach, particularly in occluded scenes and when few views are available, which are traditionally challenging scenarios for triangulation-based methods.
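The geometry-biased attention described in the abstract can be pictured as ordinary multi-head self-attention whose logits receive an additive penalty derived from multi-view geometry. The sketch below is illustrative only and is not taken from the paper: it assumes the bias comes from pairwise epipolar distances between joint tokens, with a learned per-head scale mapping distance to a logit penalty; the module name, tensor shapes, and the geo_dist input are all hypothetical.

# Illustrative sketch (not the authors' code): self-attention whose logits
# are biased by a pairwise geometric distance between joint tokens. The
# epipolar-distance input `geo_dist`, the per-head `bias_scale`, and all
# shapes are assumptions made for this example.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeometryBiasedAttention(nn.Module):
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        # Learned per-head scale turning a distance into a logit penalty.
        self.bias_scale = nn.Parameter(torch.ones(num_heads))

    def forward(self, tokens, geo_dist):
        # tokens:   (B, N, dim)  one token per detected 2D joint (view x time)
        # geo_dist: (B, N, N)    e.g. symmetric epipolar distance between tokens;
        #                        detection scores could be injected the same way.
        B, N, _ = tokens.shape
        q, k, v = self.qkv(tokens).chunk(3, dim=-1)
        q = q.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        logits = q @ k.transpose(-2, -1) / self.head_dim ** 0.5
        # Additive bias: geometrically inconsistent token pairs are penalized.
        logits = logits - self.bias_scale.view(1, -1, 1, 1) * geo_dist.unsqueeze(1)
        attn = F.softmax(logits, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, -1)
        return self.proj(out)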

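To make the robustness strategies concrete, here is a similarly hedged sketch of token dropout: whole joint tokens are randomly masked during training so the decoder must learn to cope with missing detections. The function name and dropout rate are assumptions for illustration, not the authors' implementation.

# Illustrative sketch (assumed implementation): token dropout for robustness
# to missing joints, zeroing whole joint tokens at random during training.
import torch

def token_dropout(tokens, drop_prob=0.1):
    # tokens: (B, N, dim); each of the N joint tokens is kept with
    # probability 1 - drop_prob, independently per sample.
    keep = (torch.rand(tokens.shape[:2], device=tokens.device) > drop_prob).float()
    return tokens * keep.unsqueeze(-1)

In this sketch dropped tokens are simply zeroed; scene centering and synthetic views would likewise act as input-level augmentations applied before the encoder.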
Please use this URL to cite or link to this publication:
author
Moliner, Olivier; Huang, Sangxia and Åström, Kalle
organization
publishing date
2024-07
type
Chapter in Book/Report/Conference proceeding
publication status
published
subject
host publication
2024 IEEE 18th International Conference on Automatic Face and Gesture Recognition, FG 2024
publisher
IEEE - Institute of Electrical and Electronics Engineers Inc.
conference name
18th IEEE International Conference on Automatic Face and Gesture Recognition, FG 2024
conference location
Istanbul, Turkey
conference dates
2024-05-27 - 2024-05-31
external identifiers
  • scopus:85199438076
ISBN
9798350394948
DOI
10.1109/FG59268.2024.10581930
language
English
LU publication?
yes
additional info
Publisher Copyright: © 2024 IEEE.
id
04c52fd8-f45a-4147-812f-4591b6b303e3
date added to LUP
2024-08-14 14:54:17
date last changed
2024-09-12 09:26:08
@inproceedings{04c52fd8-f45a-4147-812f-4591b6b303e3,
  abstract     = {{We address the challenges in estimating 3D human poses from multiple views under occlusion and with limited overlapping views. We approach multi-view, single-person 3D human pose reconstruction as a regression problem and propose a novel encoder-decoder Transformer architecture to estimate 3D poses from multi-view 2D pose sequences. The encoder refines 2D skeleton joints detected across different views and times, fusing multi-view and temporal information through global self-attention. We enhance the encoder by incorporating a geometry-biased attention mechanism, effectively leveraging geometric relationships between views. Additionally, we use detection scores provided by the 2D pose detector to further guide the encoder's attention based on the reliability of the 2D detections. The decoder subsequently regresses the 3D pose sequence from these refined tokens, using pre-defined queries for each joint. To enhance the generalization of our method to unseen scenes and improve resilience to missing joints, we implement strategies including scene centering, synthetic views, and token dropout. We conduct extensive experiments on three benchmark public datasets, Human3.6M, CMU Panoptic and Occlusion-Persons. Our results demonstrate the efficacy of our approach, particularly in occluded scenes and when few views are available, which are traditionally challenging scenarios for triangulation-based methods.}},
  author       = {{Moliner, Olivier and Huang, Sangxia and Åström, Kalle}},
  booktitle    = {{2024 IEEE 18th International Conference on Automatic Face and Gesture Recognition, FG 2024}},
  isbn         = {{9798350394948}},
  language     = {{eng}},
  month        = {{07}},
  publisher    = {{IEEE - Institute of Electrical and Electronics Engineers Inc.}},
  title        = {{Geometry-Biased Transformer for Robust Multi-View 3D Human Pose Reconstruction}},
  url          = {{http://dx.doi.org/10.1109/FG59268.2024.10581930}},
  doi          = {{10.1109/FG59268.2024.10581930}},
  year         = {{2024}},
}