Deep network for the integrated 3D sensing of multiple people in natural images

Zanfir, Andrei; Marinoiu, Elisabeta; Zanfir, Mihai; Popa, Alin Ionut; Sminchisescu, Cristian

Zanfir, Andrei; Marinoiu, Elisabeta; Zanfir, Mihai; Popa, Alin Ionut and Sminchisescu, Cristian (LU) (2018). 32nd Conference on Neural Information Processing Systems, NeurIPS 2018. In Advances in Neural Information Processing Systems 2018-December, p. 8410-8419.

Abstract: We present MubyNet - a feed-forward, multitask, bottom up system for the integrated localization, as well as 3d pose and shape estimation, of multiple people in monocular images. The challenge is the formal modeling of the problem that intrinsically requires discrete and continuous computation, e.g. grouping people vs. predicting 3d pose. The model identifies human body structures (joints and limbs) in images, groups them based on 2d and 3d information fused using learned scoring functions, and optimally aggregates such responses into partial or complete 3d human skeleton hypotheses under kinematic tree constraints, but without knowing in advance the number of people in the scene and their visibility relations. We design a multi-task deep neural network with differentiable stages where the person grouping problem is formulated as an integer program based on learned body part scores parameterized by both 2d and 3d information. This avoids suboptimality resulting from separate 2d and 3d reasoning, with grouping performed based on the combined representation. The final stage of 3d pose and shape prediction is based on a learned attention process where information from different human body parts is optimally integrated. State-of-the-art results are obtained in large scale datasets like Human3.6M and Panoptic, and qualitatively by reconstructing the 3d shape and pose of multiple people, under occlusion, in difficult monocular images.
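
Illustrative note: the abstract describes person grouping as an integer program over body part scores that combine 2d and 3d information. The sketch below is a toy illustration of that general idea only, not the authors' formulation. It assumes a single limb type connecting two sets of joint candidates, replaces the learned scoring functions with a fixed weighted sum (`fuse_scores`, a hypothetical helper), and solves the binary program by exhaustive search rather than with a real solver. All names and numbers are invented for illustration.

```python
from itertools import product


def fuse_scores(score_2d, score_3d, w_2d=0.5, w_3d=0.5):
    """Hypothetical stand-in for a learned scoring function: a fixed weighted sum."""
    return [
        [w_2d * a + w_3d * b for a, b in zip(row_2d, row_3d)]
        for row_2d, row_3d in zip(score_2d, score_3d)
    ]


def group_limbs(scores):
    """Solve a tiny binary integer program by brute force.

    maximize   sum_ij scores[i][j] * x[i][j]
    subject to each candidate i and each candidate j used at most once,
               x[i][j] in {0, 1}.
    Returns (best objective value, list of selected (i, j) pairs).
    """
    n, m = len(scores), len(scores[0])
    best_value, best_pairs = 0.0, []
    # Each row i either stays unmatched (None) or picks one column j.
    for choice in product([None, *range(m)], repeat=n):
        used = [j for j in choice if j is not None]
        if len(used) != len(set(used)):
            continue  # a column was selected twice: infeasible
        value = sum(scores[i][j] for i, j in enumerate(choice) if j is not None)
        if value > best_value:
            best_value = value
            best_pairs = [(i, j) for i, j in enumerate(choice) if j is not None]
    return best_value, best_pairs


# Toy scores for 2 candidates of one joint type vs. 2 of another (made-up numbers).
score_2d = [[0.9, 0.6], [0.7, 0.6]]
score_3d = [[0.1, 0.9], [0.9, 0.2]]

_, pairs_2d_only = group_limbs(score_2d)
_, pairs_fused = group_limbs(fuse_scores(score_2d, score_3d))
print("2d only:", pairs_2d_only)
print("fused:  ", pairs_fused)
```

In this toy case, grouping on the 2d scores alone selects a different pairing than grouping on the fused scores, which is the kind of discrepancy the abstract attributes to separate 2d and 3d reasoning.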

Please use this url to cite or link to this publication: https://lup.lub.lu.se/record/2f76e797-e555-4382-8009-8de26d87edc2

author

Zanfir, Andrei; Marinoiu, Elisabeta; Zanfir, Mihai; Popa, Alin Ionut and Sminchisescu, Cristian (LU)

organization

publishing date

2018

type

Chapter in Book/Report/Conference proceeding

publication status

published

subject

Computer graphics and computer vision

host publication

Advances in Neural Information Processing Systems 31 (NIPS 2018)

series title

Advances in Neural Information Processing Systems

volume

2018-December

pages

10 pages

conference name

32nd Conference on Neural Information Processing Systems, NeurIPS 2018

conference location

Montreal, Canada

conference dates

2018-12-02 - 2018-12-08

external identifiers

scopus:85064803925

ISSN

1049-5258

language

English

LU publication?

yes

id

2f76e797-e555-4382-8009-8de26d87edc2

date added to LUP

2019-05-08 14:48:48

date last changed

2025-10-14 09:46:42

@inproceedings{2f76e797-e555-4382-8009-8de26d87edc2,
  abstract     = {{We present MubyNet - a feed-forward, multitask, bottom up system for the integrated localization, as well as 3d pose and shape estimation, of multiple people in monocular images. The challenge is the formal modeling of the problem that intrinsically requires discrete and continuous computation, e.g. grouping people vs. predicting 3d pose. The model identifies human body structures (joints and limbs) in images, groups them based on 2d and 3d information fused using learned scoring functions, and optimally aggregates such responses into partial or complete 3d human skeleton hypotheses under kinematic tree constraints, but without knowing in advance the number of people in the scene and their visibility relations. We design a multi-task deep neural network with differentiable stages where the person grouping problem is formulated as an integer program based on learned body part scores parameterized by both 2d and 3d information. This avoids suboptimality resulting from separate 2d and 3d reasoning, with grouping performed based on the combined representation. The final stage of 3d pose and shape prediction is based on a learned attention process where information from different human body parts is optimally integrated. State-of-the-art results are obtained in large scale datasets like Human3.6M and Panoptic, and qualitatively by reconstructing the 3d shape and pose of multiple people, under occlusion, in difficult monocular images.}},
  author       = {{Zanfir, Andrei and Marinoiu, Elisabeta and Zanfir, Mihai and Popa, Alin Ionut and Sminchisescu, Cristian}},
  booktitle    = {{Advances in Neural Information Processing Systems 31 (NIPS 2018)}},
  issn         = {{1049-5258}},
  language     = {{eng}},
  pages        = {{8410--8419}},
  series       = {{Advances in Neural Information Processing Systems}},
  title        = {{Deep network for the integrated 3D sensing of multiple people in natural images}},
  volume       = {{2018-December}},
  year         = {{2018}},
}
