MobilePose: Real-Time 3D Hand Pose Estimation from a Single RGB Image

Larsson, David; Hellmark, Sofie

Larsson, David (LU) and Hellmark, Sofie (LU) (2020) In Master's Theses in Mathematical Sciences FMAM05 20201
Mathematics (Faculty of Engineering)

Abstract: Estimating 3D hand poses from RGB images is a challenging task. In this work we construct efficient neural networks to regress sparse 3D skeletons consisting of 21 keypoints in the hand. Additionally, heatmaps are regressed to locate the keypoints in 2D. The networks can be divided into three parts: feature extraction, heatmap regression and 3D pose regression. To obtain the 3D coordinates relative to the camera, we introduce a method based on the best projection given the predictions. The main focus has been investigating network structures proven to be efficient in other computer vision tasks: EfficientNet, EfficientDet and MobileNetV2. A weighted bi-directional feature pyramid network, BiFPN, inspired by EfficientDet, was added to MobileNetV2. This resulted in a new proposed network structure, MobilePose. The size of a network is affected by the input image resolution. Decreasing the resolution resulted in lower inference time. Images of size 112 × 112 were used to achieve real-time performance. However, the best accuracy was obtained with 224 × 224 images, the highest resolution tested. EfficientDet and MobilePose performed best, with similar accuracy on the FreiHAND dataset. Comparing inference time on a Samsung S10 mobile device, MobilePose is preferred. MobilePose was improved by adding complexity to the network. This resulted in the highest accuracy overall, with an average keypoint error of 1.4 cm, assuming the depth of a root keypoint is known, and 5.0 cm with calculated root depth.
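The abstract mentions recovering camera-relative 3D coordinates from "the best projection given the predictions" but does not spell out the procedure. As an illustration only, and not necessarily the thesis's method, the sketch below uses a standard linear least-squares fit: given root-relative 3D keypoints, predicted 2D keypoints and pinhole intrinsics, it solves for the translation whose projection best matches the 2D points. The function name `solve_root_translation` and the intrinsics parameters `fx`, `fy`, `cx`, `cy` are assumptions introduced for this example.

```python
import numpy as np


def solve_root_translation(rel_xyz, uv, fx, fy, cx, cy):
    """Hypothetical helper: least-squares camera-space translation t such
    that projecting rel_xyz + t through a pinhole camera best matches uv.

    rel_xyz: (N, 3) root-relative 3D keypoints, in metric units.
    uv:      (N, 2) 2D keypoints in pixels.
    """
    rel_xyz = np.asarray(rel_xyz, dtype=float)
    uv = np.asarray(uv, dtype=float)
    x, y, z = rel_xyz[:, 0], rel_xyz[:, 1], rel_xyz[:, 2]
    du = uv[:, 0] - cx
    dv = uv[:, 1] - cy
    n = len(rel_xyz)
    zeros = np.zeros(n)

    # From du = fx * (x + tx) / (z + tz):  fx*tx - du*tz = du*z - fx*x
    a_u = np.stack([np.full(n, fx), zeros, -du], axis=1)
    b_u = du * z - fx * x
    # From dv = fy * (y + ty) / (z + tz):  fy*ty - dv*tz = dv*z - fy*y
    a_v = np.stack([zeros, np.full(n, fy), -dv], axis=1)
    b_v = dv * z - fy * y

    a = np.concatenate([a_u, a_v])
    b = np.concatenate([b_u, b_v])
    t, *_ = np.linalg.lstsq(a, b, rcond=None)
    return t  # (tx, ty, tz); tz is the estimated root depth
```

The third component of the returned vector plays the role of the calculated root depth. This formulation minimises an algebraic error rather than the reprojection error in pixels, so it is best read as a starting point.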
Popular Abstract (translated from Swedish): Interacting with virtual objects with high precision in everyday life has long been considered futuristic. In our master's thesis we have used neural networks to take a first step towards achieving this. Imagine being able to try on clothes and jewellery at home before shopping online, or to furnish your apartment before buying the furniture. Imagine being able to do everything you do with your phone today, and much more, by interacting with a virtual display seen through your new smart lenses.

The field in which this is happening is called Augmented Reality, AR. With AR you can interact with three-dimensional virtual objects in the real world and bring up virtual information in front of you. This can simplify everyday life for you as an individual, and also for many industries. AR is used today in, among other areas, repair and maintenance, to get quick access to information, and in the games industry.

To interact with virtual objects you need a device, for example smart glasses or a smartphone, that perceives the movements of your hands. For more advanced applications than those that exist today to become reality, it is also important to know the position of the hands with high accuracy. Moreover, this has to be determined in real time on a mobile device.

This is exactly what we have worked on in our master's thesis: producing a 3D model from an image of a hand in real time. To do this we have used artificial neural networks, a branch of machine learning.

We have designed networks that, from an image of a hand, reconstruct the three-dimensional position of 21 joints in the hand. From these positions we produce a skeleton model of the hand in both two and three dimensions. By being shown many images together with the correct answers, the network learns to produce 3D skeletons from images it has not seen before.

For 80% of all joint points in the hand images we tested on, the network found the position to within 2 cm, and the mean error was 1.4 cm. To run the network in real time we reduced the image resolution. This led to a somewhat worse result, where 65% of the joint points were placed to within 2 cm and the mean error was 2 cm.
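The two figures quoted here, the mean error and the share of joints located to within 2 cm, are commonly computed from per-keypoint Euclidean distances. The sketch below shows one way to do that. The function name `keypoint_metrics` and its arguments are hypothetical, and the thesis's exact evaluation protocol (for example, any alignment applied before measuring) is not stated in this record.

```python
import numpy as np


def keypoint_metrics(pred_xyz, true_xyz, threshold_cm=2.0):
    """Hypothetical helper: mean Euclidean keypoint error and the fraction
    of keypoints whose error is below threshold_cm.

    pred_xyz, true_xyz: arrays of shape (..., 3) in centimetres.
    """
    pred = np.asarray(pred_xyz, dtype=float)
    true = np.asarray(true_xyz, dtype=float)
    # Euclidean distance per keypoint.
    errors = np.linalg.norm(pred - true, axis=-1)
    mean_error = float(errors.mean())
    fraction_within = float((errors < threshold_cm).mean())
    return mean_error, fraction_within
```

Called with predictions and ground truth of shape (images, 21, 3), it returns one mean error in centimetres and one fraction between 0 and 1 for the whole test set.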

In many images the hand was partially occluded, both by objects and by itself. Despite this, the network did a good job of reconstructing the 3D shape of the hand. The greatest difficulty was getting the position right in the depth direction.

Our result is good compared with others who have tested their models on the same images. In addition, we manage to perform the task in real time.

The work was done in collaboration with Crunchfish and is part of improving their gesture recognition and extending its current area of use.

Please use this url to cite or link to this publication: http://lup.lub.lu.se/student-papers/record/9022807

author

Larsson, David (LU) and Hellmark, Sofie (LU)

supervisor

organization

Mathematics (Faculty of Engineering)

alternative title

MobilePose: Beräkning av 3D hand pose från en RGB bild i realtid (Swedish; in English: "MobilePose: Real-time computation of 3D hand pose from an RGB image")

course

FMAM05 20201

year

2020

type

H2 - Master's Degree (Two Years)

subject

Mathematics and Statistics

keywords

3D hand pose estimation, real-time, Convolutional neural networks, Image analysis, Machine learning, AR

publication/series

Master's Theses in Mathematical Sciences

report number

LUTFMA-3411-2020

ISSN

1404-6342

other publication id

2020:E31

language

English

id

9022807

date added to LUP

2020-06-29 14:17:42

date last changed

2020-06-29 14:17:42

@misc{9022807,
  abstract     = {{Estimating 3D hand poses from RGB images is a challenging task. In this work we construct efficient neural networks to regress sparse 3D skeletons consisting of 21 keypoints in the hand. Additionally, heatmaps are regressed to locate the keypoints in 2D. The networks can be divided into three parts: feature extraction, heatmap regression and 3D pose regression. To obtain the 3D coordinates relative to the camera, we introduce a method based on the best projection given the predictions. The main focus has been investigating network structures proven to be efficient in other computer vision tasks: EfficientNet, EfficientDet and MobileNetV2. A weighted bi-directional feature pyramid network, BiFPN, inspired by EfficientDet, was added to MobileNetV2. This resulted in a new proposed network structure, MobilePose. The size of a network is affected by the input image resolution. Decreasing the resolution resulted in lower inference time. Images of size 112 × 112 were used to achieve real-time performance. However, the best accuracy was obtained with 224 × 224 images, the highest resolution tested. EfficientDet and MobilePose performed best, with similar accuracy on the FreiHAND dataset. Comparing inference time on a Samsung S10 mobile device, MobilePose is preferred. MobilePose was improved by adding complexity to the network. This resulted in the highest accuracy overall, with an average keypoint error of 1.4 cm, assuming the depth of a root keypoint is known, and 5.0 cm with calculated root depth.}},
  author       = {{Larsson, David and Hellmark, Sofie}},
  issn         = {{1404-6342}},
  language     = {{eng}},
  note         = {{Student Paper}},
  series       = {{Master's Theses in Mathematical Sciences}},
  title        = {{MobilePose: Real-Time 3D Hand Pose Estimation from a Single RGB Image}},
  year         = {{2020}},
}

LUP Student Papers

LUND UNIVERSITY LIBRARIES
