Master's Thesis Defense, Department of Information and Communication Engineering: Future Person Localization in First-Person Videos


Transcript of the Master's Thesis Defense, Department of Information and Communication Engineering: Future Person Localization in First-Person Videos

Page 1:

Master's Thesis Defense, Department of Information and Communication Engineering

Future Person Localization in First-Person Videos
(一人称視点映像における人物位置予測)

February 4, 2019

Yoichi Sato Laboratory

Takuma Yagi (八木拓真)

Page 2:

First-person vision

Use body-worn wearable cameras

Analyze videos that reflect the wearer's actions and interests

Page 3:


Page 4:


Page 5:


Page 6:

Future person localization in third-person videos

(1) The use of appearance features

Learns preferences for walkable areas [Kitani+, ECCV'12]

The use of holistic visual attributes [Ma+, CVPR’17]

(2) The use of interactions between people

Computer simulation (Social force) [Helbing+, ‘95]

Data-driven approach (Social LSTM) [Alahi+, CVPR’16]


Page 7:

Future person localization in third-person videos

Social LSTM [Alahi+, CVPR’16]

Models each pedestrian with an LSTM

A social pooling layer squashes the features of neighboring people into a fixed-size vector

→ Cannot be directly applied to first-person videos

Page 8:

First-person future person localization

Predicts the wearer's future position

Egocentric Future Localization [Park+, CVPR’16]

Page 9:

Future person localization in first-person videos

[Figure: given observations up to the current frame t, the person's future locations are unknown ("? ? ?")]

Our Challenge:

To develop a future person localization method tailored to first-person videos

Page 10:

Our approach

Incorporates both pose and ego-motion as salient cues in first-person videos

A multi-stream CNN predicts the future locations of a person

Pose indicates future direction; ego-motion captures interactive locomotion

Page 11:

Proposed method: tri-stream 1D-CNN

Input: sequence of each feature (location & scale, poses, ego-motions)

Multi-stream conv-net (convolution)

Page 12:

Proposed method: tri-stream 1D-CNN

Input: sequence of each feature (locations & scales, poses, ego-motions)

Multi-stream conv-net (convolution)

Concatenate

Single-stream deconv-net (deconvolution)

Output: sequence of future locations and scales
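As a shape-level sketch, the tri-stream conv/deconv pipeline above can be written as follows. This is a minimal illustration in plain NumPy with random weights; the per-stream channel width (16), kernel size (3), and 10-frame input are illustrative assumptions, not the hyperparameters used in the thesis.

```python
import numpy as np

def conv1d(x, w):
    # Valid 1D convolution: x (C_in, T), w (C_out, C_in, k) -> (C_out, T-k+1)
    c_out, c_in, k = w.shape
    t_out = x.shape[1] - k + 1
    y = np.zeros((c_out, t_out))
    for t in range(t_out):
        y[:, t] = np.tensordot(w, x[:, t:t + k], axes=([1, 2], [0, 1]))
    return y

def deconv1d(x, w):
    # Transposed 1D convolution (stride 1): x (C_in, T), w (C_in, C_out, k) -> (C_out, T+k-1)
    c_in, c_out, k = w.shape
    y = np.zeros((c_out, x.shape[1] + k - 1))
    for t in range(x.shape[1]):
        y[:, t:t + k] += np.tensordot(x[:, t], w, axes=([0], [0]))
    return y

rng = np.random.default_rng(0)
T_in = 10                                     # assumed number of observed frames
loc_scale = rng.standard_normal((3, T_in))    # x, y, scale
pose      = rng.standard_normal((36, T_in))   # 18 keypoints x 2
ego       = rng.standard_normal((6, T_in))    # translation (3) + rotation (3)

k = 3
streams = []
for x in (loc_scale, pose, ego):
    w = rng.standard_normal((16, x.shape[0], k)) * 0.1
    streams.append(conv1d(x, w))              # each stream: (16, T_in - 2)

h = np.concatenate(streams, axis=0)           # concatenate: (48, T_in - 2)
w_out = rng.standard_normal((48, 3, k)) * 0.1
future = deconv1d(h, w_out)                   # output: (3, T_in) future locations & scales
print(future.shape)                           # (3, 10)
```

The point of the sketch is the data flow: three feature streams are convolved independently, concatenated channel-wise, and a transposed convolution expands the fused representation back into a sequence of future location-and-scale triples.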

Page 13:

Feature representation

Location-scale cue (3 dims)

Location (2 dims) + scale (1 dim)

Captures the perspective effect via apparent size

Pose cue (2D × 18 keypoints = 36 dims)

Pretrained OpenPose [Cao+, CVPR'17]

Normalized by position and scale

Missing detections imputed

Ego-motion cue (6 dims)

Camera pose estimation from multiple frames [Zhou+, CVPR'17]

Translation (3 dims) + rotation (3 dims)

Accumulate local movement between frames
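The "accumulate local movement" step can be sketched by composing per-frame relative camera poses into a cumulative motion in the first frame's coordinates. The Euler-angle convention below is an assumption; the slides do not specify the estimator's convention.

```python
import numpy as np

def euler_to_mat(r):
    # Rotation matrix from Euler angles (rx, ry, rz), composed Rz @ Ry @ Rx
    # (this ordering is an assumption for illustration)
    rx, ry, rz = r
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def accumulate_ego_motion(rel_motions):
    # rel_motions: list of (translation (3,), rotation (3,)) between consecutive frames.
    # Returns cumulative translation and rotation in the first frame's coordinates.
    R_acc = np.eye(3)
    t_acc = np.zeros(3)
    for t_rel, r_rel in rel_motions:
        t_acc = t_acc + R_acc @ np.asarray(t_rel)   # move in current heading
        R_acc = R_acc @ euler_to_mat(r_rel)          # then update heading
    return t_acc, R_acc

# Sanity check: pure forward translation with no rotation simply sums up
steps = [(np.array([0.0, 0.0, 0.5]), np.zeros(3))] * 4
t_acc, R_acc = accumulate_ego_motion(steps)
print(t_acc)  # [0. 0. 2.]
```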

Page 14:

Data collection

Recorded walking video sequences in diverse cluttered scenes

One subject, total 4.5 hours, captured over 5,000 people

Annotations by tracking people

□: tracked ≥2 s, □: tracked <2 s

Page 15:

Baseline methods

Constant: Uses the location at the final input frame as the prediction

ConstVel: Assumes a constant-velocity model using the mean speed of the inputs

NNeighbor: Extracts the k (=16) nearest-neighbor input sequences, then produces the output as the mean of the corresponding locations

Social LSTM [Alahi+, CVPR'16]: the state-of-the-art method on fixed cameras
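The ConstVel baseline above admits a compact sketch. Treating the mean per-frame displacement vector as the constant velocity is an assumption about the exact averaging used.

```python
import numpy as np

def const_vel_predict(locs, n_future):
    # locs: (T, 2) observed image locations.
    # Extrapolate linearly using the mean per-frame displacement.
    locs = np.asarray(locs, dtype=float)
    v = np.diff(locs, axis=0).mean(axis=0)        # mean velocity over the input
    steps = np.arange(1, n_future + 1)[:, None]
    return locs[-1] + steps * v                   # (n_future, 2) future locations

obs = np.array([[0.0, 0.0], [1.0, 0.5], [2.0, 1.0]])
pred = const_vel_predict(obs, 2)
print(pred)  # rows: [3, 1.5] and [4, 2]
```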

15

Page 16:

Prediction example (input: 1sec, output: 1sec)

[Legend: Input, GT, NNeighbor, Social LSTM [Alahi+, CVPR'16], Proposed]

Page 17:

Prediction example (input: 1sec, output: 1sec)

[Legend: Input, GT, NNeighbor, Social LSTM, Proposed]

Page 18:

Quantitative evaluation

One-second prediction error (unit: % against frame width)

Method        Error (%)
Constant      8.54
ConstVel      8.37
NNeighbor     7.69
Social LSTM   9.25
Proposed      6.04

(equivalent to roughly 60 cm of physical error)
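The error metric can be sketched as the displacement at the final predicted frame, reported as a percentage of the frame width. The 1280-pixel frame width in the example is an assumption for illustration.

```python
import numpy as np

def fde_percent(pred, gt, frame_width):
    # Final displacement error: Euclidean distance between the last predicted
    # and last ground-truth location, normalized by frame width (in %).
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    err = np.linalg.norm(pred[-1] - gt[-1])
    return 100.0 * err / frame_width

pred = [[100, 100], [110, 105]]   # predicted trajectory (pixels)
gt   = [[100, 100], [118, 111]]   # ground-truth trajectory (pixels)
print(fde_percent(pred, gt, frame_width=1280))  # 0.78125
```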

Page 19:

Ablation study

One-second prediction error (unit: % against frame width)

Features           Toward   Away   Average
Location + scale   9.26     6.02   6.40
+ Ego-motion       8.80     5.80   6.18
+ Pose             8.38     6.00   6.29
Proposed           8.06     5.76   6.04

Pose contributes to predicting people walking toward the wearer

Ego-motion contributes to predicting people walking away from the wearer

Page 20:

Effect of prediction length

Prediction error increases linearly with prediction length

The rate of error increase is lower than that of the Social LSTM baseline

[Chart: Final Displacement Error (% against frame width) vs. prediction length (0 to 1.0 s), comparing Social LSTM and Proposed]

Page 21:

Predicting longer-term future

Two-second prediction error (unit: % against frame width); input: 0.6 s, output: 2.0 s

Method        Toward   Away    Average   Average (1.0 s)
Social LSTM   22.12    17.56   17.75     9.23
Proposed      13.68    9.54    9.75      6.04

Able to predict a longer-term future with only a modest increase in error

Page 22:

Failure case (existence of obstacles)

[Legend: Input, GT, NNeighbor, Social LSTM, Proposed]

Page 23:

Failure case (sudden direction change of the wearer)

[Legend: Input, GT, Proposed]

Page 24:

Study on the Social Interactions dataset [Fathi+, CVPR'12]

Head-mounted videos recorded in a theme park (a more challenging setting)


Page 25:

Quantitative evaluation

Modest performance even in head-mounted videos

One-second prediction error (unit: % against frame width)

Method        Error (%)
Constant      11.65
ConstVel      13.34
NNeighbor     12.65
Social LSTM   13.90
Proposed      9.10

Page 26:

Prediction examples (input: 1sec, output: 1sec)

[Timeline: −0.9 s → current → +1.0 s]

[Legend: Input, GT, Proposed]

Page 27:

Summary

New Problem

Future person localization in first-person videos

Finding

Both the target's pose and the wearer's ego-motion were shown to be effective cues

Limitations

Cross-subject evaluation (a single wearer is assumed in this work)

Offline inference (currently not real-time)

Future Directions

Forecasting under uncertainty

Separating the prediction of the wearer and that of the target

Page 28:

Publications

International conference (refereed)

Takuma Yagi, Karttikeya Mangalam, Ryo Yonetani and Yoichi Sato, "Future Person Localization in First-Person Videos", IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7593-7602, 2018. (Spotlight oral, acceptance rate: 8.9%)

Domestic research workshops (non-refereed)

Takuma Yagi, Karttikeya Mangalam, Ryo Yonetani and Yoichi Sato, "Future Person Localization in First-Person Videos" (in Japanese), 21st Meeting on Image Recognition and Understanding (MIRU), 2018.

Takuma Yagi, Karttikeya Mangalam, Ryo Yonetani and Yoichi Sato, "Future Person Localization in First-Person Videos" (in Japanese), 211th IPSJ SIG-CVIM Research Meeting, 2018.