Actions in the Eye: Dynamic Gaze Datasets and Learnt Saliency Models for Visual Recognition

Mathe, Stefan; Sminchisescu, Cristian

Mathe, Stefan and Sminchisescu, Cristian (LU) (2015). In IEEE Transactions on Pattern Analysis and Machine Intelligence 37(7), p. 1408-1424

Abstract: Systems based on bag-of-words models from image features collected at maxima of sparse interest point operators have been used successfully for both computer visual object and action recognition tasks. While the sparse, interest-point based approach to recognition is not inconsistent with visual processing in biological systems that operate in 'saccade and fixate' regimes, the methodology and emphasis in the human and the computer vision communities remain sharply distinct. Here, we make three contributions aiming to bridge this gap. First, we complement existing state-of-the-art large scale dynamic computer vision annotated datasets like Hollywood-2 [1] and UCF Sports [2] with human eye movements collected under the ecological constraints of visual action and scene context recognition tasks. To our knowledge these are the first large human eye tracking datasets to be collected and made publicly available for video (vision.imar.ro/eyetracking; 497,107 frames, each viewed by 19 subjects), unique in terms of their (a) large scale and computer vision relevance, (b) dynamic, video stimuli, (c) task control, as well as free-viewing. Second, we introduce novel dynamic consistency and alignment measures, which underline the remarkable stability of patterns of visual search among subjects. Third, we leverage the significant amount of collected data in order to pursue studies and build automatic, end-to-end trainable computer vision systems based on human eye movements. Our studies not only shed light on the differences between computer vision spatio-temporal interest point image sampling strategies and human fixations, as well as their impact on visual recognition performance, but also demonstrate that human fixations can be accurately predicted and, when used in an end-to-end automatic system leveraging some of the advanced computer vision practice, can lead to state-of-the-art results.
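The abstract refers to measuring how stable fixation patterns are across subjects, but this record does not define the paper's dynamic consistency and alignment measures. As a loosely related illustration only, the sketch below computes a generic leave-one-subject-out agreement score for a single frame: a smoothed fixation map is built from all other subjects and scored by how well it separates the held-out subject's fixations from random pixel locations. The function names, the Gaussian width `sigma`, and the random-pixel negative sampling are all assumptions made for this example, not the authors' method.

```python
import math
import random


def fixation_map(fixations, width, height, sigma=2.0):
    """Sum an isotropic Gaussian centred on each (x, y) fixation."""
    grid = [[0.0] * width for _ in range(height)]
    for fx, fy in fixations:
        for y in range(height):
            for x in range(width):
                d2 = (x - fx) ** 2 + (y - fy) ** 2
                grid[y][x] += math.exp(-d2 / (2.0 * sigma ** 2))
    return grid


def auc(positives, negatives):
    """Chance that a positive score beats a negative one (ties count 0.5)."""
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def leave_one_out_consistency(fixations_by_subject, width, height,
                              n_negatives=200, seed=0):
    """Average AUC of predicting each subject's fixations from the others.

    fixations_by_subject maps a subject id to a list of integer (x, y)
    pixel coordinates for one frame (hypothetical input layout).
    """
    rng = random.Random(seed)
    scores = []
    for held_out, held_fixations in fixations_by_subject.items():
        # Pool the fixations of every subject except the held-out one.
        others = [f for subject, fs in fixations_by_subject.items()
                  if subject != held_out for f in fs]
        smap = fixation_map(others, width, height)
        positives = [smap[y][x] for x, y in held_fixations]
        # Negatives: map values at uniformly random pixel locations.
        negatives = [smap[rng.randrange(height)][rng.randrange(width)]
                     for _ in range(n_negatives)]
        scores.append(auc(positives, negatives))
    return sum(scores) / len(scores)


# Toy data: three subjects who all look near the same image region.
agreeing = {
    "s1": [(10, 10), (11, 9)],
    "s2": [(9, 11), (10, 12)],
    "s3": [(12, 10), (10, 8)],
}
score = leave_one_out_consistency(agreeing, width=32, height=24)
print(f"leave-one-out agreement: {score:.3f}")
```

A score near 1 means the other subjects' fixations predict the held-out subject well, and a score near 0.5 means no better than chance. Extending this to video would require some form of temporal treatment, which this single-frame sketch deliberately leaves out.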

Please use this url to cite or link to this publication: https://lup.lub.lu.se/record/7602115

author

Mathe, Stefan and Sminchisescu, Cristian (LU)

organization

publishing date

2015

type

Contribution to journal

publication status

published

subject

Computer graphics and computer vision

keywords

Visual action recognition, human eye-movements, consistency analysis, saliency prediction, large scale learning

in

IEEE Transactions on Pattern Analysis and Machine Intelligence

volume

37

issue

7

pages

1408 - 1424

publisher

IEEE - Institute of Electrical and Electronics Engineers Inc.

external identifiers

wos:000355931100009
scopus:84961654805
pmid:26352449

ISSN

1939-3539

DOI

10.1109/TPAMI.2014.2366154

language

English

LU publication?

yes

id

e4efe293-637e-4466-ac49-c9675eeea446 (old id 7602115)

date added to LUP

2016-04-01 13:58:15

date last changed

2025-04-04 15:26:47

@article{e4efe293-637e-4466-ac49-c9675eeea446,
  abstract     = {{Systems based on bag-of-words models from image features collected at maxima of sparse interest point operators have been used successfully for both computer visual object and action recognition tasks. While the sparse, interest-point based approach to recognition is not inconsistent with visual processing in biological systems that operate in 'saccade and fixate' regimes, the methodology and emphasis in the human and the computer vision communities remain sharply distinct. Here, we make three contributions aiming to bridge this gap. First, we complement existing state-of-the-art large scale dynamic computer vision annotated datasets like Hollywood-2 [1] and UCF Sports [2] with human eye movements collected under the ecological constraints of visual action and scene context recognition tasks. To our knowledge these are the first large human eye tracking datasets to be collected and made publicly available for video (vision.imar.ro/eyetracking; 497,107 frames, each viewed by 19 subjects), unique in terms of their (a) large scale and computer vision relevance, (b) dynamic, video stimuli, (c) task control, as well as free-viewing. Second, we introduce novel dynamic consistency and alignment measures, which underline the remarkable stability of patterns of visual search among subjects. Third, we leverage the significant amount of collected data in order to pursue studies and build automatic, end-to-end trainable computer vision systems based on human eye movements. Our studies not only shed light on the differences between computer vision spatio-temporal interest point image sampling strategies and human fixations, as well as their impact on visual recognition performance, but also demonstrate that human fixations can be accurately predicted and, when used in an end-to-end automatic system leveraging some of the advanced computer vision practice, can lead to state-of-the-art results.}},
  author       = {{Mathe, Stefan and Sminchisescu, Cristian}},
  issn         = {{1939-3539}},
  keywords     = {{Visual action recognition; human eye-movements; consistency analysis; saliency prediction; large scale learning}},
  language     = {{eng}},
  number       = {{7}},
  pages        = {{1408--1424}},
  publisher    = {{IEEE - Institute of Electrical and Electronics Engineers Inc.}},
  journal      = {{IEEE Transactions on Pattern Analysis and Machine Intelligence}},
  title        = {{Actions in the Eye: Dynamic Gaze Datasets and Learnt Saliency Models for Visual Recognition}},
  url          = {{http://dx.doi.org/10.1109/TPAMI.2014.2366154}},
  doi          = {{10.1109/TPAMI.2014.2366154}},
  volume       = {{37}},
  year         = {{2015}},
}

