Spatio-temporal attention models for grounded video captioning

Zanfir, Mihai; Marinoiu, Elisabeta; Sminchisescu, Cristian

Zanfir, Mihai; Marinoiu, Elisabeta and Sminchisescu, Cristian (2017) In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 10114 LNCS. p.104-119

Abstract: Automatic video captioning is challenging due to the complex interactions in dynamic real scenes. A comprehensive system would ultimately localize and track the objects, actions and interactions present in a video and generate a description that relies on temporal localization in order to ground the visual concepts. However, most existing automatic video captioning systems map from raw video data to high level textual description, bypassing localization and recognition, thus discarding potentially valuable information for content localization and generalization. In this work we present an automatic video captioning model that combines spatio-temporal attention and image classification by means of deep neural network structures based on long short-term memory. The resulting system is demonstrated to produce state-of-the-art results in the standard YouTube captioning benchmark while also offering the advantage of localizing the visual concepts (subjects, verbs, objects), with no grounding supervision, over space and time.

Please use this url to cite or link to this publication: https://lup.lub.lu.se/record/bddc74a2-9f6d-4215-8826-0cf95609c23f

author

Zanfir, Mihai; Marinoiu, Elisabeta and Sminchisescu, Cristian

organization

publishing date

2017

type

Chapter in Book/Report/Conference proceeding

publication status

published

subject

Computer graphics and computer vision

host publication

Computer Vision - 13th Asian Conference on Computer Vision, ACCV 2016, Revised Selected Papers

series title

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

volume

10114 LNCS

pages

16 pages

publisher

Springer

external identifiers

scopus:85016047898

ISSN

16113349

03029743

ISBN

9783319541891

DOI

10.1007/978-3-319-54190-7_7

language

English

LU publication?

yes

id

bddc74a2-9f6d-4215-8826-0cf95609c23f

date added to LUP

2017-04-06 15:12:25

date last changed

2025-04-04 14:29:31

@inbook{bddc74a2-9f6d-4215-8826-0cf95609c23f,
  abstract     = {{Automatic video captioning is challenging due to the complex interactions in dynamic real scenes. A comprehensive system would ultimately localize and track the objects, actions and interactions present in a video and generate a description that relies on temporal localization in order to ground the visual concepts. However, most existing automatic video captioning systems map from raw video data to high level textual description, bypassing localization and recognition, thus discarding potentially valuable information for content localization and generalization. In this work we present an automatic video captioning model that combines spatio-temporal attention and image classification by means of deep neural network structures based on long short-term memory. The resulting system is demonstrated to produce state-of-the-art results in the standard YouTube captioning benchmark while also offering the advantage of localizing the visual concepts (subjects, verbs, objects), with no grounding supervision, over space and time.}},
  author       = {{Zanfir, Mihai and Marinoiu, Elisabeta and Sminchisescu, Cristian}},
  booktitle    = {{Computer Vision - 13th Asian Conference on Computer Vision, ACCV 2016, Revised Selected Papers}},
  isbn         = {{9783319541891}},
  issn         = {{16113349}},
  language     = {{eng}},
  pages        = {{104--119}},
  publisher    = {{Springer}},
  series       = {{Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)}},
  title        = {{Spatio-temporal attention models for grounded video captioning}},
  url          = {{http://dx.doi.org/10.1007/978-3-319-54190-7_7}},
  doi          = {{10.1007/978-3-319-54190-7_7}},
  volume       = {{10114 LNCS}},
  year         = {{2017}},
}
