Recall and perceived naturalness of asynchronous speech and gesture

Nirme, Jens LU (2016) 7th Conference of the International Society for Gesture Studies p.257-257
Abstract
Part of the justification for an integrated view of speech and gestures is their temporal coordination. Gestures generally coincide with or precede, but rarely follow, their lexical affiliates (McNeill, 1992). How synchrony affects listeners remains less explored, despite its potential relevance for video communication and virtual conversational agents. ERP studies suggest that temporal alignment affects how words and gestures are integrated (Obermeier & Gunter, 2015; Habets et al., 2011). Explicit perception of asynchrony is less sensitive, and shifts longer than 1 s can be tolerated (Kirchhof, 2014). However, gestures that are preceded by their lexical affiliates deviate from the pattern expected given regular exposure to speech, which might implicitly affect listeners. We investigated whether the asymmetry of timing observed in production is reflected in differential effects of gestures shifted in either direction, on how natural listeners perceive the speaker's behavior to be (Exp1) and/or on their processing and subsequent recall of words (Exp2). Using motion capture to animate virtual speakers (giving explanations) allowed us to shift specific gesture strokes within longer segments while preserving synchronized lip movements. For 16 short segments we produced videos in 3 conditions defined by the timing of a target gesture stroke relative to a target word: either overlapping (SYNC), or shifted 500 ms earlier (G-BEFORE) or later (G-AFTER). We classified the verbal content overlapping with shifted strokes into the (unequally frequent) categories “congruent”, “incongruent” and “filled/unfilled pauses”. In Exp1, 32 participants saw a composition of 4 videos from each of the 3 conditions above, plus a variation of SYNC with distorted pitch during a few non-target words (AUDIO). After each video, participants rated their impression of whether it was based on a capture of natural behavior or was artificially generated (by an undefined algorithm).
We transformed each participant’s responses to the range between 0 (most artificial) and 1 (most natural). Results revealed no significant differences between conditions. However, comparing ratings across the categories of overlap revealed that strokes shifted onto “filled/unfilled pauses” were rated as more artificial. In Exp2, 79 participants saw all 16 videos in one of four conditions: SYNC, G-BEFORE and G-AFTER were contrasted with a condition in which the target gestures were seamlessly extinguished. Following each video and a distraction task, participants attempted to repeat what they had heard in the video. Results revealed impaired recall of target words with extinguished or delayed gestures. In summary, asynchronous gestures were not perceived as less natural as long as they overlapped with words. Synchronous and preceding, but not following, gestures facilitated recall, as would be expected if the processing of speech and gestures (involved in this particular task) is tuned to temporal patterns common in natural speech.
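The abstract states that each participant's responses were transformed to the range 0 to 1 but does not say how. One common way to do this is a per-participant min-max rescaling, sketched below. This is an assumption for illustration, not the author's documented procedure; the function names and the handling of a participant who gives identical ratings throughout are hypothetical.

```python
from typing import Dict, List


def rescale_ratings(ratings: List[float]) -> List[float]:
    """Min-max rescale one participant's ratings to the range [0, 1].

    Assumed method: the abstract only says responses were transformed
    to 0 (most artificial) .. 1 (most natural), not which formula was used.
    """
    lo, hi = min(ratings), max(ratings)
    if hi == lo:
        # Hypothetical choice: a participant with no spread in their
        # ratings is mapped to the midpoint of the range.
        return [0.5 for _ in ratings]
    return [(r - lo) / (hi - lo) for r in ratings]


def rescale_all(by_participant: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Apply the rescaling separately to every participant."""
    return {pid: rescale_ratings(rs) for pid, rs in by_participant.items()}
```

Rescaling within each participant removes differences in how individuals use the rating scale, so that ratings can be compared across conditions on a common 0 to 1 range.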
type
Contribution to conference
publication status
published
keywords
Co-speech gestures, multimodal integration, timing, animation, memory, comprehension
pages
1 page
conference name
7th Conference of the International Society for Gesture Studies
language
English
LU publication?
yes
id
d436d32c-5c80-4384-8123-b50c130f0f1d
alternative location
https://isgs7.sciencesconf.org/conference/isgs7/ISGS_BoA_16_07_19.pdf#page=257
date added to LUP
2016-12-09 14:10:31
date last changed
2016-12-09 14:25:32
@misc{d436d32c-5c80-4384-8123-b50c130f0f1d,
  abstract     = {Part of the justification for an integrated view of speech and gestures is their temporal coordination. Gestures generally coincide with or precede, but rarely follow, their lexical affiliates (McNeill, 1992). How synchrony affects listeners remains less explored, despite its potential relevance for video communication and virtual conversational agents. ERP studies suggest that temporal alignment affects how words and gestures are integrated (Obermeier & Gunter, 2015; Habets et al., 2011). Explicit perception of asynchrony is less sensitive, and shifts longer than 1 s can be tolerated (Kirchhof, 2014). However, gestures that are preceded by their lexical affiliates deviate from the pattern expected given regular exposure to speech, which might implicitly affect listeners. We investigated whether the asymmetry of timing observed in production is reflected in differential effects of gestures shifted in either direction, on how natural listeners perceive the speaker's behavior to be (Exp1) and/or on their processing and subsequent recall of words (Exp2). Using motion capture to animate virtual speakers (giving explanations) allowed us to shift specific gesture strokes within longer segments while preserving synchronized lip movements. For 16 short segments we produced videos in 3 conditions defined by the timing of a target gesture stroke relative to a target word: either overlapping (SYNC), or shifted 500 ms earlier (G-BEFORE) or later (G-AFTER). We classified the verbal content overlapping with shifted strokes into the (unequally frequent) categories “congruent”, “incongruent” and “filled/unfilled pauses”. In Exp1, 32 participants saw a composition of 4 videos from each of the 3 conditions above, plus a variation of SYNC with distorted pitch during a few non-target words (AUDIO). After each video, participants rated their impression of whether it was based on a capture of natural behavior or was artificially generated (by an undefined algorithm).
We transformed each participant’s responses to the range between 0 (most artificial) and 1 (most natural). Results revealed no significant differences between conditions. However, comparing ratings across the categories of overlap revealed that strokes shifted onto “filled/unfilled pauses” were rated as more artificial. In Exp2, 79 participants saw all 16 videos in one of four conditions: SYNC, G-BEFORE and G-AFTER were contrasted with a condition in which the target gestures were seamlessly extinguished. Following each video and a distraction task, participants attempted to repeat what they had heard in the video. Results revealed impaired recall of target words with extinguished or delayed gestures. In summary, asynchronous gestures were not perceived as less natural as long as they overlapped with words. Synchronous and preceding, but not following, gestures facilitated recall, as would be expected if the processing of speech and gestures (involved in this particular task) is tuned to temporal patterns common in natural speech.},
  author       = {Nirme, Jens},
  keyword      = {Co-speech gestures,multimodal integration,timing,animation,memory,comprehension},
  language     = {eng},
  month        = {07},
  pages        = {257--257},
  title        = {Recall and perceived naturalness of asynchronous speech and gesture},
  year         = {2016},
}