Semantically related gestures facilitate language comprehension during simultaneous interpreting

Eléonore Arbona; Kilian G. Seeber; Marianne Gullberg

doi:10.1017/S136672892200058X

Semantically related gestures facilitate language comprehension during simultaneous interpreting

Published online by Cambridge University Press: 10 October 2022

Eléonore Arbona

Kilian G. Seeber and

Marianne Gullberg

Show author details

Eléonore Arbona*: Affiliation:
Faculty of Translation and Interpreting, University of Geneva, Geneva, Switzerland
Kilian G. Seeber: Affiliation:
Faculty of Translation and Interpreting, University of Geneva, Geneva, Switzerland
Marianne Gullberg: Affiliation:
Centre for Languages and Literature and Lund University Humanities Lab, Lund University, Lund, Sweden
*: Author for correspondence: Eléonore Arbona, FTI, Département d'Interprétation Université de Genève 40 bd du Pont-d'Arve CH-1211 Genève 4 E-mail: [email protected]

Article contents

Abstract
Introduction
First experiment
Second experiment
General discussion
Conclusion
Competing interests
Data availability
Footnotes
References

Rights & Permissions

Abstract

Manual co-speech gestures can facilitate language comprehension, but do they influence language comprehension in simultaneous interpreters, and if so, is this influence modulated by simultaneous interpreting (SI) and/or by interpreting experience? In a picture-matching task, 24 professional interpreters and 24 professional translators were exposed to utterances accompanied by semantically matching representational gestures, semantically unrelated pragmatic gestures, or no gestures while viewing passively (interpreters and translators) or during SI (interpreters only). During passive viewing, both groups were faster with semantically related than with semantically unrelated gestures. During SI, interpreters showed the same result. The results suggest that language comprehension is sensitive to the semantic relationship between speech and gesture, and facilitated when speech and gestures are semantically linked. This sensitivity is not modulated by SI or interpreting experience. Thus, despite simultaneous interpreters’ extreme language use, multimodal language processing facilitates comprehension in SI the same way as in all other language processing.

Keywords

simultaneous interpreting language comprehension gestures multimodality bilingualism

Type: Research Article
Information: Bilingualism: Language and Cognition , Volume 26 , Issue 2 , March 2023 , pp. 425 - 439

DOI: https://doi.org/10.1017/S136672892200058X [Opens in a new window]
Creative Commons: This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright: Copyright © The Author(s), 2022. Published by Cambridge University Press

Introduction

Co-speech gestures are body movements that accompany speech, and are temporally, semantically and pragmatically coordinated with speech (Kendon, Reference Kendon2004; McNeill, Reference McNeill1985). Manual co-speech gestures (i.e., co-speech gestures made with the hands) facilitate spoken language comprehension in monolingual first language (L1) and second language (L2) settings (e.g., Dahl & Ludvigsen, Reference Dahl and Ludvigsen2014; Hostetter, Reference Hostetter2011; Sueyoshi & Hardison, Reference Sueyoshi and Hardison2005) as they offer a parallel representation of meaning which can be redundant, supplementary, etc. Simultaneous interpreting (SI), an instance of extreme language use (Hervais-Adelman, Moser-Mercer, & Golestani, Reference Hervais-Adelman, Moser-Mercer and Golestani2015), involves simultaneous processing and comprehension of spoken language input (Seeber, Reference Seeber, Schwieter and Ferreira2017) and production of language output in another spoken languageFootnote ¹. Thus, one could expect simultaneous interpreters to benefit from co-speech gestures during language comprehension just like L1 and L2 speakers do. However, the fact that interpreters also produce verbal output while comprehending may modulate the effect of gesture on language comprehension. Yet there is some evidence suggesting that interpreting expertise might positively influence cognitive performance, e.g., dual-task performance (Strobach, Becker, Schubert, & Kühn, Reference Strobach, Becker, Schubert and Kühn2015) or cognitive flexibility (Yudes, Macizo, & Bajo, Reference Yudes, Macizo and Bajo2011), suggesting that interpreters might be better than other bilinguals at attending to visual and auditory input in parallel. However, empirical data are scarce on the potential influence of gestures on language comprehension in SI.

In this study we therefore strive to bridge this gap, and explore whether gestures influence language comprehension during SI. Specifically, we look at simultaneous interpreters’ comprehension of semantically related/unrelated co-speech manual gestures during SI and during passive viewing/listening, in comparison to a bilingual group with no interpreting experience.

The role of gestures in language comprehension

A growing body of work shows that co-speech gestures and speech are intimately related, forming a so-called integrated system (McNeill, Reference McNeill1985, Reference McNeill1992, Reference McNeill2005). Indeed, co-speech gestures and speech have been shown to develop, break down, and be processed in parallel (Capirci & Volterra, Reference Capirci and Volterra2008; Colletta, Guidetti, Capirci, Cristilli, Demir, Kunene-Nicolas, & Levine, Reference Colletta, Guidetti, Capirci, Cristilli, Demir, Kunene-Nicolas and Levine2015; Goldin-Meadow, Reference Goldin-Meadow2003; Graziano & Gullberg, Reference Graziano and Gullberg2018; Holle & Gunter, Reference Holle and Gunter2007; Kelly, Özyürek, & Maris, Reference Kelly, Özyürek and Maris2009; Mayberry & Jaques, Reference Mayberry and Jaques2000; Mayberry & Nicoladis, Reference Mayberry and Nicoladis2000; Rose, Reference Rose2006; Wu & Coulson, Reference Wu and Coulson2007; for a review of the integrated relationship between gesture and speech, see Kelly, Reference Kelly, Church, Alibali and Kelly2017).

Gestures can represent the properties of entities and events talked about drawing on iconicity or similarity of shape, size, movement (McNeill, Reference McNeill1992); e.g., when a speaker makes a large, circular gesture while saying, “It was a big, round one” (Church, Garber, & Rogalski, Reference Church, Garber and Rogalski2007, p. 138). Such representational gestures typically express information that is semantically related to concurrent speech. Other gestures can express pragmatic aspects of speech such as rhythm, speech acts, stance, or aspects of discourse structure (Kendon, Reference Kendon1995, Reference Kendon2004), such as when a speaker rotates both forearms outwards with extended fingers to a “palm up” position to display incapacity, powerlessness or indifference (Debras, Reference Debras2017). Such pragmatic gestures have a more complex semantic relationship to concurrent speech compared to representational gestures. There is considerable evidence that semantically related (representational) gestures facilitate spoken language comprehension in naïve listeners when the meaning in speech and gesture is congruent (see Hostetter, Reference Hostetter2011 for a review). When processing multimodal information, comprehenders build a single unified meaning representation without necessarily realising which particular channel the information came from (Cassell, McNeill, & McCullough, Reference Cassell, McNeill and McCullough1999; Gullberg & Kita, Reference Gullberg and Kita2009). Furthermore, priming studies using multimodal stimuli have revealed an interaction between semantically related gestures and speech, even when one modality is irrelevant to the experimental task (Kelly, Creigh, & Bartolotti, Reference Kelly, Creigh and Bartolotti2010; Kelly, Healey, Özyürek, & Holler, Reference Kelly, Healey, Özyürek and Holler2015; Langton & Bruce, Reference Langton and Bruce2000). The integrated-systems hypothesis, which is based on these observations, posits that gestures necessarily influence the processing of speech, while speech necessarily influences the processing of gestures (Kelly et al., Reference Kelly, Özyürek and Maris2009). Furthermore, empirical evidence suggests that co-speech gestures may also contribute to language comprehension in L2 speakers (Dahl & Ludvigsen, Reference Dahl and Ludvigsen2014; Sueyoshi & Hardison, Reference Sueyoshi and Hardison2005). Although the link between speech and gestures is thus undisputed (Gullberg, Reference Gullberg2006), speech-gesture integration during a complex task such as SI remains under-explored.

Multimodality in SI

SI is considered a complex process (Frauenfelder & Schriefers, Reference Frauenfelder and Schriefers1997; Moser-Mercer, Reference Moser-Mercer2000) since it combines concurrent spoken language comprehension and production in two distinct languages. From a cognitive point of view, the language comprehension component in SI is likely to share common features with other language comprehension tasks (Seeber, Reference Seeber, Schwieter and Ferreira2017). For example, just like during ordinary language comprehension, interpreters process speakers’ input while having access to various sources of visual information, including gestures (Galvão, Reference Galvão, Carapinha and Santos2013; Gieshoff, Reference Gieshoff2018; Seeber, Reference Seeber, Schwieter and Ferreira2017). Indeed, practitioners generally deem visual access necessary for successful interpretation (Bühler, Reference Bühler1985). More specifically, hand gestures and facial expressions are considered to be the most important visual sources of information, since they are viewed as facilitating understanding and emphasis (Rennert, Reference Rennert2008). What is more, visual access to speakers, including to their gestures, is enshrined in the working conditions issued by the International Association of Conference Interpreters (AIIC, 2007) and in ISO standards (ISO, 2016b, 2017). More generally, some have argued that interpreters should use any piece of information that can make language comprehension easier or faster, since comprehension is a key component of SI (Bühler, Reference Bühler1985). Moreover, there is some evidence suggesting that interpreting expertise might positively influence cognitive performance, e.g., dual-task performance (Strobach et al., Reference Strobach, Becker, Schubert and Kühn2015) or cognitive flexibility (Yudes et al., Reference Yudes, Macizo and Bajo2011), suggesting that interpreters might be better than other bilinguals at attending to visual and auditory input in parallel. Thus, even when engaging in SI, interpreters might be able to benefit from gestures just like other comprehenders do in L1 and L2 settings. To date, however, this assumption has not been empirically corroborated.

Until recently, the focus of translation and interpreting studies has been on the investigation of written and oral texts as verbal artifacts, meaning that written and spoken discourse has been studied in isolation from other non-verbal resources (González, Reference González, Bermann and Porter2014). This is reflected in many influential models of SI that focus on the verbal signal and have failed to give sufficient prominence to the integration of different channels. Some efforts have been made to document and empirically measure the impact of visual access to speakers on SI (Anderson, Reference Anderson, Lambert and Moser-Mercer1994; Bacigalupe, Reference Bacigalupe, Alvarez Lugris and Fernandez Ocampo1999; Balzani, Reference Balzani, Gran and Taylor1990; Rennert, Reference Rennert2008; Tommola & Lindholm, Reference Tommola, Lindholm and Tommola1995). For example, Anderson (Reference Anderson, Lambert and Moser-Mercer1994) carried out an experiment with twelve professional interpreters to assess the effect of visual access on SI, and found no significant difference between the audiovisual and the audio-only condition. Tommola and Lindholm (Reference Tommola, Lindholm and Tommola1995) used a similar setup with eight experienced interpreters, and found no significant effect of visual access either. However, the set-ups did not allow us to ascertain whether interpreters were attending to the visual stimuli in the audiovisual condition and, if they were, what cues they might have processed. Therefore, it remains unclear how interpreters actually allocate visual attention to (and how they process) specific visual cues, especially gestures.

Investigating multimodal processing in SI using eye-tracking

In many experimental paradigms, eye-tracking techniques allow for the minimally-invasive recording of perceivers’ visual behaviour toward gestures without compromising the ecological validity of the task (Gullberg & Holmqvist, Reference Gullberg and Holmqvist1999). In gesture studies, eye-tracking has been used in experiments focusing notably on the perception and processing of gestural information (Beattie, Webster, & Ross, Reference Beattie, Webster and Ross2010; Gullberg & Holmqvist, Reference Gullberg and Holmqvist1999, Reference Gullberg and Holmqvist2006; Gullberg & Kita, Reference Gullberg and Kita2009) as well as on the integration of gesture and speech during reference resolution (Campana, Silverman, Tanenhaus, Bennetto, & Packard, Reference Campana, Silverman, Tanenhaus, Bennetto and Packard2005). Gullberg and Holmqvist (Reference Gullberg and Holmqvist1999, Reference Gullberg and Holmqvist2006) established that, both in live and video conditions, addressees looked at the speaker's face the vast majority of the time while gestures were mainly perceived through peripheral vision; that said, gestural holds (a momentary cessation of movement in a gesture) and gestures that speakers themselves looked at attracted addressees’ fixations more frequently. Beattie et al. (Reference Beattie, Webster and Ross2010) found that short character-viewpoint gestures attracted more fixations than other gestures, suggesting that they are particularly information-rich.

As to interpreting studies, several studies have used eye-tracking to examine SI as a multimodal rather than a purely verbal process (Galvão & Rodrigues, Reference Galvão, Rodrigues, Diaz Cintas, Matamala and Neves2010; Seeber, Reference Seeber, Schwieter and Ferreira2017; Stachowiak-Szymczak, Reference Stachowiak-Szymczak2019). In an eye-tracking experiment using pictures rather than gestures, Stachowiak-Szymczak (Reference Stachowiak-Szymczak2019) found that visual and auditory input were integrated in SI, highlighting the multimodal nature of the task. Seeber (Reference Seeber2011) conducted an eye-tracking experiment relating interpreters’ fixations on visual information to the auditory content, concluding that interpreters do attend to visual cues, including gestured numbers.

To summarise, gestures and speech form an integrated system where both channels influence each other, and gestures have been shown to facilitate language comprehension in native and non-native speakers alike. While implying concurrent language comprehension and production in distinct languages, SI comprises a language comprehension component. If semantically related gestures have the potential to facilitate language comprehension during an extreme language task such as SI, interpreters should be expected to benefit from access to such gestures. The studies on visual access in SI conducted up until now have presented concomitant visual cues, not just gestures, and/or have not allowed for the establishment of a direct link between visual cues and language comprehension. Thus, it remains unclear whether gestures have the same influence during SI. Yet, the majority of interpreting practitioners, the main professional association and a growing number of scholars recognise the multimodal character of SI, including the potential positive influence of gestures. Moreover, evidence suggests that interpreters might be better than other bilinguals at attending to visual and auditory input in parallel.

The current study

This study aimed to investigate the potential facilitatory effects of semantically related gestures on simultaneous interpreters’ language comprehension. The first experiment looked at task-contingent differences in processing audiovisual signals. It compared how interpreters comprehend audiovisual signals during SI versus during passive viewing/listening as measured by a picture matching task, and examined the effect of semantically related gestures (target gesture condition), semantically unrelated gestures (control gesture condition), and the absence of gestures (no-gesture condition) on comprehension. We expected that a congruent speech-gesture meaning pair (semantically related gestures) would be more easily processed than one where the speech-gesture meaning relationship is less clear (semantically unrelated gestures). We thus hypothesised a facilitatory influence of semantically congruent gestures on language comprehension. Language comprehension was measured through response accuracy and reaction times in the picture-matching task. Using eye-tracking, we also measured overt visual attention to gestures, operationalised as total visual dwell time (the total duration of fixations on a particular area of interest) on the speaker's gesture space, a pre-defined area of interest in front of the speaker going from the speaker's shoulders to her hips, since this is where speakers usually gesture (McNeill, Reference McNeill1992, p. 86). Monitoring what participants look at during the stroke, the meaningful part of the gesture, enabled us to examine the extent to which overt visual attention to gestures correlates with response accuracy and reaction times. The experiment aimed to address the following questions:

1. Do simultaneous interpreters integrate gestural information during language comprehension? 2. If so, is such integration affected by task (cf. SI versus during passive viewing/listening)? 3. Do simultaneous interpreters visually attend to gestures?

The second experiment examined experience-contingent differences in processing audiovisual signals. It compared language comprehension during passive listening/viewing in two groups: an experimental group of professional simultaneous interpreters and a comparison group of professional translators (i.e., language professionals who change written words into another written language; the main difference with simultaneous interpreters being that they work with text rather than with speech in realtime and that there is no simultaneity requirement) without SI experience. The aim was to determine whether interpreters behave differently than other bilinguals due to their SI experience, which could have influenced their performance in the first experiment. The same variables were analysed as in the first experiment to address the following questions:

4. Do translators integrate gestural information during language comprehension in the same way as interpreters? 5. Do translators visually attend to gestures in the same way as interpreters?

First experiment

Method

Participants

Twenty-four professional conference interpreters participated in the studyFootnote Footnote Footnote ⁴ (see Table 1). They were recruited via e-mail describing the eligibility criteria. Participants completed an adapted version of the Language Experience and Proficiency Questionnaire (Marian, Blumenfeld, & Kaushanskaya, Reference Marian, Blumenfeld and Kaushanskaya2007). All participants had normal or corrected-to-normal vision and reported no language disorders. Participants’ L1 was French (A languageFootnote ⁵), their L2 English (A, B or C languageFootnote ⁶). Twenty-one of the 24 participants were either members of the International Association of Conference Interpreters (AIIC), or accredited by international organisations such as the United Nations, or both. The three remaining participants were professional conference interpreters based in Geneva.

Table 1. Background information provided in the language background questionnaire.

All participants gave written informed consent. The experiment was approved by the Faculty of Translation and Interpreting's Ethics Committee. No participant was involved in the norming of the stimuli.

Task and materials

Participants were asked to either simultaneously interpret (SI activity) or to watch (passive viewing/listening activity) short video clips of a speaker uttering two sentences (e.g., “Look at the terrace! Last Monday, the girl picked the lemon”). The second sentence was either accompanied by a semantically related gesture, a semantically unrelated gesture, or no gesture. Participants were then presented with two drawings corresponding to an action verb, one target drawing (e.g., picking a lemon) and one distractor (e.g., squeezing a lemon), and asked to choose the drawing corresponding to the video by pressing a button. There was no time limit. This picture-matching task was used to probe language comprehension. Accuracy and response times were recorded. Since they express meaning differently from both speech and gesture, drawings enabled us to implicitly probe gesture content.

Speech

We created a first set of 30 utterances following one of two patterns: adverbial phrase of time, agent, verb and patient (e.g., “Last Monday, the girl picked the lemon.”), or adverbial phrase of time, agent, verb, preposition and indication of location (e.g., “Two weeks ago, the boy swung on the rope”). The target word was the main verb. A short introductory sentence (e.g., “Look at the terrace”) was added to ensure interpreters would interpret simultaneously.

We then created a second set of 30 sentences replacing the verbs with equally plausible candidates (e.g., “Last Monday, the girl squeezed the lemon.” or “Two weeks ago, the boy climbed up the rope”). The resulting 60 sentences (word count: M = 11.6, SD = 0.8) were assigned to two matched stimulus lists. Target verb frequency indications were obtained from the Corpus of Contemporary American English (Davies, Reference Davies2008) and two lists of sentences were created to balance verb frequency. Mean verb frequency was 72,788 (SD = 79,898) in List A and 62,041 (SD = 74,694) in List B, with no significant difference across lists, p > .6. Sentence plausibility was rated separately by 28 French and 28 English speakers (n raters = 56) on 6-point Likert-type scales (from 1, “very implausible” to 6, “very plausible”). Mean sentence plausibility (the average of the French and the English rating) was 3 (SD = 0.9) in List A and 3 (SD = 0.9) in List B, with no significant difference across lists, p > .8. Raters who rated plausibility did not participate in the norming of pictures.

Gestures

We devised manual gestures to accompany the sentences in the semantically related gesture condition versus in the semantically unrelated gesture condition. Semantically related gestures were representational gestures corresponding to the content of the target verb. For example, for “squeeze the lemon”, the speaker performed a squeezing gesture: “right hand: hand half open in front of speaker, palm facing down, then fist closing with a rotation of the wrist” (see Figure 1a). Semantically related gestures depicted path rather than manner of movement in motion verbs (i.e., they showed a trajectory, e.g., going up, but did not provide information about manner of motion, e.g., no wiggling of fingers to indicate climbing).

Fig. 1. Examples of gestures. a. Character-viewpoint semantically related gesture for “squeezing”. b. Character-viewpoint semantically related gesture for “slicing”. c. Observer-viewpoint semantically related gesture for “swinging”. d. Semantically unrelated gesture for “squeezing”.

Semantically unrelated gestures were instantiated as pragmatic gestures (Kendon, Reference Kendon2004) with no semantic relationship to the target verb. We used five forms: the Open Hand Prone and Open Hand Supine families (Kendon, Reference Kendon2004, pp. 248–283), the ‘slice gesture’, and the ‘power grip’ (Streeck, Reference Streeck2008), and the ‘flick of the hand’ (McNeill, Reference McNeill1992, p. 16). For example, for “squeeze the lemon”, the speaker performed the gesture called “Open hand prone (‘palm down’) – vertical palm”. Here the speaker's palm and forearm are vertical so that the palm of the hand faces directly away from the speaker (Fig. 1d).

All sentences were recorded audiovisually by a right-handed female speaker of North-American English in a sound-proof recording studio in controlled lighting conditions. Three versions of each sentence pair were recorded: one in which the speaker did not gesture while uttering the sentences (no-gesture condition), one in which the speaker performed a pragmatic hand gesture while uttering the target verb (semantically unrelated gesture condition), and one in which she performed a representational hand gesture while uttering the target verb (semantically related gesture condition). Sentences were read from a prompter. The general intended gestural movement for each clip was described to the speaker but she was asked to perform her own version of them so that they would be as natural as possible. All gestures were performed with the speaker's dominant (right) hand. The mean duration of the audiovisual recordings was 4.8 seconds (SD = 0.4, range 3.7–5.5). Horizontally flipped versions of each video clip were created using Adobe Premiere Pro, so that the speaker also seemed to be gesturing with her non-dominant hand. This was to balance out a potential right-hand bias.

Once the clips had been recorded, gestures were coded to control for several features to ensure that these were evenly distributed across the lists and conditions (see Appendix S1, Supplementary Materials). Semantically related gestures were coded and controlled for viewpoint (Character versus Observer Viewpoint; McNeill, Reference McNeill1992). A character viewpoint incorporates the speaker's body into gesture space, with the speaker's hands representing the hands of a character: e.g., the speaker might move her hand as if she were slicing meat herself (Fig. 1b). In contrast, an observer-viewpoint gesture excludes the speaker's body from gesture space, and hands play the part of the character as a whole: the speaker might move her hand from left to right with a swinging movement to depict a character swinging on a rope (Fig. 1c).

Gestures were further coded for their timing relative to speech to ascertain that the stroke coincided temporally with the spoken verb form. We further coded gestures for ‘single’ versus ‘repeated stroke’. In single stroke gestures the stroke is performed once, while in repeated gestures the stroke is repeated twice. The number of repetitions was matched across lists for semantically related and unrelated gestures. Place of gestural articulation was coded following an adapted version of McNeill's schema of gesture space (McNeill, Reference McNeill1992, p. 89) as in Gullberg and Kita (Reference Gullberg and Kita2009). The ‘center-center’ and ‘center’ categories were merged into one ‘center’ category, while the ‘upper periphery’, ‘lower periphery’, etc., were merged into one ‘periphery’ category. Place of articulation was thus coded as either ‘center’, ‘periphery’ or ‘center-periphery’. Gestures were also coded for complexity of trajectory. Straight lines in any direction were coded as a ‘simple trajectory’ and more complex patterns were coded as ‘complex trajectories’ (e.g., when the stroke included a change of direction).

Verb duration was determined for each video clip by identifying verb onset, offset and preposition offset in the case of the Observer-Viewpoint category. Mean verb duration was comparable between semantically related items (M = 491 ms, SD = 98) and semantically unrelated items (M = 496 ms, SD = 112). Mean verb duration of no-gesture items was significantly shorter (M = 445 ms, SD = 112) than both semantically related gesture items (p < .05) and semantically unrelated gesture items (p < .05), possibly because the coordination of speech and gesture slowed down production.

Stroke duration was determined for all gestures and included post-stroke-holds, when present. Mean stroke duration did not differ significantly between semantically related (M = 585 ms, SD = 118) and unrelated gestures (M = 612 ms, SD = 152).

Semantically related and unrelated items were comparable in terms of stroke type and of complexity of trajectory. Both categories included 67% single stroke gestures (40 items) versus 33% repeated stroke gestures (20 items), and 83% simple trajectories (50 gestures) versus 17% complex trajectories (10 gestures). Items differed in place of articulation, as semantically unrelated items were mostly articulated in the ‘center-periphery’ area (65%, 39 items) whereas semantically related gestures were mostly performed centrally (58%, 35 gestures).

All gestures used in the experiment are described in Appendix S2 (Supplementary Materials).

Pictures

Black-and-white line drawings corresponding to the actions depicted in the target verbs were taken from the IPNP database (Szekely, Jacobsen, D'Amico, Devescovi, Andonova, Herron, Lu, Pechmann, Pléh, Wicha, Federmeier, Gerdjikova, Gutierrez, Hung, Hsu, Iyer, Kohnert, Mehotcheva, Orozco-Figueroa, Tzeng, Tzeng, Arévalo, Vargha, Butler, Buffington, & Bates, Reference Szekely, Jacobsen, D'Amico, Devescovi, Andonova, Herron, Lu, Pechmann, Pléh, Wicha, Federmeier, Gerdjikova, Gutierrez, Hung, Hsu, Iyer, Kohnert, Mehotcheva, Orozco-Figueroa, Tzeng, Tzeng, Arévalo, Vargha, Butler, Buffington and Bates2004). Since more drawings were needed, most of the pictures were created by an artist using the same format. The drawings were normed for name and concept agreement, familiarity and visual complexity as in Snodgrass and Vanderwart (Reference Snodgrass and Vanderwart1980) by 11 L1 English speakers and 10 L1 French speakers. Pictures that did not yield satisfactory measures were redrawn and normed by 10 L1 English speakers and 11 L1 French speakers (some raters were involved in both norming rounds; total n = 31). A sweepstake incentive of 50 CHF (for each language group) was made available.

Raters were asked to identify pictures as briefly and unambiguously as possible by typing in the first description (a verb) that came to mind. Concept agreement, which takes into account synonyms (e.g., “cut” and “carve” are acceptable answers for the target “slice”), was calculated as in Snodgrass and Vanderwart (Reference Snodgrass and Vanderwart1980). Picture pairs with concept agreement of over 70% were used. The same raters judged the familiarity of each picture – that is, the extent to which they came in contact with or thought about the concept. Concept familiarity was rated on a 5-point Likert-type scale (from 1 = “very unfamiliar” to 5 = “very familiar”). The same raters rated the complexity of each picture – that is, the amount of detail or intricacy of the drawings. Picture visual complexity was rated on a 5-point Likert-type scale (from 1 = “very simple” to 5 = “very complex”).

As shown in Appendix S1, Supplementary Materials, lists were balanced in terms of sentence plausibility, verb frequency, verb duration, gesture viewpoint, stroke type, stroke duration, place of articulation, gesture trajectory, concept agreement, concept familiarity, and visual complexity of the picture. Gesture conditions were balanced as to stroke type, stroke duration, gesture trajectory, but differed in terms of verb duration and place of articulation.

We created 24 blocks to accommodate the three gesture conditions, and counterbalance for gesture handedness (right/left hand), and target picture position (right/left side). Each block comprised four practice trials and 30 critical trials (10 of each gesture condition). Trial-type order was randomised in each block. Each session consisted of four blocks, two assigned to the SI activity, two to the passive viewing/listening activity. Activity order was counterbalanced across participants. Each participant saw List A and List B twice, but never saw the same individual trial twice. A total of 180 sentences were created, and each participant was presented with 60 experimental sentences in each task, meaning with 120 sentences in the whole experiment. Of these 120 sentences, one third corresponded to the condition without any gestures, one third to the semantically related gesture condition and one third to the semantically unrelated gesture condition.

Apparatus

Experimental tasks were completed in an ISO4043-compliant mobile interpreting booth (ISO, 2016a), programmed in SR Experiment-Builder® and deployed on a Mac Mini®. Visual stimuli were presented on a 23’’ (58.4 cm) HP E232 display with a refresh rate of 60 Hz, located approximately 75 cm from the participants. Auditory stimuli were played over an LBB 3443 Bosch headset. Eye-movement data were acquired with an SR Research EyeLink® 1000 desktop-mounted remote eye-tracking system with a sampling rate of 500 Hz. The eye-tracker camera was located in front of the monitor, leaving a distance of approximately 60 cm between participants’ eyes and the eye-tracker. Participants’ spoken interpretations were recorded using a Bosch DCN-IDESK-D interpreting console and fed back into the EyeLink to generate time-aligned stereophonic recordings of stimulus audio output and participant audio input. The input device was a VPixx Technologies RESPONSEPixx HANDHELD 5-button response box.

Procedure

Each session consisted of four blocks and lasted approximately one hour. Each block started with a standard 9-point calibration of the eye-tracker. After validation, participants completed a practice-trial session. During and at the end of the practice session, participants could ask questions. Participants were then instructed to launch the critical trials by pressing a button. Participants had timed three-minute breaks between blocks. During interpreted blocks, the experimenter monitored whether participants were interpreting the trial sentences simultaneously, and, if necessary, reminded participants. No feedback was given during the experiment. The experimenter monitored the eye-tracking display and recalibrated when necessary.

Passive viewing/listening activity – picture-matching task

Participants were asked to “keep looking at the screen while the video [was] being played” to enable the eye-tracker to follow their gaze. They were instructed to use the response box to “choose the picture that best correspond[ed] to the video” between two pictures. Upon launch of a trial, the participants saw a short video clip as described in the Task and materials section. This was followed by a blank screen (2,000 ms) upon which two pictures were presented, respectively on the left and right side of the screen. Once a picture was selected, a drift correction was performed to proceed to the next trial. The procedure is illustrated in Figure 2.

Fig. 2. Trial sequence during the passive viewing/listening versus SI activities.

SI activity – picture-matching task

Participants were asked to “start interpreting as soon as possible when the video start[ed]”, so that they would be engaged in simultaneous interpreting by the time the target verb was uttered. They were also instructed to “keep looking at the screen while the video [was] being played” to enable the eye-tracker to follow their gaze. They were asked to use the response box to “choose the picture that best correspond[ed] to the video” between two pictures. Upon launch of a trial, the participants saw a short video clip as described in the Task and materials section. This was followed by a blank screen (5,000 ms, which gave participants time to complete their interpretation) upon which two pictures were presented, respectively on the left and right side of the screen. Once a picture was selected, a drift correction was performed to proceed to the next trial. The procedure is illustrated in Figure 2.

Analysis

The analyses for the three dependent variables, response accuracy, reaction time (RT) and dwell time, were conducted separately and implemented in R (R Core Team, 2013) using the lme4 package (Bates, Mächler, Bolker, & Walker, Reference Bates, Mächler, Bolker and Walker2015).

Practice trials were not included in the analyses. Trials in which participants had not interpreted, only partially interpreted, or had not finished interpreting the stimuli by the onset of the picture-selection task were also excluded from the analysis, which led to the removal of 13.5% of interpreted trials (195 trials, 6.8% of the whole dataset). As one participant had systematically pressed the central button rather than the left or right button in the picture-matching task, the first two blocks of this testing session were excluded (instructions were followed after that).

Accuracy

Accuracy data were analysed using generalised linear mixed models (GLMM). The dataset was trimmed before completing the analyses. Responses above and below 3 SDs from the RT mean were considered outliers, which led to the removal of 2.7% (34 trials) of the data points in the SI activity dataset, and 2% (28 trials) of the passive viewing/listening activity dataset. Overall, 10% (287 trials) of all trials were excluded in the Accuracy data analysesFootnote ⁷.

GLMM analyses were conducted to test the relationship between accuracy and the fixed effects activity (2 levels, passive viewing/listening and simultaneous interpreting) and semantic match between speech and gesture (3 levels, semantically related gesture, semantically unrelated gesture, no gesture). An interaction term was set between activity type and semantic match. Subjects and items were entered as random effects with by-subject and by-item random intercepts as this was the maximal random structure supported by the data.

Reaction time

Linear mixed-effects model (LMM) analyses were run on RT data. Significance of effects was determined by assessing whether the associated t-statistics had absolute values ≥ 2. The dataset was trimmed before completing the analyses using the same approach as for the Accuracy data. Only accurate trials were used for the RT analyses, resulting in the exclusion of another 2.3% (60 trials) of RT data points. Overall, the excluded trials amounted to 12% (347 trials) of all trialsFootnote ⁸.

RTs were log-transformed and analysed using a LMM with the same fixed-effects structure as the GLMM. Subjects and items were entered as random effects with by-subject and by-item random intercepts.

Dwell time

LMM analyses were run on dwell time data. The same values as for RT data were used to determine significance of effects. Dwell time analyses were only performed on the two conditions that contained any gestures (66.6% of the data). Only accurate trials were used, resulting in the exclusion of 2.9% (52 trials) of the total data points. Two areas of interest were created, one comprising the speaker's head, the other one including gesture space, from the speaker's shoulders to her hips. Dwell time (in ms) was measured in each of the areas of interest during the gesture stroke. We tested the relationship between accuracy and the fixed effects activity (2 levels, passive viewing/listening and SI) and semantic match between speech and gesture (2 levels, semantically related gesture versus semantically unrelated gesture). An interaction term was set between activity type and semantic match. Subjects and items were entered as random effects with by-subject and by-item random intercepts as this was the maximal random structure supported by the data.

Results

Accuracy

Accuracy scores (Table 2A) were close to ceiling in both activities and all conditions.

Table 2. (A) Mean response-accuracy percentages. (B) Mean RT in ms.

The no-gesture condition and the passive viewing/listening activity were set as baselines. The interaction between activity type and gesture condition was not significant (β = −0.57, SE = 0.75, Z = −0.76, p = .45), indicating that the two fixed effects did not interact to affect accuracy. The result of the likelihood-ratio test used to compare the full to the reduced model (without interaction term) was also not significant (χ² (2) = 2.83, p = .24), confirming this result. The reduced model, which thus provided a better fit to the data, revealed that accuracy was not affected by activity type or gesture condition individually (SI: β = 0.03, SE = 0.28, Z = 0.10, p = .92; semantically unrelated gesture: β = −0.34, SE = 0.34, Z = −0.99, p = .32; semantically related gesture: β = −0.11, SE = 0.35, Z = −0.32, p = .75).

Setting the semantically unrelated gesture condition as baseline to explore potential effects of semantically related as compared to semantically unrelated gestures, using the relevel function in R, the interaction between activity type and gesture condition was not significant either (β = 0.61, SE = 0.66, Z = 0.93, p = .35), indicating that the two fixed effects did not interact to affect accuracy. The reduced model revealed that accuracy was also not affected by gesture condition individually (no-gesture: β = 0.34, SE = 0.34, Z = 0.99, p = .32; semantically related gesture: β = 0.23, SE = 0.33, Z = 0.69, p = .49).

Reaction time

Mean reaction times are presented in Table 2B.

The no-gesture condition and the passive viewing/listening activity were set as baselines. The interaction between activity type and gesture condition was not significant (β = 0.03, SE = 0.04, t = 0.92), indicating that the two variables did not interact to affect RTs. The result of the likelihood-ratio test used to compare the full to the reduced model (without interaction term) was also not significant (χ² (2) = 1.76, p = .41), confirming this result. The output of the reduced model revealed that RTs were not affected by activity type (β = −0.01, SE = 0.01, t = −0.34) or gesture condition (semantically unrelated gesture: β = 0.01, SE = 0.02, t = 0.82; semantically related gesture: β = −0.03, SE = 0.02, t = −1.52).

Setting the semantically unrelated gesture condition as baseline to explore potential effects of semantically related as compared to semantically unrelated gestures, the interaction between activity type and gesture condition was not significant either (β = 0.05, SE = 0.04, t = 1.29), indicating that the two fixed effects did not interact to affect RT. However, the reduced model revealed that RT was affected by gesture condition individually with semantically related gestures significantly affecting RTs (no gesture: β = −0.01, SE = 0.02, t = −0.82; semantically related gesture: β = −0.04, SE = 0.02, t = −2.34).

Dwell time

Visual attention to gesture was low (see Figure 3). This is in line with the literature: the speaker's face dominates as a kind of “default location” and addressees look directly at very few gestures (Gullberg & Holmqvist, Reference Gullberg and Holmqvist2006; Gullberg & Kita, Reference Gullberg and Kita2009). This is also in line with what we know of interpreters’ preference for the speaker's face during SI (Seeber, Reference Seeber2011).

Fig. 3. Dwell time on gesture space in ms according to gesture condition and activity type (SI = simultaneous interpreting, viewing = passive viewing/listening, unrelated = semantically unrelated gesture, related = semantically related gesture).

The passive viewing/listening activity and the control gesture condition were set as baselines. The interaction between activity type and gesture condition was significant (β = −27.13, SE = 12.09, t = −2.24), indicating that activity type and gesture condition interacted to affect dwell time on the speaker's gesture space. The result of the likelihood-ratio test used to compare the full to the reduced model was also significant (χ² (1) = 5.03, p = .03), indicating that the full model was a better fit to the data than the reduced model. The model output indicated that dwell time was significantly affected both by activity type (SI: β = −46.43, SE = 6.13, t = −7.58) and gesture condition (semantically related gesture: β = 32.98, SE = 6.14, t = 5.37).

Discussion

Accuracy was not affected by activity type, gesture condition or any interaction thereof. Thus, it appears that neither semantically related gestures nor semantically unrelated gestures had an effect on interpreters’ accuracy in either activity. That said, 13.5% of the interpreted trials had to be excluded from analysis since interpretations either had not been completed, were incomplete, or had not been completed by the time pictures were displayed. Stimuli that took interpreters more time to interpret or generated incomplete interpretations may have been associated with more difficulty. This might have caused higher error rates in the picture-matching task. Since the audio recordings stopped when the pictures were presented, however, a post-hoc analysis of these trials is impossible.

Activity type did not affect RTs. Nor was there any evidence that semantically related speech-gesture pairs made interpreters faster (in either activity) compared to utterances without gestures. When interpreters were presented with either semantically related or semantically unrelated gestures, however, semantically related gestures were associated with faster RTs (in either activity) than semantically unrelated gestures. This RT difference suggests that interpreters integrated gestures and that language comprehension was sensitive to gestures’ semantic relationship to the spoken utterance. It also raises the question of whether semantically related speech-gesture pairs accelerated comprehension or whether semantically unrelated speech-gesture pairs slowed it down compared to the baseline. Collapsed across activities, mean RTs were fastest in the semantically related gesture condition (M = 1,420 ms, SD = 605), slower in the no-gesture condition (M = 1,472 ms, SD = 686), and slightly slower still in the semantically unrelated gesture condition (M = 1,484 ms, SD = 651). Mean RTs in the no-gesture condition and the semantically unrelated gesture condition were very similar, and the difference between the semantically related gesture condition and the no-gesture condition approached significance in the LMM. This rather points to an acceleration effect of semantically related gestures than to a slow-down effect of semantically unrelated gestures compared to the baseline.

The dwell time measure points in the same direction. Interpreters attended to semantically related gestures significantly longer than to semantically unrelated gestures in both activities, which suggests that interpreters’ visual attention patterns, too, are sensitive to the semantic relationship between gesture and speech. Therefore, the gestures did not simply attract participants’ attention irrespective of their relevance in the utterance (as in Rayner, Reference Rayner1998).

Interpreters attended to gestures significantly longer during the passive viewing/listening activity than during SI, and dwell time was highest when interpreters attended to semantically related gestures during passive viewing/listening. This may reflect task demands in SI. However, they did attend longer to semantically related than to semantically unrelated gestures in this activity, too, which suggests that interpreters’ preference for the speaker's face during SI did not prevent them from attending to and integrating gestures, taking into account their semantic relationship with the utterance.

The experiment did not bring to light any language comprehension differences between passive viewing/listening and SI in terms of accuracy and reaction time: therefore, engaging in SI did not modulate language comprehension. However, interpreters might have honed their cognitive abilities due to their experience of SI, and interpreting experience may have had an effect on the interpreters’ behaviour. Other bilinguals without interpreting experience might behave differently from the tested interpreters, since interpreting expertise has been shown to positively influence cognitive performance, e.g., dual-task performance (Strobach et al., Reference Strobach, Becker, Schubert and Kühn2015), and cognitive flexibility (Yudes et al., Reference Yudes, Macizo and Bajo2011). To investigate this, a second experiment compared interpreters to bilinguals without interpreting experience.