1. Introduction
A spoken conversation can be operationalized as a highly interactive form of cooperative activity between at least two individuals. In that sense, it is more than an exact data transfer process, whereby a sender simply transmits information to a receiver, who then decodes the incoming message. The latter characterization of a spoken interaction does not do justice to the observation that an addressee is often more than a passive listener and is, in fact, co-responsible for a successful exchange of information (Clark, Reference Clark1996). Indeed, communication via speech can sometimes be a fuzzy endeavor, for example, because of a noisy channel or the fact that a speaker may not correctly estimate a listener’s prior knowledge about a specific state of affairs. As a result, it is typically the case that speakers and addressees seek and provide feedback on the smoothness of the interaction, to check whether information has successfully arrived at the other end of the communication chain. Accordingly, there is a growing interest in current models of spoken interaction regarding the systematicity of various types of feedback behavior.
In this article, we are specifically interested in the brief responses, called backchannels (Yngve, Reference Yngve1970), that addressees return during an interaction. Such backchannels, which can be verbal and non-verbal, serve as cues to show a speaker that an addressee is engaged and listening. Backchannels thus convey attention and interest to the speaker, and they can also regulate turn-taking (Gravano & Hirschberg, Reference Gravano and Hirschberg2011). While verbal backchannels include vocalizations (laugh, sigh, etc.), paraverbals (‘mm-hmm’, ‘uh-huh’, etc.) and short utterances (‘really’, ‘yeah’, ‘okay’), non-verbal backchannels consist of facial expressions, nodding, eye gaze and gestures. It has been shown that there is a marked difference between signals that serve as ‘go-on’ cues, that is, to make clear that the addressee has correctly processed the incoming message, and signals that highlight a possible communication problem so that a speaker–sender may have to repair a potential error (Granström et al., Reference Granström, House and Swerts2002; Krahmer et al., Reference Krahmer, Swerts, Theune and Weegels2002; Shimojima et al., Reference Shimojima, Katagiri, Koiso and Swerts2002).
In the literature, backchannels are distinguished from turn-taking cues. The intention of an addressee, when backchanneling, is to signal that the current speaker is still in charge of the turn, whereas a turn-taking cue is intended to interrupt the speaker and to take the speaking turn. Thus, backchannels can be viewed as a form of cooperative overlap or, from a turn-taking perspective, as a turn-yielding cue (Bertrand et al., Reference Bertrand, Ferré, Blache, Espesser and Rauzy2007).
1.1. Backchannel-inviting cues
It has been shown that the timing of backchannels is crucial to guarantee a smooth interaction (Gratch et al., Reference Gratch, Okhmatovskaia, Lamothe, Marsella, Morales, van der Werf and Morency2006; Poppe et al., Reference Poppe, Truong and Heylen2011). For instance, Gratch et al. (Reference Gratch, Okhmatovskaia, Lamothe, Marsella, Morales, van der Werf and Morency2006) demonstrated that a wrongly timed head nod from a listener can disrupt a speaker, which suggests that addressees typically are efficient at producing backchannels at the right points in an interaction. Indeed, research shows that backchannels occur at specific points in a conversation, for example, after the speaker gives a so-called backchannel-inviting cue (Gravano & Hirschberg, Reference Gravano and Hirschberg2011), also called backchannel-preceding cues (Levitan et al., Reference Levitan, Gravano and Hirschberg2011).
The specific behaviors that the speaker produces to transmit backchannel-inviting cues to elicit backchannel behavior from an addressee come in different forms, including the usage of specific prosodic patterns. Gravano and Hirschberg (Reference Gravano and Hirschberg2009) found that speakers use rising and falling intonations to elicit feedback. Similarly, Cathcart et al. (Reference Cathcart, Carletta and Klein2003) and Ward and Tsukahara (Reference Ward and Tsukahara2000) showed that listeners often provide a backchannel after speakers have lowered their pitch for at least 110 ms, and Cathcart et al. (Reference Cathcart, Carletta and Klein2003) showed that pauses in the speaker’s speech and also certain parts of speech are predictive of backchannels. Furthermore, Duncan (Reference Duncan1972) observed that backchannels occur after syntactically complete sentences, while Bavelas et al. (Reference Bavelas, Coates and Johnson2002) revealed that mutual gaze often occurs prior to a backchannel being produced. In line with this, Hjalmarsson and Oertel (Reference Hjalmarsson and Oertel2012) found that listeners were more likely to identify a backchannel-inviting cue when the speaker (an embodied conversational agent (ECA) in this case) made direct eye contact with the camera, as opposed to gazing away.
The probability that a listener will backchannel after a cue increases when backchannel-inviting cues are combined into more complex signals (Gravano & Hirschberg, Reference Gravano and Hirschberg2011). In a similar vein, Hjalmarsson (Reference Hjalmarsson2011) showed, for turn-taking and turn-yielding signals (which are closely related to, yet distinct from, backchannel-inviting cues), that the more cues a signal comprises, the faster the interlocutor’s reaction time becomes. Speakers may not be aware of sending out backchannel-inviting cues, but listeners and observers are capable of picking up on those signals. Bavelas et al. (Reference Bavelas, Coates and Johnson2000) showed that listeners are even able to provide backchannels at the right moment when not attending to the content of the speech.
1.2. Backchannel opportunity points
Although speakers provide backchannel-inviting cues, it is up to the addressee to pick up on these cues and identify relevant moments in a conversation to produce backchannels. Those moments in a conversation, where it is appropriate for an addressee to provide some kind of listener feedback, are referred to as backchannel opportunity points (BOPs) (Gratch et al., Reference Gratch, Okhmatovskaia, Lamothe, Marsella, Morales, van der Werf and Morency2006). BOPs, which are also known as jump-in points (Morency et al., Reference Morency, De Kok and Gratch2008) and response opportunities (de Kok, Reference de Kok2013), are points in the interaction where an addressee could or would want to provide feedback in reaction to the speaker (de Kok & Heylen, Reference de Kok and Heylen2010). Prior studies show that not all BOPs are used by addressees to provide a backchannel (Kawahara et al., Reference Kawahara, Yamaguchi, Inoue, Takanashi and Ward2016; Poppe et al., Reference Poppe, Truong and Heylen2011). However, we lack detailed insight into the extent to which addressees vary in the way they return feedback, and into the role that different types of BOPs play in this.
1.3. Current work
The goal of this study is to shed light on the variation that exists in backchannel behaviors across addressees and within an individual addressee. Specifically, we ask the following: (1) What types of behaviors are utilized by addressees to give feedback during BOPs? (2) How does feedback behavior differ across different addressees? (3) To what extent does the behavior of addressees differ for the same BOP?
The fact that we expect there to be variability between and within addressees in their feedback behavior is in line with previous findings that human beings do not have a fixed communication style. Speakers have been shown to adapt their way of speaking depending on the situational context, such as the type of addressee or the specific environment. Typically, speakers talk differently to children than to adults and switch to a different style when they notice that their partner has trouble understanding them (e.g., because that person is not a native speaker) (Bortfeld & Brennan, Reference Bortfeld and Brennan1997). Along the same lines, there may be differences across addressees, for example, depending on personality traits or the mere fact that some addressees have more developed communicative skills (Williams et al., Reference Williams, Wharton and Jagoe2021). Addressees may therefore vary in how they produce backchannel behaviors, with some spots in the interaction eliciting stronger or more backchannels than others (e.g., because such a cue is felt to be more needed there). Likewise, some addressees may be more extraverted or engaged than others, so that differences across addressees can be expected as well.
Furthermore, the characteristics of a BOP can influence the type of behavior it elicits. A BOP placed at the end of a syntactically complete phrase is more likely to be seized than a BOP at the end of a syntactically incomplete phrase (Skantze et al., Reference Skantze, Hjalmarsson and Oertel2013). The dynamics of the interaction could also play a role. Benus et al. (Reference Benus, Gravano and Hirschberg2007) showed that the liveliness of an interaction may influence the type of verbal backchannels a participant uses. In their study, mm-hm and uh-huh were used more during lively interactions, while okay and yeah were used more during less animated interactions. Orthogonal to this, the reason why not every BOP is seized could also be due to idiosyncratic differences between listeners. Huang and Gratch (Reference Huang and Gratch2012) examined the personalities of backchannel coders and explored the connection between these personalities and the frequency of identified BOPs. The results revealed a positive association between a higher number of identified BOPs and elevated levels of agreeableness, conscientiousness and openness. This is in line with the results of an earlier study that showed that different types of backchannel behavior correlate with various impressions of people’s specific personalities (Blomsma et al., Reference Blomsma, Skantze and Swerts2022).
Insight into the variability of audiovisual backchannel behavior is not only informative to understand how human–human communication proceeds, but it is also relevant for practical applications, such as models of human–computer interaction, specifically social robots and ECAs (Cassell et al., Reference Cassell, Sullivan, Prevost and Churchill2000), also known as socially interactive agents (SIAs) (Lugrin et al., Reference Lugrin, Pelachaud and Traum2021). In a similar manner to human–human interaction, it could be useful for ECAs to vary in the extent to which they backchannel, for example, depending on the type of user, context and application. It is also likely that inducing variability may render the interaction style of an ECA more natural and less monotonous, similar to the efforts to synthesize variability in speech and language generation systems (Gatt & Krahmer, Reference Gatt and Krahmer2018). However, modeling natural backchannel behavior for artificial entities is a non-trivial task for at least two reasons. One of the difficulties lies in detecting and appropriately responding to backchannel-inviting cues. Another difficulty is that due to backchannel behavior being idiosyncratic, it is not easy to define what a typical backchannel behavior should consist of for an ECA.
To investigate variation in backchannel behaviors and to answer the research questions above, we conducted a computational study based on the data collected in a human experiment that used the so-called O-Cam paradigm (Goodacre & Zadro, Reference Goodacre and Zadro2010). The current study is the first one in which the paradigm is used to examine backchannel behavior. The O-Cam paradigm was set up to allow comparisons between multiple addressees who are exposed to identical conversational data from the same speaker stimulus. The computational study consisted of two analyses. Analysis I examines the speaker stimulus, specifically the identification of BOPs, the categorization of those BOPs and the prosodic properties of the backchannel-inviting cues preceding the BOPs. Analysis II investigates the addressee’s behavior during the BOPs. We compared the behavior of the addressees across multiple channels (i.e., facial expressions, head movement and vocalizations) to examine the degree of variability between and within addressees.
2. Dataset
This study employed the materials of a database previously recorded during an experiment conducted by Brugel (Reference Brugel2014). The database consisted of (1) one video recording of the stimulus, henceforth ‘speaker’, and (2) the video recordings of 14 participants who were filmed during the experiment, henceforth ‘addressees’. Each video was 8.42 minutes long and contained 6.25 minutes of conversation, and the remaining time was used for game-related tasks such as preparing and answering questions (see explanation below). The number of participants is comparable to similar backchannel studies, including Krogsager et al. (Reference Krogsager, Segato and Rehm2014) and Poppe et al. (Reference Poppe, Truong, Reidsma and Heylen2010).
The recorded experiment was based on the O-Cam paradigm (Goodacre & Zadro, Reference Goodacre and Zadro2010), an experimental design that combines the advantages of online paradigms (i.e., highly controllable environment, easy to run) with the advantages of offline settings (i.e., high ecological validity). The core concept of the O-Cam paradigm is that a participant thinks that he/she is having a computer-mediated conversation with another participant (i.e., an interaction via a video conferencing setting), while, in reality, the other participant is a confederate whose video is pre-recorded. Certain manipulations are used in the setup to make a participant think it is a real-life conversation (Goodacre & Zadro, Reference Goodacre and Zadro2010). The O-Cam paradigm has been previously utilized to, for example, study the relationship between gender and leadership capabilities (Hong et al., Reference Hong, Schaafsma, van der Wijst and Plaat2014) and investigate the influence of smiling behavior (Mui et al., Reference Mui, Goudbeek, Roex, Spierts and Swerts2018).
The experiment reported by Brugel (Reference Brugel2014) aimed to elicit feedback behavior from the participants. Each addressee played a Tangram game with the speaker (who was a pre-recorded confederate) via a computer-mediated connection. During the experiment, the addressee was presented with four Tangram figures for 5 seconds, followed by a description of one of those Tangrams provided by the speaker. The participant’s task was to choose, from the four Tangram figures, the one that matched the speaker’s description. See Figure 1 for a visual illustration of the experiment. The experiment consisted of 11 rounds, each using a different quadruple of Tangram figures. The participants were told that the experiment was related to abstract thinking and that they were not allowed to ask questions since asking questions would make the game too simple. The confederate (the speaker) was not informed about the goal of the study in order to keep the experiment as ecologically valid as possible. Although task success was not measured, the game was designed to be challenging while still allowing a task success rate close to 100%. This was intended to ensure that participants would fully concentrate on the speaker without feeling the need to ask additional questions for clarification, which would have been disruptive to the experimental setting, as the participant would then notice that the recorded confederate was not responding to his/her questions. After the experiment, participants were asked whether they suspected that, instead of a live interaction, they were presented with a pre-recorded video of another person. The data of five participants were discarded because they answered positively. In addition, one participant asked a question during the experiment, and his/her data were likewise discarded.
3. Analysis I: speaker’s behavior
The first analysis considers only the speaker’s behavior, with the aim of identifying the BOPs and analyzing the speaker’s audiovisual behavior during the backchannel-inviting cues preceding them. The identified BOPs are subsequently used in Analysis II to investigate the addressees’ feedback behavior. An obvious approach to identify the BOPs would be to annotate the backchannel behavior for each of the addressee videos separately. However, such an approach comes with at least two disadvantages. As addressees do not necessarily utilize all BOPs to provide feedback, analyzing the addressees would not necessarily result in the identification of all BOPs. Furthermore, using the same data for selection and selective analysis would result in a circular analysis, also known as ‘double dipping’ (Kriegeskorte et al., Reference Kriegeskorte, Simmons, Bellgowan and Baker2009). Therefore, we identified the BOPs based on the speaker stimulus.
3.1. Methods
3.1.1. BOP identification
We used parasocial consensus sampling (Heldner et al., Reference Heldner, Hjalmarsson and Edlund2013; Huang et al., Reference Huang, Morency and Gratch2010), which takes advantage of the fact that humans, especially as third-party observers, can aptly point out BOPs in a conversation (de Kok, Reference de Kok2013). The approach consisted of two steps: identification of possible BOPs by a jury of multiple judges, followed by the aggregation of the output of the jury to determine genuine BOPs. Genuine BOPs are those BOPs that are identified by at least a certain percentage of judges.
For the identification of BOPs, we used a human jury that consisted of 10 judges. Each judge watched the speaker video and identified each moment that he/she thought was appropriate to backchannel. Each judge was instructed in the same way. First, it was explained to them what backchanneling behavior is, namely the listening signals one gives during a conversation, such as head nods, sounds like ‘uh-uh’, ‘hmm’ and ‘hm-hm’, and combinations of nods and sounds. Next, they were asked to watch the speaker video and to make a sound (e.g., ‘yes’) whenever they thought it was appropriate to backchannel, either verbally, non-verbally or both. The audio of the judge was recorded.
The aggregation of all the recordings of judges allowed us to determine, for each data point in the stimulus, the percentage of judges that thought that a specific moment was a BOP. BOPs that were agreed upon by a minimum percentage of judges were classified as genuine BOPs and selected for further analysis.
The minimum percentage is based on the expected number of backchannels in the recording. Poppe et al. (Reference Poppe, Truong and Heylen2011) state that one could expect from 6 to 12 backchannels per minute. Since our recording was 6.25 minutes, we therefore expected between 38 and 77 backchannels. The appropriate consensus level is determined as follows. First, the number of BOPs is calculated for each potential consensus level, that is, the number of BOPs that would be marked as genuine BOPs if that consensus level were used. Next, the final consensus level is selected based on the resulting BOP count. In this case, the BOP count should fall within the range of 38 and 77. In general, the relationship between consensus levels and the number of BOPs can be seen as a monotonic non-increasing function: When the consensus level increases, the number of genuine BOPs either decreases or stays constant; it never increases.
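As an illustration, the following is a minimal sketch (not the authors’ code) of how such a consensus level can be chosen. It assumes a per-frame array `consensus` holding the number of judges that marked a BOP at each frame (25 FPS, see the next paragraphs); the function names and the strategy of returning the lowest level whose BOP count falls in the expected range are our own assumptions.

```python
import numpy as np

def count_bops(consensus: np.ndarray, min_judges: int) -> int:
    """Count contiguous runs of frames where at least `min_judges` judges agree."""
    above = (consensus >= min_judges).astype(int)
    # Each run (i.e., one candidate BOP) starts where `above` flips from 0 to 1.
    return int(np.sum(np.diff(above, prepend=0) == 1))

def pick_consensus_level(consensus: np.ndarray, n_judges: int = 10,
                         expected=(38, 77)) -> int:
    """Return the lowest number of judges whose genuine-BOP count is in range."""
    for min_judges in range(1, n_judges + 1):
        n = count_bops(consensus, min_judges)
        if expected[0] <= n <= expected[1]:
            return min_judges
    raise ValueError("no consensus level yields a BOP count in the expected range")
```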
All the recordings of judges were preprocessed with Audacity (Audacity Team, 2021): We used a noise gate filter (250 ms attack and −12.50 dB gate threshold) to remove background noise and a 20 dB audio amplification to ensure that a judge was audible. Each recording was then converted to a binary time series with a resolution of 25 frames per second (FPS), in such a way that frames that contained a sound with an amplitude above 0.1 were converted to 1 and, otherwise, to 0. Although Huang et al. (Reference Huang, Morency and Gratch2010) used a resolution of 10 FPS, we decided to use 25 FPS as this matched the FPS of both our video recording and the FaceReader encodings (as described in the subsequent section).
Because judges had to vocally indicate visual backchannels, which start on average 202 ms before a vocal backchannel (Wlodarczak et al., Reference Wlodarczak, Buschmeier, Malisz, Kopp and Wagner2012), the onset of each indication was set to 202 ms before the actual onset in order to correct for a potential delay. Each onset of a judge’s indication was converted to a potential BOP of the length of 1000 ms in line with Huang et al. (Reference Huang, Morency and Gratch2010). Finally, a time series was created with a resolution of 25 FPS, where each frame (i.e., sample) contained the number of judges that indicated a BOP for that frame.
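The following sketch illustrates, under stated assumptions, how each judge’s recording can be turned into such a time series and aggregated into a per-frame consensus count. It assumes that `frame_amplitudes` already contains the maximum absolute amplitude per 25 FPS frame of a noise-gated, amplified recording; file handling and the Audacity preprocessing are omitted, and all names are hypothetical.

```python
import numpy as np

FPS = 25
ONSET_SHIFT = round(0.202 * FPS)   # shift indications 202 ms earlier
BOP_WINDOW = int(1.0 * FPS)        # each indication spans 1000 ms

def judge_series(frame_amplitudes: np.ndarray, threshold: float = 0.1) -> np.ndarray:
    """Binary series: 1 for frames covered by a judge's BOP indication."""
    voiced = (frame_amplitudes > threshold).astype(int)
    # Onsets are frames where the binary signal switches from 0 to 1.
    onsets = np.flatnonzero(np.diff(voiced, prepend=0) == 1)
    series = np.zeros_like(voiced)
    for onset in onsets:
        start = max(onset - ONSET_SHIFT, 0)       # correct for visual backchannel lead
        series[start:start + BOP_WINDOW] = 1      # expand to a 1000 ms window
    return series

def consensus_series(all_judges: list[np.ndarray]) -> np.ndarray:
    """Per-frame count of judges that indicated a BOP."""
    return np.sum([judge_series(a) for a in all_judges], axis=0)
```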
3.1.2. BOP types: continuer and end-of-turn
To gain further insight into whether specific BOPs or BOP types affect the average addressee’s behavior, we subdivided the BOPs into two categories. Although each BOP functions as a moment for the addressee to acknowledge certain information, we conjecture that the urge to acknowledge is strongest at the end of each game round. After all, no further information will follow the last BOP of a game round, so the addressee should have enough information to answer the question at that point; if not, the addressee should indicate this at that last BOP. Therefore, we expect the most expressive addressee behaviors to be observable at the last BOP of a round. Hence, we created the following categories: (1) all BOPs that are the last of a round, which we call the last backchannel of round (LBR) category, and (2) all other BOPs, which occur during a round and which we call the continuer category. Given this categorization, the LBR category contained 11 cues and the continuer category contained 42 cues.
3.1.3. Backchannel-inviting cues
To verify that the (visual) prosody of backchannel-inviting cues indeed differs from the prosody used during the remaining part of the conversation, we analyzed the pitch properties, facial behavior and head movement of the speaker’s backchannel-inviting cues that preceded the identified BOPs. The cues were isolated by selecting the last 1000 ms of the speaker stimulus sound before the start of each BOP. There is no consensus in the literature on the length of such samples; for example, Skantze (Reference Skantze2012) analyzed the last 200 ms of the voiced region for pitch, while Levitan et al. (Reference Levitan, Gravano and Hirschberg2011) reported longer sample lengths including 1000 ms. We chose 1000 ms to be on the safe side of finding a voiced part in the sample.
The pitch properties were extracted with Praat (Boersma & Weenink, Reference Boersma and Weenink2022). For each sample, the F0 values (i.e., the fundamental frequency values) were extracted with a precision of 100 FPS. Trailing and leading frames that did not contain pitch information were discarded. For each sample, the average, minimum, maximum, amplitude (i.e., the maximum minus the minimum) and form were obtained. The form was calculated by subtracting the average pitch of the second half of the sample from the average pitch of the first half of the sample, such that a negative number for form means an increasing pitch and a positive number means a decreasing pitch.
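For concreteness, the sketch below computes these per-sample features from an F0 contour. It assumes that `f0` is a 100 FPS array of F0 values in Hertz for one 1000 ms cue sample (e.g., as extracted with Praat), with unvoiced frames set to NaN; the function name is hypothetical.

```python
import numpy as np

def pitch_features(f0: np.ndarray) -> dict:
    """Per-sample pitch features: mean, min, max, amplitude and form."""
    voiced = np.flatnonzero(~np.isnan(f0))
    f0 = f0[voiced[0]:voiced[-1] + 1]          # drop leading/trailing unvoiced frames
    half = len(f0) // 2
    return {
        "mean": np.nanmean(f0),
        "min": np.nanmin(f0),
        "max": np.nanmax(f0),
        "amplitude": np.nanmax(f0) - np.nanmin(f0),       # maximum minus minimum
        # first-half mean minus second-half mean:
        # negative = rising pitch, positive = falling pitch
        "form": np.nanmean(f0[:half]) - np.nanmean(f0[half:]),
    }
```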
The facial behavior and head movements were analyzed based on the output of FaceReader 8 software (Noldus, 2019). The stimulus video was encoded with action units (AUs) based on the Facial Action Coding System (Ekman & Friesen, Reference Ekman and Friesen1978). Every frame of the videos was encoded with the following AUs: 1, 2, 4, 5, 6, 7, 9, 10, 12, 14, 15, 17, 18, 20, 23, 24, 25, 26, 27 and 45, and X, Y and Z coordinates were extracted for head orientation. Each AU can be scored for intensity on an ordinal scale from 0 (i.e., absence of an AU) to 5 (i.e., maximum intensity). For some frames in the dataset, FaceReader was unable to detect a face and thus was also unable to encode head position and/or AU activations.
Head nods were quantified for all backchannel-inviting cues following Otsuka and Tsumori (Reference Otsuka and Tsumori2020). Specifically, for head nods, we extracted amplitude and frequency. Amplitude equals the maximum tilt angle, that is, the difference between the minimum and maximum X rotation angles. Frequency is the sum of upward and downward peaks per second of the X rotation angle. To prevent small noise-related changes in elevation from influencing the frequency measure, we ignored upward and downward peaks that differed by at most 1 degree. In order to verify whether the backchannel-inviting cues differed from non-backchannel-inviting cues, each backchannel-inviting cue was paired with a randomly selected voice sample from the speaker stimulus. Paired t-tests were conducted between the obtained pitch properties, head movements and the average AU activation of the backchannel-inviting cues and the non-backchannel-inviting cues. The Bonferroni correction was applied for the multiple pairwise comparisons. Subsequently, the analyzed properties of the backchannel-inviting cues of the LBR category were compared with those of the continuer category. The two categories were compared with Welch’s t-test, with the Bonferroni correction applied as well.
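A minimal sketch of these two nod measures is given below, following our reading of the procedure described above (not Otsuka and Tsumori’s original code). It assumes `x_angle` is the head’s X rotation angle in degrees per video frame at 25 FPS; the 1-degree hysteresis used to discard noise-related peaks is our interpretation, and the function name is hypothetical.

```python
import numpy as np

def nod_measures(x_angle: np.ndarray, fps: int = 25, min_change: float = 1.0):
    """Nod amplitude (maximum tilt range) and frequency (peaks per second)."""
    amplitude = float(np.max(x_angle) - np.min(x_angle))

    # Count direction reversals of the X rotation angle (each reversal marks an
    # upward or downward peak), ignoring reversals whose excursion since the
    # last counted extremum is at most `min_change` degrees.
    peaks = 0
    last_extreme = x_angle[0]
    direction = 0                       # +1 while rising, -1 while falling
    for angle in x_angle[1:]:
        if direction == 0:
            # establish the initial direction once the angle has moved enough
            if abs(angle - last_extreme) > min_change:
                direction = 1 if angle > last_extreme else -1
                last_extreme = angle
        elif direction > 0 and angle < last_extreme - min_change:
            peaks += 1                  # an upward peak (local maximum) was passed
            direction, last_extreme = -1, angle
        elif direction < 0 and angle > last_extreme + min_change:
            peaks += 1                  # a downward peak (local minimum) was passed
            direction, last_extreme = 1, angle
        elif direction > 0:
            last_extreme = max(last_extreme, angle)
        else:
            last_extreme = min(last_extreme, angle)

    frequency = peaks / (len(x_angle) / fps)
    return amplitude, frequency
```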
3.2. Results
3.2.1. BOP identification
The number of identified backchannels per response level is depicted in Figure 2. Genuine (i.e., definite) BOPs were based on a consensus level of 30% (three judges), such that 53 BOPs were taken into account. The average duration of the 53 genuine BOPs was 934 ms (SD = 403 ms). The duration of a BOP was calculated starting from the initial timepoint with a consensus level of at least 30% and ending at the last timepoint where the consensus level was at least 30%.
3.2.2. Backchannel-inviting cues
The backchannel-inviting cues had a higher maximum pitch and a larger F0 range, compared to the randomly selected samples. There were no significant differences for average pitch, minimum pitch and form. The highest pitch observed in backchannel-inviting cues was on average 350.36 Hz (SD = 106.94 Hz), while the highest pitch in the random samples had a lower average of 201.30 Hz (SD = 70.88 Hz). The F0 range for backchannel-inviting cues was on average 156.07 Hz (SD = 111.84 Hz), while the random samples had a lower average F0 range of 102.34 Hz (SD = 71.77 Hz). See Table 1 for all the results. The speaker’s head movements and facial behavior did not differ significantly between cues and non-cues, nor between LBR- and continuer-related inviting cues (see Tables 2 and 3). For all comparisons, the Bonferroni correction was applied.
Note: Statistics are based on paired t-test analysis. All values are in Hertz. The Diff score is the result of subtracting the mean cue value from the mean random value. The Bonferroni correction was applied for the multiple pairwise comparisons with an alpha level of 0.01 (0.05/5 = 0.01). *p < .05, **p < .01, ***p < .001.
Note: Statistics are based on paired t-test analysis. The Diff score is the result of subtracting the mean BOP value from the mean non-BOP value. No significant results were found in this analysis.
Note: Statistics are based on paired t-test analysis. The Diff score is the result of subtracting the mean BOP value from the mean non-BOP value. No significant results were found in this analysis.
The backchannel-inviting cues that preceded BOPs from the LBR category had a significantly lower average pitch than the cues that preceded the continuer category. The form also differed markedly: LBR cues had a downward-going pitch on average, while the other cues had an upward-going pitch on average. There were no significant differences for minimum, maximum and amplitude. For an overview of the results, see Table 4.
Note: Statistics are based on Welch’s t-test analysis. All values are in Hertz. The Bonferroni correction was applied for the multiple pairwise comparisons with an alpha level of 0.01 (0.05/5 = 0.01). *p < .05, **p < .01, ***p < .001.
4. Analysis II: addressee’s behavior
In the following subsection, we first compare audiovisual feedback behavior at BOP and non-BOP spots in the spoken messages. Then, we focus on BOPs only to see to what extent we can observe variability in audiovisual feedback behavior within and between addressees.
4.1. Methods
4.1.1. Semi-automated measures of audiovisual behavior
The videos from the addressees were all encoded for facial expressions, head movements and vocal backchannels as follows. The head movements and facial behavior were analyzed analogously to the backchannel-inviting cues (see Section 3.1.3). The vocal backchannels of the addressee videos were manually encoded by one coder with ELAN 6.0 encoding software (Wittenburg et al., Reference Wittenburg, Brugman, Russel, Klassmann and Sloetjes2006). The coder indicated the moments that an addressee made a sound and its duration. The vocal backchannels were quantified for all identified BOPs as follows: If an addressee made a sound during a BOP, the BOP was coded as 1 for that addressee and otherwise as 0.
4.1.2. Comparisons of audiovisual behavior in BOPs versus non-BOPs
To understand whether the behavior of addressees differed between the BOPs and the rest of the conversation, we paired each BOP with a random non-BOP of the same length. A non-BOP is a moment in the conversation for which none of the judges thought it was a BOP. We compared the behavior of all addressees for a specific BOP with the behavior exhibited at the paired non-BOP. Paired t-tests were carried out over all encoded channels. Pairs that contained frames that FaceReader was unable to encode were discarded. To determine how backchannel behavior differs across addressees, we calculated the average behavior per addressee and reported the average behavior across all addressees. The Bonferroni correction was applied for the multiple pairwise comparisons.
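The sketch below illustrates this comparison, per encoded channel, with a Bonferroni-adjusted alpha. It is not the original analysis code: `bop` and `non_bop` are assumed to be DataFrames with one row per BOP/non-BOP pair and one column per channel (nod amplitude, nod frequency, vocalization, AU intensities), with pairs containing missing FaceReader encodings already dropped.

```python
import pandas as pd
from scipy.stats import ttest_rel

def paired_comparison(bop: pd.DataFrame, non_bop: pd.DataFrame,
                      alpha: float = 0.05) -> pd.DataFrame:
    """Paired t-test per channel, Bonferroni-corrected over all channels."""
    adjusted_alpha = alpha / len(bop.columns)
    rows = []
    for channel in bop.columns:
        result = ttest_rel(bop[channel], non_bop[channel])
        rows.append({
            "channel": channel,
            "diff": (bop[channel] - non_bop[channel]).mean(),
            "t": result.statistic,
            "p": result.pvalue,
            "significant": result.pvalue < adjusted_alpha,
        })
    return pd.DataFrame(rows)
```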
4.1.3. BOP types: continuer and end-of-turn
The differences in behavior between continuer BOPs and LBR BOPs were quantified with Welch’s t-test and corrected with the Bonferroni method.
4.2. Results
Overall, the behaviors during BOPs and non-BOPs differed markedly, except that we did not find any differences regarding the minute facial expressions related to the AUs (see Table 5 and Figures 3–5). Even though the standard deviations for amplitude and frequency were high, there was a significant difference between the head movement of an addressee during a BOP and a non-BOP. On average, the frequency of head movement during a BOP was 3.43 upward/downward peaks per second, 0.68 higher than during a non-BOP. The average amplitude was 5.95 degrees, which was 1.87 degrees higher than during a non-BOP. Vocalizations were produced during 28% of all BOP instances, compared with only 3% of non-BOP instances. The behavior of the facial muscles was generally the same during BOPs and non-BOPs (see Table 5 for an overview) and showed no significant differences.
Note: Statistics are based on paired t-test analysis. The Diff score is the result of subtracting the mean BOP value from the mean non-BOP value. The Bonferroni correction was applied to correct for 23 comparisons. Alpha was set to 0.002 (=0.05/23). *p < .05, **p < .01, ***p < .001.
4.2.1. Variation of backchannel behaviors across addressees
There was substantial variation regarding different behaviors across addressees. Head movement differed among the addressees. Although the mean frequency of head movement was 3.46 upward/downward peaks per second across addressees, the most nodding addressee showed 5.47 upward/downward peaks per second on average, compared to 1.49 upward/downward peaks per second on average for the least nodding addressee. Amplitude was on average 5.97 degrees, with the addressee on the lowest end having an amplitude of 3.34 degrees on average, while the addressee on the highest end showed an amplitude of 9.65 degrees on average. Addressees vocalized during 28% of the BOPs on average, while the least vocal addressee vocalized during only 4% of the BOPs and the most vocal addressee vocalized during 59% of all BOPs. AU activations also varied; for example, the AU with the highest variation (SD = 1.39) was Eyes Closed (AU43), followed by Lip Corner Puller (AU 12) (SD = 0.77). See Table 6 for a complete overview and Figure 6 for a visual inspection.
4.2.2. Variation within addressees
The average addressee’s behavior also differed across the different BOPs. Figure 7 shows the distribution of behavior per BOP. On average, the frequency was 3.42 upward/downward peaks per second across BOPs. However, BOP 35 elicited an average frequency of 1.10 upward/downward peaks per second, while at BOP 51, addressees showed an average frequency of 6.25 upward/downward peaks per second. The amplitude also varied; the mean amplitude across all BOPs was 5.96 degrees, while the minimal average amplitude was 0.65 degrees at BOP 51, and the maximum average amplitude was 14.1 degrees at BOP 11. Some BOPs (e.g., 12, 16, 17) were never vocalized, while other BOPs were vocalized by 93% of the addressees (e.g., BOP 26). The effect of addressee-dependent behavior can be inspected visually in Figure 8. For a full overview of the numbers, see Table 7.
4.2.3. Variation within addressees for different BOP types
The BOPs that are marked as LBR BOPs elicited higher nodding amplitudes from the addressees than the continuer BOPs; furthermore, LBRs led to more vocalizations: on average, addressees vocalized 60% of the time during LBRs, compared to 20% of the time during the remaining BOPs. Nodding frequency did not differ between the two types of BOPs. For all the results, see Table 8.
Note: Statistics are based on Welch’s t-test analysis. The Diff score is the result of subtracting the mean BOP value from the mean last backchannel of round (LBR) value. The Bonferroni correction was applied to correct for three comparisons. Alpha was set to 0.017 (=0.05/3). *p < .05, **p < .01, ***p < .001.
5. Discussion
In this study, we were interested in a computational examination of the variability in backchannel behaviors among addressees. We looked at whether and how behavior varied during BOPs across and within addressees, specifically focusing on head movement, vocalizations and facial expressions produced by 14 addressees in a Tangram game. The game setup used the O-Cam paradigm, meaning that each addressee was exposed to exactly the same behaviors produced by the speaker. We showed that in general head movement and vocalization behavior significantly differed between BOPs and non-BOPs.
Nodding behavior and vocalizations were most pronounced during BOP instances, compared to non-BOP instances. However, it is notable that the amount of facial activity was generally the same during BOPs and non-BOPs, characterized by most AUs being activated at low-intensity levels. These low-intensity levels may be a consequence of the experimental setup, in that the kind of interaction the O-Cam paradigm allowed may not have invited higher AU intensities. However, it is more likely that low facial activity during both BOPs and non-BOPs reflects a general pattern, namely that during natural interactions people rarely produce exaggerated facial expressions (Blomsma et al., Reference Blomsma, Vaitonyte, Alimardani and Louwerse2020).
Further dissection of behavior during BOPs showed that there was person-specific variability. This between-addressee variability indicates that not every addressee demonstrated the same feedback behavior during BOPs. Some individuals were more discrete with their feedback behavior than others. In addition, the analysis indicated BOP-related differences. Some BOPs manifested more expressive behavior on average than others. Thus, in general, the timing of feedback behavior seems to adhere to certain rules. All addressees showed consistently different behavior during the BOPs than outside of the BOPs. However, the exact behavior seemed to be influenced by person-specific and BOP-related variables.
5.1. Variability between addressees
While all addressees nodded and vocalized more during BOPs than outside of them, there was considerable variability in the extent to which they produced nodding and vocalizations during BOPs. Interestingly, the most vocal addressee produced a sound during more than half of the BOPs, a substantial difference from the least vocal addressee, who vocalized about 14 times less often. Likewise, the addressee with the smallest amplitude (addressee 14, with an average amplitude of 2.9) differed substantially from the person with the most pronounced amplitude (addressee 22, with an average amplitude of 7.4).
Given that all addressees were subject to the same experimental paradigm, the most likely source of this variation in backchannel behavior lies in the addressees’ tendencies related to personality characteristics. In other words, while most BOPs were amenable to nods and vocalizations, addressees differed in the manifestation of their listening behaviors. Prior research shows that backchannel behavior can be linked, to some extent, to the personality characteristics of a person as measured through the Big Five traits (Vinciarelli et al., Reference Vinciarelli, Chatziioannou and Esposito2015). In a follow-up experiment, we showed that the type of backchannel behavior indeed influences the personality perception of the listener. Listeners who produced head nods with a larger amplitude were, for example, perceived as more extraverted than listeners whose head nods were smaller (Blomsma et al., Reference Blomsma, Skantze and Swerts2022).
Other factors could include gender and cultural background: research has shown that women tend to backchannel at a higher frequency than men and that backchanneling occurs more frequently in Japanese than in American English (Dixon & Foster, Reference Dixon and Foster1998; Furo, Reference Furo2000; Maltz & Borker, Reference Maltz and Borker2018). Lastly, variability could also be (partly) caused by pure randomness.
In a future experiment, it would be valuable to take into account the characteristics of the addressee, such as personality, gender and cultural background, to identify factors that may play a role in producing the person-specific variability of feedback behavior. In addition, it would be beneficial to extend the length of the experiment to harvest more behavioral data from each addressee, which would also allow us to shed light on potential intrapersonal variability unrelated to BOP or person-specific characteristics. Although it is currently unknown what the time limits of an O-Cam experiment are, we hypothesize that a longer experiment would result in more addressees finding out that the speaker is pre-recorded.
5.2. Variability between BOPs
While nodding and vocalizations characterize spontaneous listening behavior, the high standard deviations regarding nodding behavior (i.e., amplitude and upward/downward peaks per second) suggest that different BOPs lead to differing amounts of nodding. This can be seen in Figure 8.
Regarding the current data, differing nodding patterns per BOP may partially be related to the fact that some Tangrams may have been more difficult to understand than others. That is, if an addressee quickly understood the description of a figure, they may have nodded more energetically than in those instances where they were in doubt and hence nodded in a less pronounced fashion. This insight is related to the early research on non-verbal behavior conducted by Birdwhistell (Reference Birdwhistell1970), who showed that both the frequency of nods and their duration communicate the involvement of an addressee differently. In particular, a single nod of 400 ms in duration acted as a strong affirmation of the speaker’s behavior, while a nod of 800 ms or longer signaled disbelief and even elicited interruptions on the part of the speaker. Overall, this demonstrates that the nature of backchannels varies as the interaction unfolds.

Our research also revealed a difference between behavior shown during the last BOP of a round and BOPs that were located during a round. The last BOP of a round may have acted as a feedback point, but also as a marker of the end of a round. At this BOP, the addressee was signaled that the moment of choosing the correct Tangram was near, and therefore, the function of this BOP was perhaps different from that of the other BOPs. The speaker was asking for a confirmatory signal from the addressee rather than an acknowledging feedback signal. Indeed, the backchannel-inviting cues from the speaker were clearly different when signaling the last BOP of a round, compared to other BOPs. The speaker used a downward inflection when signaling the last BOP of the round, compared to an upward inflection when signaling other BOPs, and used a lower pitch on average. In return, addressees were more expressive during the LBRs, in the sense that they vocalized more often and showed a higher amplitude in their nodding behavior. That backchannel-inviting cues have a lower pitch at the end of a round and a downward inflection is in line with Geluykens and Swerts (Reference Geluykens and Swerts1994), who showed that speakers ‘reserve’ low pitch to mark the end of a turn, while keeping a higher pitch in other cases to prevent the turn from being taken over by the interlocutor.
Given the variability in audiovisual behavior between various BOPs, we looked at a few cases in more detail to gain insight into possible reasons for the differences. In particular, we did some speculative analyses of BOP 26, which was vocalized by 92% of the addressees and received relatively frequent head nods (4.36), versus BOP 16, which was not vocalized by any of the addressees and not frequently marked by head nods (2.21). Comparing these two instances yields the impression that the strength of the feedback cue (in terms of nodding and auditory backchanneling) is related to the degree to which the speaker signals that the information she provided is complete. BOP 26 occurs at the end of round 5, just after the speaker said, ‘That’s the one you have to pick. So, a square chimney and a triangle from the side of the house’. During this BOP of 1000 ms, the speaker is completely silent. The speaker appears to cue that she has provided all the information the addressee needs to pick the correct Tangram figure and therefore expects a strong affirmative backchannel. BOP 16, on the contrary, occurs at the beginning of round 4, just at the end of a short sentence from the speaker, ‘These are more like birds.’, where it is clear that more details from the speaker are needed to be able to identify the Tangram she is describing. At this stage, a strong feedback cue from the addressee would seem less appropriate, given that the provided information is still incomplete, but an addressee may acknowledge that he/she is listening to the speaker and awaiting further details. Obviously, future work is needed to determine whether these impressions would generalize to more conversational contexts.
5.3. Division of labor
Given the results described above, it is interesting to compare the audiovisual behavior of the speaker with that of the addressee. Admittedly, given that we only recorded one speaker, our claims related to her role would have to be explored further in future work, but based on our analyses so far, it appears that our speaker more consistently makes use of auditory cues than visual cues to elicit feedback from her addressees. Indeed, while we find some prosodic differences between BOPs and non-BOPs, there are no significant differences in facial activity. Conversely, the addressees appear to exploit visual cues more regularly than vocalizations to return feedback after BOPs. In other words, given the broader set of audiovisual cues that function within an interaction, these results suggest that a speaker is more often using auditory features and the listener is more often making use of silent, visual cues, except for BOPs that occur at the final edge of a turn where a speaker is basically signaling that she has arrived at the end of her turn and will stop talking.
While this would have to be explored further in the future, these results point to a division of labor between auditory and visual cues in the feedback mechanism of a conversation, with the former being more typical for the speaker and the latter for the addressee. The advantage of being able to access multiple channels is that their use can be distributed over conversation partners so that they can exchange information in parallel. For instance, while one person is talking, the other can return visual feedback, such as affirmative head nods or expressions of surprise or misunderstanding, that do not interfere with the speech produced by the other as these are produced in silence. If instead dialog partners were to produce speech simultaneously, miscommunication might well result from the overlapping sound streams, because the speech by one person might mask that of the other (Swerts & Krahmer, Reference Swerts and Krahmer2020).
5.4. Embodied conversational agents
Understanding variation in backchannel behaviors across addressees is important for applications in ECAs. If nodding and vocalizations can be produced for a large portion of backchannels to show that one is engaged and listening, future research could investigate the conditions under which these behaviors are reliably produced and, conversely, the conditions under which there is only a slim chance that either a nod or a vocalization will occur. Understanding this balance between variability and stability of backchannel behaviors across a human–human conversation can help make artificial agents that can give flexible feedback and that come across as natural in human–computer conversations. Moreover, person-specific variability may be used in an ECA to augment gender, personality and cultural characteristics. In other research, we have shown that specific backchannel behavior in an ECA can indeed elicit specific personality perceptions in its audience. We copy-synthesized the feedback behavior of different addressees during various BOPs onto an ECA and asked participants to indicate the perceived personality characteristics of the ECA. Among other conclusions, we found that a higher nodding amplitude during a BOP is perceived as more extraverted than a smaller nodding amplitude.
Previous studies show that when listening behaviors are missing or are poorly timed, communication is negatively affected and can go off the rails (Bavelas et al., Reference Bavelas, Coates and Johnson2000). The current findings suggest that there is no ‘one listening behavior’, but a variety of behaviors across different BOPs and across different addressees. And although nods and vocalizations are characteristic of spontaneous interactions, the degree to which they will be produced varies between addressees.