The two of us thank the commentators for their thoughtful reflections on our proposal. Although we cannot possibly address all of the issues they raised, we consider the ones most critical to our account.
Our proposal, briefly, is that social robots are designed to be depictions of social agents. People construe the humanoid robot Asimo, for example, as a depiction of a humanlike character, a social agent, who is able to engage with them in a genuine social interaction. We will call this the depiction model of social robots. In what follows, we have grouped the main issues raised in the commentaries into several categories – realism, experience with robots, communication with robots, anthropomorphism, and traits – and we take them up in turn. We end with a brief look at the past and future of social robots.
R1. Real social agents
A theme running through many of the commentaries is captured in the title of Eng, Chi, & Gray's (Eng et al.) commentary: People treat social robots as real social agents. Here are a few related claims (most italics are ours):
(1) Eng et al.: “Robots are not human beings, but neither are they mere depictions of social agents. Instead, they are seen as real social agents, especially when people interact with them.”
(2) Friedman & Tasimi: “While it might be difficult to confirm that social robots are viewed as depictions, it may be easier to confirm when they are viewed as genuine agents.”
(3) Orgs & Cross: “The robot [Smooth] performs a genuine social interaction: one physically embodied, social agent offers an object to another physically embodied, social agent. The robot therefore does not pose as a social agent, it is a genuine social agent.”
(4) Stibel & Barrett: “The question of what makes something a real, or actual, agent is largely a philosophical question. The question of when people perceive, or construe, an entity as a real agent is a question for psychology and anthropology.”
But what does it mean for a social agent to be real or genuine? About this there is much confusion.
R1.1 Real versus realistic
The bare nouns “tree,” “gun,” and “dog” ordinarily denote trees, guns, and dogs that are real or genuine. The phrases “artificial tree,” “fake gun,” and “toy dog,” on the other hand, denote depictions of trees, guns, and dogs as used, for example, in the theater or make-believe play. Artificial trees, fake guns, and toy dogs can be described as “realistic,” but not as “real” (Oxford English Dictionary). In these cases, “real” (or “genuine”) contrasts with “artificial.” For something to be construed as real or genuine, it needs to pass two tests:
Reality test: If an object or event is real, it “can be said to be real or actual, to be really or actually or literally occurring” (Goffman, 1974).
Realism test: If an object or event is a “real X,” it cannot be described as a “realistic X,” and vice versa.
The humanoid robot Nao, in our account, is therefore an artificial social agent that depicts a genuine social agent. And for us, a genuine social agent is a living being that is able to interact socially with humans. So, although Nao-prop might be described as “realistic” for a social being, it would not be described as a “real” social being. Hortensius and Cross (2018, p. 93) agree: “We use the term artificial agents to refer to robots (including those that are machine-like, pet-like, or human-like).”
Many commentators, however, describe social robots such as Nao-prop as genuine social agents. Orgs & Cross, contrary to Hortensius and Cross, insist “The robot [Smooth] does not pose as a social agent, it is a genuine social agent” even though it doesn't pass either test for being a real social agent.
Eng et al. say “Robots are not human beings, but neither are they mere depictions of social agents. Instead, they are seen as real social agents.” But for an object to be “seen as” a real X, the object cannot itself be a real X. That is, to say that a social robot is seen as a real social agent is to imply that it is not a real social agent – that it only looks like one, that it depicts a real social agent. So, despite their objections, Eng et al. appear to agree that social robots are depictions of social agents.
R1.2 Imagination
Still, people interacting with social robots imagine that they are interacting with real social agents. Owners of Sony's robot dog Aibo, for example, offer spontaneous reports such as these (Friedman, Kahn, & Hagman, 2003): “I feel I care about him as a pal.” “He always makes me feel better when things aren't so great.” “My emotional attachment to him … is strong enough that I consider him to be part of my family, that he's not just a ‘toy.’” Many of these “feelings,” we argue, are based on people's engagement, engrossment, or immersion in the scenes depicted, a phenomenon that has long been recognized in novels, films, and plays.
Novels and films, according to Chatman (1980), divide into a discourse (the medium people process) and a story (the content they are to imagine), and people get engrossed in the content (Clark, 1996, p. 366; see Clark & Van Der Wege, 2015). As Gardner (1985, p. 132) puts it about novels, “The writer's intent is that the reader fall through the printed page into the scene represented.”
Literary theorists, Gerrig notes, call this experience an aesthetic illusion. He cites Wolf (2009, p. 144), who said that the illusion “consists predominantly of a feeling, with variable intensity, of being imaginatively and emotionally immersed in a represented world and of experiencing this world in a way similar (but not identical) to real life.” In Wolf's view, being immersed can range from “the disinterested observation of an artifact” to “the complete immersion (‘psychological participation’) in the represented world” (p. 144). Much the same feelings arise, we suggest, with ventriloquist dummies, hand puppets, and social robots.
R1.3 Supporting imagination
Orgs & Cross assert that “Clark and Fischer link the quality of a social robot to its resemblance to a human agent.” This is a serious misreading. “All social robots,” we wrote, “represent nonstandard characters,” beings that one may never have met, seen, or thought about before. And the “quality of a social robot” is tied not to its literal resemblance to a character, but to the depictive devices by which the character is represented (see Clark & Van Der Wege, 2015). Here are a few such devices that film makers, play directors, and puppeteers have used to immerse people in the scenes depicted.
R1.3.1 Perceptual illusions
A perceptual illusion is a perceptual experience that people feel is true even though they know, intellectually, that it is not true (see Gendler, 2008, on aliefs). Movies and plays are packed with them. Some are visual (fake blood, fake knives, stunt actors, artificial scenery), and others are auditory (Foley effects, diegetic sounds, dubbing).
Social robots also rely on perceptual illusions. One of these, fittingly, is the ventriloquist illusion: We hear a ventriloquist's voice as coming from the mouth of the dummy even though we know it is coming from the mouth of the ventriloquist. The robot Smooth's voice, for example, comes from its ears, and the robot Asimo's voice comes from its chest, and yet both voices are heard as coming from their mouths.
R1.3.2 Concealment
Movie and stage directors try to conceal elements of scenes that are not depictive – such as the lighting, director, and stage crew. One reason is to distinguish outside elements from the depiction proper. Another is to avoid distractions that interfere with people's immersion in a scene. With social robots, the machinery is generally concealed inside the body and head. Kismet the robot is an exception. It consists of eyes, ears, and lips hung from a visible metal frame, and sure enough, one child interpreted the metal frame as hair (Turkle, Breazeal, Dasté, & Scassellati, 2006, p. 324).
R1.3.3 Disguise
In early performances of Hamlet, Ophelia was played by boys disguised as women, and recently Hamlet has been played by women disguised as men. In robots, camera lenses may be disguised as eyes, microphones as neckpieces, and loudspeakers as ears, yet people seem not to notice, or care.
R1.3.4 Caricature
In animated cartoons, most characters are caricatures. Mickey Mouse's head, ears, and feet are exaggeratedly large, and so is Porky Pig's head. The actions depicted in cartoons are also caricatured (Thomas & Johnston, 1995). Objects in motion are squashed and stretched in unnatural ways, and characters exaggerate their starts, stops, and other movements.
The same is true of social robots. Nao, for example, has huge arms, legs, and shoulders, but very small hips, a caricature of an adult male. Despite its antirealism, caricature is often helpful. Drawings of faces are recognized more quickly when caricatured than when veridical (Rhodes, Brennan, & Carey, 1987). And people at times prefer abstractly designed social robots over more realistic ones (Hegel, 2012).
R1.3.5 Feature selectivity
All depictions, we argued (target article, sect. 5.1), are selective about which features are depictive and which are not. Social robots are no exception. Nao, for example, has “eyes” and “ears” at the correct locations on its head, but it “sees” through cameras in its mouth and forehead and “hears” through microphones in its forehead. And Nao's “ears” are loudspeakers that depend on the ventriloquist illusion, a fact people must ignore. Nao's realistic “eyes” and “ears” help people see it as a depiction of a humanlike being even though these do not function as sense organs.
The point is that perceptual illusions, concealment, and disguise add realism to depictions whereas caricature and feature selectivity do not. And yet all five devices help engage, engross, or immerse people in the scenes depicted. The same techniques are exploited in social robots. Nao is a good example.
R2. Real experiences and real accomplishments
Many commentaries (e.g., Eng et al., Stibel & Barrett, Orgs & Cross, and Vogeley) observe that people interacting with social robots have real experiences – real emotions, overt physical reactions, genuine feelings of responsibility – and that they accomplish real goals – from kicking balls back and forth to exchanging real information. Some of the commentaries take these as evidence against the depiction model, but that is a mistake.
The depiction model predicts just these phenomena. The character depicted by a social robot is selectively embodied in the robot's prop: The body of Nao's character coincides part-by-part with the body of Nao's prop, and the movements and speech of Nao's character coincide moment-by-moment with the motions and sounds of Nao's prop. People interact socially with Nao's character by engaging part-by-part and moment-by-moment with Nao's prop, and that leads to real experiences and real achievements. Several commentaries add evidence for this view.
R2.1 Real experiences
Reeves argues that engaging with social robots includes not only imagined experiences “guided by pretense,” but “natural experiences that are direct, automatic and independent of any thoughtful mapping between what is real and depicted.” At an IMAX film, we are surrounded by an 18 × 24 meter screen, and when the camera goes over a mountain ridge, we feel our stomachs rise into our throats. Reeves calls experiences like this “natural responses.” (See also Förster, Broz & Neerincx [Förster et al.], Seibt, and Vogeley.)
Natural responses, Reeves argues, are a product of what Kahneman (2011, 2012) called system 1 thinking. System 1 is fast, intuitive, and involuntary, whereas system 2 is slow and “performs complex computations and intentional actions, mental as well as physical” (Kahneman, 2012, p. 57). The time-locked processes we described in section 7.3 belong to system 1. The audience at Hamlet must imagine Hamlet stabbing Polonius at precisely the same time as the actor playing Hamlet is “stabbing” the actor playing Polonius. The percept-based processes we discussed in section 7.3 also belong to system 1. People are usually able, without reflection, to recognize an apple as an apple. Both of these processes would be natural responses.
Reeves concludes with a significant insight: “Much of the history of media technology is about inventions that promote natural responses.” This is especially clear in depiction-based media, such as film, television, video, and telepresence technology. Reeves' point applies just as forcefully to social robots. People's experience with them seems real because it is based in part on natural responses.
R2.2 Experiencing emotions
If system 1 “generates emotions” as Kahneman (2012, p. 57) argued, then emotions should be part of people's experience with depicted scenes. In research we cited (Gross, Fredrickson, & Levenson, 1994), students viewing a clip from the film Steel Magnolias often became so immersed in the story that they got sad and cried. Clips from other films reliably evoke emotions ranging from amusement, anger, and contentment to disgust, fear, and sadness (Gross & Levenson, 1995).
These emotions, Blatter & Weber-Guskar note, are “cases of what others have called fictional emotions.” They cite Gendler and Kovakovich (2006), who contrast “real” emotions, which are about real situations, with “fictional” emotions, which are about fictional ones. But emotions require a finer analysis.
R2.2.1 Emotions proper
For emotion theorists like Gross and colleagues, an emotion is real regardless of its source. The sadness experienced in Steel Magnolias was real even though it was about a fictional scene. Blatter & Weber-Guskar seem to agree: “In all these cases, we know that these characters are fictional, but having followed their stories we feel emotions that are very similar to the ones we would feel for real people.”
R2.2.2 Sources
Many emotions have identifiable sources. People fear a gunman, worry about the weather, and feel compassion for an ailing sister. Emotions like these depend on whether the source is real or fictional (à la Gendler) and whether it is present or not. As we noted in section 9.2, owners of the robot dog Aibo become emotionally attached to it even though they recognize that it is an artificial agent.
R2.2.3 Motivated reactions
People's emotions often motivate further actions. At Hamlet, the audience experiences shock when the actor playing Hamlet suddenly “stabs” the actor playing Polonius. For an actual stabbing, people would intervene or call for help, but the audience at Hamlet does not do this (see Walton, 1978). People can also regulate or suppress their emotions; in horror films, they can cover their eyes or leave the building (Gross, 2008).
As Gerrig notes, people don't always suppress these reactions. When Clark watches crime films at home on television, he sometimes yells at characters “Watch out! Watch out!” despite frowns from his wife. Informal reports suggest that reactions like these are common. As Gerrig argues, “in the moment, the experience of an aesthetic illusion generates behavior that is real rather than pretense.” Clark construes his yelling as an extension of his emotional responses, which are real in the moment. So, when Aibo owners experience real emotional satisfaction in playing with their robots, that is in line with the depiction model.
R2.3 Continuity of experience
Aesthetic illusions with novels and films tend to be continuous. Once people immerse themselves in a story, they stay immersed in it until they break out of it. The same should hold in people's engagement with social robots.
Rueben takes a different view. “There are reasons to suspect that meta-cognition about construing social robots as depictions would be more difficult – or absent – than Clark and Fischer discuss.” He goes on: “The amount of time and effort that participants give to this reflection could greatly affect their responses.” But in novels, people's immersion is continuous; they don't have to re-immerse themselves with each new sentence or paragraph. The same is true with social robots. People don't need extra “time and effort” for “reflection” at each new step of their interaction with a robot. Once engaged with a robot, people can stay engaged.
A final point is due to Wrede, Vollmer & Krach (Wrede et al.) (see also Healey, Howes, Kempson, Mills, Purver, Gregoromichelaki, Eshghi & Hough [Healey et al.]). People find it easy to stay immersed in an imagined scene as long as it goes smoothly. But once they notice an inconsistency in the evidence, the spell is broken, they experience a breakdown, and the physical prop is foregrounded. The same happens in the theater when an actor forgets a line, the scenery falls over, or a stage light burns out. Breakdowns like these remind viewers of the base and depiction proper that lie behind the scene depicted. In our example, “When a robot stops moving, people must decide ‘Did the social agent fall asleep, or did the artifact's battery die?’” And people may go for one interpretation one minute and another the next (Fischer, 2021).
R3. Are social robots “mere depictions”?
In their commentary, Hortensius & Wiese say “[In] the framework put forward by Clark and Fischer … people construe social robots as mere depictions of social agents,” and others make similar comments (our italics):
(1) Eng et al.: “Research finds that – in real life – people also treat robots as actual social agents, not as mere depictions of social agents.” “[T]he more lifelike robots become, the more we treat them like social agents themselves, not mere depictions.”
(2) Förster et al.: “Firstly, we argue that robots do constitute a separate category of beings in people's minds rather than being mere depictions of nonrobotic characters.”
(3) Friedman & Tasimi: “How can we tell if other people think they are dealing with a genuine social agent or a mere depiction of one?” “So rather than viewing robots as mere depictions, people might instead see them as genuine agents with limited moral worth and limited mental capacities.”
(4) Gillath, Abumusab, Ai, Branicky, Davison, Rulo, Symons & Thomas (Gillath et al.): “Even if Clark and Fischer are correct in suggesting that bots are merely interactive depictions, the interactions people have with them are inevitably embedded within social contexts and involve specific social roles.”
(5) Girouard-Hallam & Danovitch: “A developmental and ontological perspective on social robots may move the conversation beyond mere depiction to a deeper understanding of the role social robots play in our daily lives and how we view them in turn.”
(6) Haber & Corriveau: “Taken together, these data support the idea that children engage with social robots in much the same way as they do with other social informants – and importantly, not simply as interactive depictions.”
(7) Malle & Zhao: “Most current social robots are mere depictions.” “[T]he more lifelike robots become, the more we treat them like social agents themselves, not mere depictions.”
(8) Seibt: “The authors’ core assumption, however, that social robots are always and only experienced as depictions of social agents, rather than as social agents proper, seems problematic.”
To describe a depiction as a mere depiction, however, is to ignore its content – the scene people are to get engrossed in. It would be absurd to describe King Lear, The Merchant of Venice, and Othello as mere depictions. Shakespeare's genius lay first in creating the stories about Lear, Shylock, and Othello and then in creating plays that immerse us in those stories. Yes, Shakespeare wrote magnificent dialogue, yet it is the stories that audiences get engrossed in and remember afterward. It is equally absurd to describe social robots as mere depictions. To do so devalues the thought and skill that engineers and social scientists put into their creations.
Comments like these reveal a misunderstanding of what it is to be a depiction, a concept characterized more fully in previous papers (Clark, 1996, 2016, 2019; Clark & Gerrig, 1990). Here we sort out some of those misunderstandings.
R3.1 Beyond “mere depictions”
At its heart, a depiction is a representation of something else – a sign that signifies an object. The philosopher Peirce (1932, 1974; Atkin, 2010) argued that signs come in three main types. (1) A symbol signifies an object by rule. The sound /hunt/ signifies “dog” for German speakers by a rule of German. (2) An index signifies an object by a physical connection with the object. An arrow is an index that signifies the thing it points at. And (3) an icon signifies an object by its perceptual resemblance to the object. A video of a dog barking at a squirrel is an icon that signifies the scene by its visual and auditory resemblance to that scene. Many signs, Peirce noted, are mixed signs – combinations of two or three of the basic types. (Petersen & Almor, alas, overlooked our citations in criticizing us for not tying our model to Peirce, signs, and icons.)
People communicate by producing signs for each other, and that leads to three methods of communicating: (1) describing things with symbols; (2) indicating things with indexes; and (3) depicting things with icons. Depicting is, therefore, a basic method of communication on a par with describing and indicating (Clark, 2016). Most acts of communication are composites of these methods.
Acts of communication, in turn, are based on the recognition of a producer's intention in producing them (Grice, 1957, 1969). When Kate tells Lionel “I caught a fish this long (holding up two hands, palms in, 30 cm apart),” she intends him to recognize what she means by her gestural depiction – that the fish was 30 cm long – from two types of information: her perceptual display (the content, place, and timing of her gesture); and her intention, or purpose, in producing the display (as expressed in part in “I caught a fish this long”).
Kate and Lionel's actions aren't unilateral – separate and autonomous. They are bilateral – conditional on each other. Kate has to coordinate her display (its content, placement, and timing) with Lionel's interpretation of her display. In Grice's account, these two actions are conditional on each other even when they are displaced in space, time, or both. The same requirement holds for depictions such as social robots.
R3.2 On modern art
Depictions, in this view, have a purpose that recipients are intended to recognize. Orgs & Cross disagree. “Much of contemporary art,” they argue, “neither depicts nor represents.” As evidence, they cite delightful examples from performance art such as dance and theater that “dissolve the binary distinction between depicted and depictive scene, or acting and not-acting.”
Orgs & Cross's argument, however, ignores purpose. The very point of much modern art is to have no practical point. Artists have license to entertain, divert, or fascinate however they like and often leave purpose indeterminate. When Andy Warhol painted “Campbell's Soup Cans,” why did he depict Campbell's soup cans, and why 32 of them? Why did Jackson Pollock drip paint on a canvas in the patterns he did? Why are so many works entitled “Untitled”? Other artists play with trompe l'oeil, deceptive perspective, and visual illusions. Orgs & Cross's examples are of this ilk, and viewers appreciate them for what they are.
Everyday depictions, however, have a practical purpose. People base their interpretation of Michelangelo's David, Kate's depictive gesture, and social robots in part on what they believe the creators intended. Genuine depictions and artistic creations may live in the same world, but they are not all processed in the same way.
R4. Social robots must depict the way agents communicate
The depiction model holds that people construe social robots as depictions of social agents. But for an agent to be a social agent, it must be able to engage people in social interactions, and to do that, it must be able to communicate. As we put it in section 7, “it takes coordination for two individuals to interact with each other, and they cannot do that without communicating (Clark, Reference Clark1996).” A social robot must therefore depict not only the agent's physical appearance and movements, but also its acts of communication – its speech, hand gestures, head nods, head shakes, eye gaze, facial gestures, body postures, and body placements (see target article, sect. 7.1). That is, for the depiction of a genuine social agent to be complete, it must include the agent's communicative acts.
To our surprise, acts of communication are not even mentioned in most of the commentaries. Worse yet, they are impossible in principle under the alternative models based on anthropomorphism, embodiment, mind perception, and trait attributions. Here we briefly review our own previous work on communication (e.g., Clark, 1996, 2005, 2021; Clark & Brennan, 1991; Clark & Henetz, 2014; Clark & Schaefer, 1989; Clark & Wilkes-Gibbs, 1986; Fischer, 2016, 2021) and then show how it undercuts the alternative models.
R4.1 Joint activities
The basic idea is that whenever people interact socially, they do things together: They coordinate with each other in joint activities (Clark, 1996, 2005). And to coordinate with each other, they have to agree on their joint actions and positions, and that requires communication. Here is an example from two people assembling a TV stand (from Clark, 2005):
Ann: Should we put this (holding up piece of wood) in, this, this little like kinda cross bar (pointing at a picture on the directions for the TV stand), like the T? like the I bar?
Burton: Yes, we can do that.
In turn 1, Ann proposes a joint position for the two of them, and in turn 2, Burton takes up her proposal and agrees to it. People can also reach agreement with gestures, which are any “visible acts of communication” (Kendon, 2004):
Burton: (extends hand with a peg to Ann)
Ann: (grasps the peg)
The principle is this: “It takes coordination for people to do things together, no matter how simple, and it takes communication to achieve that coordination” (Clark, 2005, p. 507).
Social robots require the same techniques. In section 2.4 of our paper, we illustrated two exchanges between the robot Smooth and a woman named Beth:
Smooth: (presenting water glasses to Beth) Take your drink please.
Beth: (takes a glass of water)
Smooth: (faces Beth) Cheers!
Beth: (lifting her glass slightly) Cheers.
In the first pair of turns, Smooth offers Beth water, and she accepts his offer and takes a glass. In the second pair of turns, he makes a toast, and she takes it up and reciprocates. All four turns rely on both speech and gestures. In an example from Guo, Lenchner, Connell, Dholakia, and Muta (2017), a woman asked a robot concierge for the location of a bathroom. She posed the question in turn 1, and he took it up and answered it in turn 2:
Woman: Where is the bathroom.
Robot concierge: The bathroom is in aisle 13.
So, for humans and social robots to coordinate with each other, they must reach agreement on how to get things done together. A model of social robots unable to do this cannot be complete.
R4.2 Common ground
When two people communicate, they assume certain information to be part of their current common ground, and they add to that body of information with each new act of communication (Clark, 1996, Ch. 4; Stalnaker, 1978). The same goes for social robots. As Carroll argued, “Future robots must effectively coordinate common ground with humans.”
Common ground comes in two main types, and social robots need to track both:
(1) Personal common ground is information people establish based on their joint experiences – what they see, do, and communicate with each other. Suppose a woman named Jane asks an actual concierge, “Where is the bathroom?” and he answers “The bathroom is in aisle 13.” With her question, the two of them would add her request to their current common ground, and with his answer, they would add the location of the bathroom.
(2) Communal common ground is information people share as members of the same cultural communities, such as their nationality, occupation, language, gender, age cohort, or residence. Although Beth, for example, spoke to her friends in Danish, she took for granted that she and Smooth both knew English and spoke to him in English.
For face-to-face coordination to go smoothly, communication must also be reliable. That requires a process called grounding: People in joint activities try to establish, as they go along, the mutual belief that they have understood each other well enough for current purposes (Clark, 1996; Clark & Brennan, 1991; Clark & Schaefer, 1989; Clark & Wilkes-Gibbs, 1986). Speakers monitor their conversations both for evidence of success (e.g., “uh huh,” “good God,” “oh,” and nods from addressees) and for evidence of failure (e.g., misunderstandings that need repairing). When one woman asked Guo et al.'s robot concierge something he couldn't understand, he cleared it up before going on:
Woman: I need to powder my nose. (non-recognized question)
Robot concierge: Can you rephrase the question?
Woman: Where is the bathroom.
Robot concierge: The bathroom is in aisle 13.
The side sequence in turns 2 and 3 is one of many strategies people use for repairs (Dingemanse et al., 2015; Schegloff, Jefferson, & Sacks, 1977). Social robots, then, also need strategies for tracking success and failure in their social interactions (see Healey et al. and Wrede et al.).
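To make the grounding loop concrete, here is a minimal, self-contained sketch in Python – our illustration, not an implementation drawn from Guo et al. or any robot platform – of a concierge-style agent that answers recognized questions, opens a repair side sequence when it fails to understand, and records each grounded exchange in its common ground with that individual. The intent table, function names, and replies are hypothetical.

```python
# A minimal sketch of the grounding loop described above (illustrative only).
# The intent table, function names, and replies are hypothetical.

KNOWN_ANSWERS = {
    "where is the bathroom": "The bathroom is in aisle 13.",
}

def understand(utterance: str):
    """Return an answer if the utterance is recognized, else None."""
    return KNOWN_ANSWERS.get(utterance.lower().strip(" .?!"))

def respond(utterance: str, common_ground: list) -> str:
    """Answer if understood; otherwise open a repair side sequence."""
    answer = understand(utterance)
    if answer is None:
        # Evidence of failure: ask the partner to reformulate (a side sequence).
        return "Can you rephrase the question?"
    # Evidence of success: both parties can add the exchange to common ground.
    common_ground.append((utterance, answer))
    return answer

if __name__ == "__main__":
    cg = []  # common ground accumulated with this individual partner
    print(respond("I need to powder my nose.", cg))  # -> repair request
    print(respond("Where is the bathroom.", cg))     # -> grounded answer
    print(cg)
```

The sketch makes the two halves of grounding explicit: the agent either gives positive evidence of understanding (an answer that both parties can add to their common ground) or signals failure and initiates a repair before proceeding.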
The problem, as we will show, is that alternative models of social robots have no means for accumulating common ground or for grounding what they do and say.
R4.3 Social agents are individuals
In communication, common ground is accumulated by individuals and not by types of individuals (Clark, 1996, 2021). When the actual concierge answered Jane's question, he tried to add new information to the common ground he shared with Jane the individual and not with some generalized person. He tried to anchor his references (“the bathroom” and “aisle 13”) to an actual bathroom and an actual aisle he assumed was in his and her current common ground. It is no different for people communicating with a robot concierge.
Individual entities are fundamentally different from types of entities. Thoughts about “Jane” and “the concierge” are about the individuals they index. Thoughts about “woman” and “concierge,” in contrast, are about the types of individuals they describe. Crucially, an indexical thought cannot be reduced to a set of descriptive thoughts (Perry, 1979, 1993; Recanati, 2012, 2013). An individual like Jane cannot be represented as merely a bundle of attributes. Yet that is the assumption behind many models of social robots based on anthropomorphism, mind perception, and trait attributions (e.g., Girouard-Hallam & Danovitch, Orgs & Cross, Ziemke & Thellman). The point may seem technical, but it is a significant strike against those models.
R4.4 Interim summary
In short, for a robot to be a social robot, it must represent a real social agent, an individual, able to engage humans in joint activities. The agent must be able to do four things (see the sketch after this list):
(1) Coordinate with humans in joint activities, however restricted the activities,
(2) Communicate with humans well enough to advance these activities,
(3) Accumulate common ground with individual humans as these activities advance (as Carroll suggests),
(4) Ground what gets said and done well enough for current purposes.
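As a rough sketch of what these four requirements would amount to in a robot's control software – an illustration under our own simplifying assumptions, with hypothetical class and method names, not a specification from any existing platform – consider the following interface:

```python
# A hypothetical interface capturing the four requirements above.
# Every method is keyed to an individual partner, not to a generic
# "person" type (see sect. R4.3).

from abc import ABC, abstractmethod

class SocialAgentCharacter(ABC):
    """The character a social robot depicts and must be able to realize."""

    @abstractmethod
    def coordinate(self, partner_id: str, joint_activity: str) -> None:
        """(1) Take part in a joint activity with a particular human."""

    @abstractmethod
    def communicate(self, partner_id: str, utterance: str) -> str:
        """(2) Produce and interpret communicative acts well enough to
        advance the current joint activity."""

    @abstractmethod
    def update_common_ground(self, partner_id: str, contribution: str) -> None:
        """(3) Accumulate common ground with this individual as the
        activity advances."""

    @abstractmethod
    def is_grounded(self, partner_id: str, utterance: str) -> bool:
        """(4) Judge whether an utterance was understood well enough for
        current purposes; if not, a repair is needed."""
```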
A number of proposals in the commentaries are incompatible with these features. In the next two sections, we take up four of these proposals.
R5. Anthropomorphism
Petersen & Almor make a remarkable claim in the title of their commentary: “Anthropomorphism, not depiction, explains interaction with social robots.” But in support of their position, they treat “social responses” and “social behaviors” as if they were genuine “social interactions,” and they aren't. Citing Airenti (2018), Petersen & Almor write (with our italics):
For example, when a car engine fails to start, it is not uncommon for the would-be driver to engage in begging, chastising, or other social behaviors directed towards the car. It is difficult to argue that the car is a depiction of a social agent. Rather, Airenti argues that the interactive situation itself, in this case noncooperation, is sufficient to provoke a social response.
But anthropomorphizing a car doesn't turn the car into a social agent, either artificial or real. The driver (call him Joe) directs actions toward the car, but the car just sits there. The two of them don't coordinate with each other. And when Joe “chastises” the car, he is not communicating with it. Whatever Petersen & Almor's “interactive situation” is, it is not a social interaction.
Ziemke & Thellman give a vivid example with driverless cars, but it, too, has problems.
As a pedestrian encountering a driverless car at a crosswalk, you might be asking yourself: Has that car seen me? Does it understand I want to cross the road? Does it intend to stop for me? … [I]t would also seem more straightforward to view the pedestrian as interacting with the car in front of it – rather than interacting with some internal representation or an imagined depicted character.
Here again, the interaction is not a social interaction. The car and pedestrian do not coordinate with each other as two social agents. The car is designed to predict what the pedestrian will do, and the pedestrian tries to predict what the car will do, but they do not communicate with each other about that. The situation is competitive, not cooperative. With social robots, people coordinate by communicating with the social agents the robots depict.
R5.1 Problems with anthropomorphism
These two examples illustrate serious problems for anthropomorphizing as an account of social robots.
R5.1.1 Unilateral versus bilateral interpretations
Anthropomorphizing a thing – viewing it as human – is a unilateral action, which people perform on their own. Joe was free to imbue the car with any features he liked, and he chose human ones. But interpreting a depiction is a bilateral action, which also takes account of what the depiction is intended to represent. Viewers cannot anthropomorphize Michelangelo's David any way they like, and they know that. They try to interpret it as Michelangelo intended them to – as a depiction of the biblical David.
R5.1.2 Individual agents
Anthropomorphizing creates types of humans and not individuals. Yet, as we noted, it takes individual agents to coordinate with other humans, communicate with them, accumulate common ground with them, and ground what they say and do (Clark, 2021).
R5.1.3 Communication
Anthropomorphizing an entity does not specify how it coordinates with others through speech and gestures.
R5.1.4 Nonstandard characters
To anthropomorphize an entity is to imbue it with human features. But many social robots are a mix of human, animal, and other features. “All social robots,” we wrote (contra Caruana & Cross), “represent nonstandard characters.” They are “best viewed as composite characters – combinations of disparate physical and psychological attributes.” The species they belong to don't come prefabricated. They have to be constructed. Anthropomorphizing simply cannot create the range of creatures that social robots represent.
These problems challenge in principle any proposal about social robots that relies on anthropomorphizing or mind perception (Bigman, Surdel & Ferguson [Bigman et al.]; Blatter & Weber-Guskar; Carroll; Caruana & Cross; Doyle & Hodges; Eng et al.; Goldman, Baumann & Poulin-Dubois; Orgs & Cross; and Ravikumar, Bowen & Anderson [Ravikumar et al.]).
R5.2 Embodiment
Ravikumar et al. argue for treating social robots as embodiments of social agents: “[E]ven if social robots are interactive depictions, people need not mentally represent them as such. Rather, people can directly engage with the opportunities for action or affordances that such robots/depictions offer to them.” This position, however, also has problems.
Suppose Brigitte saw her old friend Alain and wanted to talk to him. She knew him well enough to assume they shared a great deal of common ground, such as how to approach, hug, kiss, and gossip with each other. She wouldn't have known how to approach, hug, kiss, or gossip with an anonymous body. She needed to know it was Alain. Brigitte would have the same problem with the robot Asimo. Identifying an entity as a body is not enough to engage with it even in what Ravikumar et al. called “sociocultural settings.”
Embodiment also doesn't distinguish features that allow affordances from features that do not. As we noted, “Observers of a depiction implicitly realize that only some of its features are depictive,” and only the depictive features afford the right inferences. Asimo-prop's hand depicts a real hand, which affords handshakes, but Asimo-prop's ears happen not to contain senses of hearing, so they do not afford headphones. Asimo-prop's lack of a mouth doesn't afford speaking, yet Asimo-char is able to speak. Discrepancies like these differ from robot to robot. If so, how can people “directly engage with the opportunities for action or affordances that such robots/depictions offer to them”?
Communicating with a robot is even more of a challenge. With Asimo, should Brigitte speak French, use French gestures, and kiss him on both cheeks, as she would with Alain? The affordances of Asimo-prop's body offer no answers.
R5.3 Intentional stance
As the commentaries by Veit & Browning and Ziemke & Thellman noted, Dennett (1987, 1988) proposed two strategies, or stances, that attribute intentions to systems such as social robots:
The intentional stance is the strategy of prediction and explanation that attributes beliefs, desires, and other “intentional” states to systems – living and nonliving – and predicts future behavior from what it would be rational for an agent to do, given those beliefs and desires. (1987, p. 495)
In the design stance, one predicts the behavior of a system by assuming that it has a certain design (is composed of elements with functions) and that it will behave as it is designed to behave under various circumstances. (1988, p. 496)
There is merit in both stances. For Asimo, the intentional stance applies to the social agent it represents (Asimo-char), and the design stance applies to Asimo's physical design (Asimo-prop).
These two stances, however, are unilateral interpretations and not the bilateral ones needed for social robots. It isn't enough to attribute certain mental states to Asimo-char. People must attribute the mental states they believe they were intended to attribute to Asimo-char. More than that, their interpretation of Asimo-char must be custom-built for the nonstandard individual that Asimo depicts.
R6. Trait attributions
Bigman et al. entitle their commentary, “Trait attribution explains human–robot interactions,” and others agree with their claim (e.g., Eng et al., Ziemke & Thellman). But as we noted earlier, individuals such as Asimo-char cannot in principle be reduced to bundles of traits, so models based on trait attributions face problems from the start. Trait attributions may be useful in describing or designing social robots (see Ziemke & Thellman), but that doesn't allow bundles of traits to count as models of social robots. Alas, trait attributions have other problems as well.
R6.1 Measuring traits
Traits are often studied by asking people to rate how much human attributes apply to nonhuman entities (see Epley, Waytz, & Cacioppo, 2007; Gray, Gray, & Wegner, 2007; Reeves, Hancock, & Liu, 2020; Weisman, Dweck, & Markman, 2017; see also Thellman, de Graaf, & Ziemke, 2022). In one study (Gray et al., 2007), participants rated 13 “characters,” which ranged from a baby, a fetus, a dead woman, and a frog to God, “you,” and a robot. All but the fetus were given proper names. People rated the entities on dimensions of “experience” (e.g., hunger, pain, fear, pride) and “agency” (e.g., self-control, morality, memory).
What people rated, however, weren't objects they had interacted with. They were static photos (e.g., Phillips, Ullman, de Graaf, & Malle, 2017; Reeves et al., 2020; Ruijten, 2015), videos, labels (Lencioni, Carpinella, Rabuffetti, Marzegan, & Ferrarin, 2019), or descriptions of such objects. Here is Gray et al.'s description of their robot:
Kismet. Kismet is part of a new class of “sociable” robots that can engage people in natural interaction. To do this, Kismet observes a variety of natural social signals from sound and sight, and delivers his own signals back to the human partner through gaze direction, facial expression, body posture, and vocal babbles.
Clearly, this paragraph isn't about Kismet-base or Kismet-prop, but about Kismet-char. Only the character would have a proper name, be male, “engage people in natural interaction,” “observe … social signals,” “deliver his own signals,” direct his gaze, and produce a “facial expression,” “body posture,” and “vocal babbles.” It is no surprise that Kismet-char was judged to have agency.
What if people had been asked whether Kismet's machinery experienced hunger, pain, or fear, or possessed pride or morality? Their ratings would surely have changed. Even if they thought Kismet's machinery experienced “hunger” (as when its battery died), they would have based their judgment on a metaphorical, not literal, interpretation of hunger. Metaphors, indeed, are a widespread issue.
R6.2 Metaphor problems
People seem willing to attribute “mental states” and “minds” to all sorts of artifacts. These include not only cars (Petersen & Almor, Ziemke & Thellman), but gadgets. A study by Epley, Akalis, Waytz, and Cacioppo (2008) examined five gadgets, including “Clocky (a wheeled alarm clock that ‘runs away’ so that you must get up to turn it off)” and “Pillow Mate (a torso-shaped pillow that can be programmed to give a ‘hug’).” People were asked to rate “the extent to which the gadget had ‘a mind of its own,’ had ‘intentions,’ had ‘free will,’ had ‘consciousness,’ and ‘experienced emotions.’”
But what were these people rating? When asked about the extent to which Pillow Mate had “a mind of its own” or “free will,” they were forced to interpret “mind” and “free will” as metaphors. If they had been asked whether Pillow Mate really or actually or literally had a mind of its own or free will, they, like us, would have said no (see Thellman et al., 2022). Likewise, if Joe the driver had been asked if he was really or actually or literally chastising the car, he, too, would have said no. And so would the pedestrian, if asked whether the driverless car could really or actually or literally “see” him or her, “understand” things, or have “intentions.” (Did Romeo think that Juliet was really or actually or literally “the sun”?)
What we have here are metaphors: “hunger,” “a mind of its own,” “intentions,” “free will,” “consciousness,” “emotions,” “chastising,” “see,” and “understand.” It is a mistake to equate metaphorical attributions like these with their literal counterparts. They are not equivalent, and treating them as equivalent leads to misleading claims about both traits and social robots.
R7. Depicting is universal
A theme running through many commentaries is that depictions are exotic – too complex for people to use and understand easily (cf. Rueben). Keeping track of two layers, the depiction proper and the scene depicted, takes too much metacognitive effort. But nothing could be further from the truth. Depicting is a basic method of communication, and it is everywhere.
To begin with, depictions that people perform in conversation, such as direct quotations, iconic gestures, facial gestures, and full-scale demonstrations, are part of all languages. Depicting is also the basis for ideophones such as meow, cock-a-doodle-doo, and oink-oink, and these, too, are part of all languages (Dingemanse, 2013). And children begin to use performed depictions from as young as 18 months of age (Clark & Kelly, 2021). Conclusion: People everywhere use and understand performed depictions as part of everyday communication and from an early age.
Other types of depictions have been around since Upper Paleolithic times. Cave paintings of horses, bulls, and hunters have been found on all continents (except Antarctica) dating from 25,000 to 10,000 BCE. More elaborate paintings, sculptures, and ceramic depictions have been found in Egypt, Greece, China, North America, and Meso-America dating from 2,500 to 1,000 BCE. Theater, puppet shows, and opera-like dramas have been documented in China, India, Greece, and Meso-America from as early as 1,500 BCE. There is nothing new about depictions like these, both static and staged.
Stibel & Barrett, two anthropologists, seem to challenge this view:
Lacking any cultural, personal, or historical concept of the idea of a “robot,” it seems unlikely that a twelfth-century human would take the object before them as a human-made artifact designed to “depict” authentic agency. More likely, they would construe this unknown entity as a real agent of some kind.
And yet automata, the ancestors of modern robots, were developed in Europe, the Middle East, and China well before the Common Era (Foulkes, 2017). Heron of Alexandria (10–90 CE), for example, designed automata that depicted “a shepherd who gave water to his sheep, and even an articulated bird that could whistle” (Foulkes, 2017, p. 64). Heron in turn inspired the construction of automata throughout Europe and the Middle East, including tabletop marching and fighting armies, flying birds, singing birds, walking lions, a donkey driving a water wheel, and even people playing chess. Truitt (2015) called these “medieval robots.”
Social robots such as Asimo, Smooth, and Nao are introduced nowadays not only physically but also with explicit interpretive frameworks. The same robots introduced in the same way should cause no more trouble for Stibel & Barrett's twelfth-century human than they do for modern humans.
R8. Other issues
Many issues raised in the commentaries deserve further discussion, but we can consider only a few.
R8.1 Theory
Is the depiction model a theory (see Bartneck)? The answer is clearly yes. A theory, according to Dennis and Kintsch (2007), should satisfy certain criteria, and the depiction model does just that. It accords with empirical data; it is precise and interpretable; it is coherent and consistent; it predicts future applications; and it provides explanations that go beyond the model itself. We have cited evidence supporting each of these criteria.
We have also investigated alternative accounts. The media equation was one of the early inspirations for our work, and while the depiction model makes many of the same predictions, it also explains phenomena not covered by the media equation (see target article, sect. 2.4). In his commentary, Reeves, one of the progenitors of the media equation, appears to agree (contra Bartneck). Anthropomorphism and trait attributions are two other alternative accounts, but these suffer from the empirical and conceptual problems we discussed earlier. The point of the depiction model, in sum, is to explain social robots in terms of a broader theory, namely, how people engage with depictions, and we believe it succeeds at that.
R8.2 Social roles
In our paper, we distinguished between self-agents, who “act on their own authority and are fully responsible for their actions,” and rep-agents, who “act on the authority of specified principals.” When Susan works as a server for Goldberg's Bakery, she is a rep-agent for the bakery, but once off work, she is on her own, a self-agent. Healey et al. (see also Carroll) worried about the roles such agents take.
Each individual person, we assume, has an individual role that is continuous and enduring. Susan remains Susan whatever else she does. But individuals also take on additional roles, social roles, that change with the social situation. They may take the social role of sister, companion, playgoer, or bus rider for people they interact with, and teacher, tutor, or concierge in working for others. Susan chose the particular role of server when she hired on at Goldberg's Bakery. Social robots could, in principle, take multiple social roles, but the robots we know of are able to take only one social role.
R8.3 Future
Predicting the future is dangerous. In about 1970, Herbert A. Simon, one of the founders of artificial intelligence (AI), suggested to colleagues that people would be able to talk with computers in 10–20 years. Fifty years later there is still no such computer. People share limited facts with virtual assistants such as Siri, Alexa, and Google Assistant, but as conversations go, these exchanges are primitive (see Marge et al., 2022). Today's AI systems for conversation still cannot deal with such features as the timing of turns, the use of uh and um, performed depictions, pointing, anchoring, grounding, irony, sarcasm, and empathy. According to Bender and Koller (2020), current AI models of language cannot be complete in principle because they are based on the form of language alone.
So Malle & Zhao are brave souls. They are clear-sighted about today's robots when they say: “[C]urrent social robots are advertised to be much more capable than they really are – that is, they are largely a pretense, a fiction.” But they venture into Herbert Simon territory when they go on (see also Caruana & Cross; Franklin, Awad, Ashton, & Lagnado (Franklin et al.); and Stibel & Barrett):
Now consider what robots will be like in the future. They will not just be depictions; they will instantiate, as robots-proper, the actions that current robots only depict. Unlike dolls and dummies, they will not just be crafted and controlled by human programs. They will rapidly evolve through directing their own learning and devising their own programs. They will increasingly make autonomous decisions enabled by continuously updated and massively expanded algorithms.
As Simon's prediction shows, the future doesn't always work out the way we think – often for principled reasons. With social robots, who knows what those principles will be.
Still, no matter how humanlike social robots become, they will never be humans. They will always be artifacts intended to depict humanlike social agents. To frame them otherwise would be to engage in deception.
R9. Coda
In 1919, film director Ernst Lubitsch made a silent comedy called “Die Puppe” (“The Doll”). A young man named Lancelot was informed by his rich uncle, a baron, that he had to get married by a certain date if he expected to inherit the family fortune. To trick his uncle, Lancelot arranged to marry an automaton – a beautiful life-sized mechanical doll. The maker of the doll, however, had created the doll in the image of his own beautiful daughter, and he managed to trick Lancelot into marrying his daughter instead of the doll. Happily, it all worked out in the end.
“Die Puppe” appeared the year before Karel Čapek's 1920 play “Rossum's Universal Robots.” Still, in the years that followed, writers needing a word for humanoid automata chose “robot” over “puppet.” What if they had chosen “puppet”? Social robots would now be called “social puppets,” and our claim that “social puppets are depictions of social agents” would be considered a truism. We would have had no paper, and the commentators would have had nothing to comment on. Thanks to Čapek, but not Lubitsch, we all had lots to say.