In their paper, ‘Accentuation and compatibility: Replication and extensions of Shafir (1993) to rethink choosing versus rejecting paradigms’, Chandrashekar et al. (2021) report a ‘very close replication’ study of a series of choose–reject problems documented in Shafir (1993). In the original 1993 paper, Shafir reported several studies showing that choosing and rejecting, contrary to standard assumptions, often are not complementary. Across several choice problems, ranging from parental custody to vacation decisions, gambles, and ice cream choices, Shafir asked participants either to choose or to reject one of two options (the problems were not all binary, but the analysis remains similar). Shafir found that the enriched option, the one with more positive and negative attributes, which he interpreted as providing more compelling reasons for choice and rejection, had a greater share of being chosen and rejected than the impoverished option, which provided weaker reasons for choice or rejection. In Shafir’s (1993) study, the enriched option’s share of being chosen and rejected added up to significantly more, and the impoverished option to significantly less, than the expected 100%. In some cases, the enriched option was more likely to be both chosen and rejected. To explain this pattern, Shafir appealed to the notion of compatibility, suggesting that options’ strengths weigh more heavily when people choose, whereas weaknesses matter more when people reject, and that enriched options, presenting both more positive and more negative attributes, thus receive more than their fair share of choice and rejection (Shafir, 1993; see also Shafir et al., 1993 for a review).
Chandrashekar et al. (2021) report a ‘very close replication’ study of a series of choose–reject problems reported in Shafir (1993) and conclude that, ‘Taken together, the replication findings do not indicate consistent support for the original findings.’ This echoes an earlier failure to replicate reported by Many Labs 2 (Klein et al., 2018; see also Shafir, 2018 for commentary), who presented Shafir’s (1993) original Custody Problem, about awarding or denying custody of a child to one of two parents, to thousands of participants in several countries.
In this brief commentary, we address Chandrashekar et al.’s (2021) conclusions regarding the failure to replicate the original problems. In doing so, we provide new data on the replicability of Shafir’s (1993) studies and discuss challenges with the earlier “very close” replications. We focus our analysis on the Custody Problem because it is the one problem used in both recent replication projects, but our argument applies more generally (and is further discussed, along with several new choose–reject problems including a revised version of Shafir’s (1993) Problem 2, the Vacation Problem, in Cheek and Shafir (2024)).
1. Replications and cultural changes between then and now
Let us begin with another, related study. A little over 20 years ago, Downs and Shafir (1999) investigated the role of compatibility and its impact on enriched versus impoverished options in the realm of social judgment. They presented participants with names of well-known personages with similar occupations, where, in each pair, respondents were more familiar with one personage than the other. For example, American participants were significantly more familiar with Ronald Reagan (an elder statesman at the time, who had been President of the United States a decade earlier) than with John Major (who had recently been the UK’s Prime Minister). Similarly, American participants were more familiar with Woody Allen (an American director at the height of his career at the time) than with Federico Fellini (a great Italian director barely known in the United States).
Participants were presented with various adjectives and had to select which personage in a pair was better described by each adjective. As predicted, and consistent with the compatibility hypothesis, participants were more likely to select the more familiar over the less familiar personage across opposite adjectives. For example, Reagan was judged as more confident than John Major by 61% of respondents, and as more insecure by 71%, totaling 132% across these opposing adjectives (with John Major receiving a total of 68%). Woody Allen received a total of 119% across confident and insecure, compared to Fellini’s 81%. José Canseco totaled 149% as compared to Tony Gwynn’s 51%, David Letterman totaled 139% compared to Kathy Lee Gifford’s 61%, and so forth. Not all judgments were quite this extreme, but the overall tendency of the more familiar personages, the ones offering more features compatible with the judgment, to be selected across opposite adjectives was pronounced and highly significant.
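The arithmetic behind these totals can be made explicit. If judgments across opposite adjectives were complementary, each personage's two percentages would sum to 100%; the less familiar personage's total is simply 200% minus the more familiar personage's total. The short Python sketch below reproduces the Reagan and Major figures reported above; the function name is our own and purely illustrative.

```python
def totals_across_opposites(pct_adjective, pct_opposite):
    """Given the percentage of respondents who picked the target personage
    for an adjective and for its opposite, return the totals for the target
    and for the other member of the pair (binary forced choice assumed)."""
    target_total = pct_adjective + pct_opposite
    other_total = (100 - pct_adjective) + (100 - pct_opposite)
    return target_total, other_total


# Reagan: 61% 'more confident', 71% 'more insecure' (figures from the text).
reagan_total, major_total = totals_across_opposites(61, 71)
print(reagan_total, major_total)  # 132 68
```

Any total above 100% for one personage therefore implies a matching shortfall for the other, which is why the paired totals reported above always sum to 200%.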
Now, how would a ‘very close replication’ of this study proceed? A ‘very close replication’ insists on using the same stimuli as in the original, with the claim that it is instructive to see how the phenomena persist through time. However, clearly the highly familiar items—Letterman, Canseco, and even Reagan—will not persist through time, or at least not to the same degree. The very hypothesis that generated these results in the late 20th century—namely, that participants will more easily find compatible instances in the familiar personages—predicts that those same stimuli will not replicate 30 or 40 or 50 years later, when those personages’ renown will have waned. (Furthermore, it is unlikely, e.g., that having married one adopted daughter and allegedly abused another, Woody Allen would retain his 1990s 70% ‘more moral’ advantage over Fellini.) The judgmental compatibility phenomenon published by Downs and Shafir in 1999 should, of course, replicate, but it would almost certainly require new stimuli: instead of Woody Allen and José Canseco, we may need Kate McKinnon and LeBron James.
Clearly, some phenomena, such as classical optical illusions, which are the outcome of evolutionary trends (Gregory, 2009), will replicate over long periods of time, whereas other phenomena, like those that depend on people’s attitudes toward Woody Allen or José Canseco, marriage, smoking, gender roles, or the environment, can change in just a few years. Where exactly to locate findings along this continuum is a theoretically interesting and nontrivial question. What is clear is that some findings are going to be time-sensitive in ways that optical illusions are not. And ‘very close replications’ need to observe those distinctions, or they risk generating confusing failures to replicate outdated items, rather than contributing to our understanding of the phenomena that lie behind them (for similar points, see, e.g., Ferguson et al., 2014; McGuire, 2013). This brings us to the issue at hand: replicating the patterns of choosing and rejecting documented by Shafir (1993) three decades later.
2. Choosing and rejecting 30 years later
With the goal of conducting a ‘very close replication,’ Chandrashekar et al. (2021) chose to run the original materials used by Shafir (1993). Both replication projects (Chandrashekar et al., 2021; Klein et al., 2018) used Shafir’s Custody Problem. The original problem read as follows (with the original results, from Shafir, 1993, p. 549, reproduced below):
Original Problem:
Imagine that you serve on the jury of an only-child sole-custody case following a relatively messy divorce. The facts of the case are complicated by ambiguous economic, social, and emotional considerations, and you decide to base your decision entirely on the following few observations. [To which parent would you award sole custody of the child?/Which parent would you deny sole custody of the child?]
Pilot testing when the Problem was first run had found that Parent A’s attributes were perceived as neutral, essentially offering no compelling reason to award or deny custody, whereas Parent B’s attributes were compatible with both choice and rejection. ‘Above-average income’ and ‘very close relationship with the child’ were highly positive (compatible with choice), whereas ‘lots of work-related travel’, and ‘minor health problems’ were much more negative (compatible with rejection; ‘extremely active social life’ in that context was close to neutral). In fact, Parent B’s rates of being awarded and denied (119%) exceeded the total of 100% expected if choosing and rejecting were complementary, z = 2.48, p < .02.
This problem, however, is not a tidy optical illusion; it is built on attributes that may very well change over 30 years. ‘Average working hours’ might sound more positive at a time when ‘more than ever, workers want to work fewer hours’ (Lufkin and Mudditt, 2021), and ‘average income’ may sound more appealing in an era of rapidly increasing economic hardship that has left millions of Americans without adequate income to meet their basic needs. Similarly, as cultural norms change and perhaps grow more conservative, ‘extremely active social life’ may, three decades later, be judged more negatively, connoting a certain neglect of family life in favor of fun. ‘Lots of work-related travel’ was viewed, in the late 1980s, highly negatively. Subsequent decades, however, brought major changes, including the enormous increase in the frequency and popularity of work-related travel (Footnote 1). When we collected ratings for various parental attributes in late 2017, we found that ‘lots of work-related travel’ was rated neutral, perceived as neither negative nor positive. But when we collected ratings again in the fall of 2022, perhaps due to the coronavirus pandemic and a shift to remote work, ‘lots of work-related travel’ was perceived negatively again (Footnote 2). It is not always easy to predict how cultural change, unprecedented global crises like the pandemic, or, for that matter, some ‘fashionable’ associations, may change how people perceive certain stimuli three decades later. As a result, timely pilot testing will virtually always be a plus.
3. Pilot surveys
We conducted two preregistered pilot surveys to gauge attribute valence among online participants. Participants (Pilot Survey 1: n = 172 MTurkers, Pilot Survey 2: n = 164 MTurkers; recruited using CloudResearch; Litman et al., 2017) read the following instructions:
Imagine a child custody case following a messy divorce. One of the parents must have custody of the child. The parents are described by various attributes below. For each attribute, please indicate on the provided scale how ‘positive’ (good) or ‘negative’ (bad) in your opinion it is for a parent to have that attribute.
Participants were then presented with the original 10 attributes as well as several new attributes, and they rated each attribute’s valence on a scale from −5 (highly negative) to 0 (neutral) to 5 (highly positive). Both surveys followed this procedure; the second survey was used to rate additional attributes for new versions of the Custody Problem. We preregistered both Pilot Survey 1 (https://aspredicted.org/by4jy.pdf) and Pilot Survey 2 (https://aspredicted.org/my3ib.pdf) through AsPredicted.org. Data, materials, and analysis code for all studies are available on the Open Science Framework (https://osf.io/cxst6/). Ratings for all attributes from both surveys are presented in Table 1.
Note: Descriptive statistics, reported as means (SDs), from the two pilot surveys. Attributes were rated from −5 (highly negative) to 0 (neutral) to 5 (highly positive). The attribute ‘average working hours’ was rated twice in Pilot Survey 1. The second time it was rated, the average rating was 1.98 (SD = 1.77).
As was to be expected, we found that the valence of some of the original attributes had changed. Interestingly, several attributes deemed neutral 30 years ago—average working hours, reasonable rapport with the child, relatively stable social life—were now quite positive. In fact, the impoverished parent, viewed neutrally three decades earlier, was viewed quite positively in 2022.
4. New (properly normed) versions of the Custody Problem
Based on the updated attribute ratings, we composed two new versions of the Custody Problem. These were close variations on the original problem with updated attributes intended to correct for changes in attribute perceptions over time. In the first new version, we ensured that the impoverished parent’s attributes (Parent A below, though actual order of presentation was counterbalanced) were all relatively neutral (absolute value of average valence ratings below .55). We further ensured that the enriched parent’s attributes (Parent B below) were either highly positively rated (three attributes rated 2.23 or higher) or negatively rated (two attributes rated −.92 or lower) in a combination roughly approximating the original enriched parent’s attribute ratings. The instructions were identical to those in Shafir (1993) and reproduced in the context of the Original Problem above. For both versions, we followed Simonsohn’s (2015) guidelines for powering replications by recruiting enough participants to ensure at least 2.5 times the original sample size (170), plus an additional cushion to detect even smaller effects.
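To illustrate the norming logic just described, the following Python sketch sorts attributes by their mean valence ratings. It treats the bounds reported above (.55, 2.23, and −.92) as if they were selection cutoffs, which is our simplifying assumption; the text reports them only as descriptions of the attributes that were chosen. The attribute names and ratings in the example are hypothetical placeholders, not values from Table 1.

```python
# Bounds reported for New Custody Problem 1, treated here as cutoffs
# (an assumption made for illustration only).
NEUTRAL_ABS_MAX = 0.55   # impoverished option: |mean valence| below .55
POSITIVE_MIN = 2.23      # enriched option: highly positive attributes
NEGATIVE_MAX = -0.92     # enriched option: negative attributes


def classify_valence(mean_rating):
    """Label a mean rating on the -5..5 scale by the role it could play."""
    if abs(mean_rating) < NEUTRAL_ABS_MAX:
        return "neutral"      # candidate for the impoverished parent
    if mean_rating >= POSITIVE_MIN:
        return "positive"     # candidate reason to award custody
    if mean_rating <= NEGATIVE_MAX:
        return "negative"     # candidate reason to deny custody
    return "unassigned"       # falls between the bands


# Hypothetical ratings, NOT taken from the pilot surveys.
hypothetical_ratings = {
    "attribute_a": 0.30,
    "attribute_b": 2.80,
    "attribute_c": -1.50,
    "attribute_d": 1.20,
}
labels = {name: classify_valence(r) for name, r in hypothetical_ratings.items()}
print(labels)
```

Attributes that land in the "unassigned" band (mildly positive or mildly negative) are the ones most likely to blur the enriched versus impoverished contrast, which is the problem the updated norming was meant to avoid.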
We administered the first new Custody Problem in Study 1, preregistered through AsPredicted.org (https://aspredicted.org/n94mv.pdf) and run on Prolific. It is shown below, along with the percentage of participants who selected each option in the choose and the reject conditions (n = 552; Footnote 3):
We found that the majority (71%) of participants asked to award custody awarded custody to the enriched parent, and the majority (57%) of participants asked to deny custody denied custody to the enriched parent. Mirroring Shafir’s (1993) results, the sum of percentages of participants selecting the enriched option in the two conditions exceeds the 100% that would be expected if choosing and rejecting were complementary, instead totaling 128%, z = 6.64, p < .001. (Chandrashekar et al., 2021, reported one-tailed tests; we follow their convention, but the p-values for this problem and the next problem are below .001 with both two-tailed and one-tailed tests.)
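For readers who wish to see how a test of complementarity can be computed, the sketch below implements one standard approach: a pooled two-proportion z-test comparing the share choosing the enriched option with the share not rejecting it. Under complementarity these two shares are equal, so their difference equals the excess of the summed percentages over 100%. Two assumptions are ours rather than stated in the text: that the 552 participants were split evenly across conditions (276 each), and that the rounded percentages (71% and 57%) are adequate inputs. The code is illustrative and is not necessarily the exact analysis in the posted materials.

```python
import math


def complementarity_z(p_choose, p_reject, n_choose, n_reject):
    """z statistic for H0: p_choose + p_reject = 1.

    p_choose: share selecting the enriched option when asked to choose.
    p_reject: share selecting the enriched option when asked to reject.
    Equivalent to a pooled two-proportion test of p_choose vs (1 - p_reject).
    """
    p_not_rejected = 1.0 - p_reject
    pooled = (p_choose * n_choose + p_not_rejected * n_reject) / (n_choose + n_reject)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_choose + 1.0 / n_reject))
    return (p_choose - p_not_rejected) / se


# Study 1 percentages from the text; the equal split (276 + 276) is assumed.
z = complementarity_z(0.71, 0.57, 276, 276)
print(round(z, 2))  # 6.64
```

With these inputs the statistic comes out close to the reported z = 6.64; small discrepancies would be expected if the actual cell sizes were unequal or if unrounded proportions were used.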
For the sake of robustness, we composed a second new version with a slightly different goal. Whereas the New Custody Problem 1 intentionally approximated the attribute valences in the original problem, it used a different set of attributes. In our second new version, we sought to more closely approximate the attributes used in the original, while still aiming to compose fairly balanced enriched and impoverished options. Accordingly, we assigned the impoverished parent (Parent A below) two of the original traits that were currently somewhat positively rated (‘average income’ and ‘average health’), while including one slightly negatively rated attribute (‘not very strict about household rules’) and keeping the other attributes close to neutral. We then composed the enriched option using the same attributes as in the original, with two attributes highly positively rated and one attribute highly negatively rated, along with two slightly negative attributes. We reasoned that revising the original impoverished parent’s currently highly positively rated attributes may allow us to retain the original enriched option and replicate a pattern similar to the original.
We administered the second new version in Study 2, preregistered through AsPredicted.org (https://aspredicted.org/fg7vn.pdf) and run on Prolific. The New Custody Problem 2 is shown below, along with the percentage of participants who selected each option in the choose and the reject conditions (n = 564; Footnote 4):
As before, and similar to the original, the sum of the percentages of participants who selected the enriched option in the two conditions exceeds the 100% expected if choosing and rejecting were complementary, instead totaling 119%, z = 4.40, p < .001. Both new versions of the Custody Problem yielded results that are consistent with the original findings of Shafir (1993).
5. Concluding thoughts about ‘very close replications’
We began by gauging the valence of possible attributes to be assigned to the impoverished and enriched options in the Custody Problem, and we then composed two new versions of the problem that yielded results consistent with Shafir’s (1993) findings. We found that the attributes used in the original Custody Problem had different valences three decades later, sufficient to render the original problem inadequate for testing Shafir’s original research question. In fact, the results of the ‘very close replications’ of Chandrashekar et al. (2021) and Klein et al. (2018) show that the original problems run 30 years ago do not reproduce the original findings at present. (We also ran the original later in 2021 and failed to obtain the 1993 results, as reported in Cheek and Shafir, 2024.) Nonetheless, the original decision pattern documented in Shafir (1993), with an asymmetry between choice and rejection occurring due to the enriched option obtaining more than its fair share relative to the impoverished option, appears alive and well.
We are not the first to raise concerns about the practice of simply readministering the same experimental materials in a sufficiently different context and then interpreting the different results as a sign of the initial findings’ unreliability (e.g., Crandall and Sherman, 2016; Gergen, 1973; Schwarz and Strack, 2014). Yet, the practice appears to persist without fully addressing these concerns. We view this as problematic and in need of refinement. When stimuli are context- or time- (or, obviously, language-) sensitive, running them ‘as is’, in different contexts, years later, in a different cultural context, or in a language foreign to the respondents, without the appropriate updates or translations, is bound to mislead. From the Me Too movement to diets, commuting, bragging presidents, and family roles, enough has changed in the United States over the past 30 years to warrant a revision of the attributes used to describe relevant stimuli before certain replications ought to be explored. Otherwise, ‘very close replications’ do not seem, from a theoretical perspective, very close after all.
At a minimum, stimuli need to be revisited, in many circumstances preferably in collaboration with the original authors, to ensure that they afford a valid test of the original hypothesis (see also Fiedler et al., 2021 on manipulation checks). In addition, researchers attempting replications need to discuss their findings with greater subtlety. It is problematic, in our view, to conclude that a decades-old finding ‘does not replicate’ without sufficient consideration for the different contexts (time, culture, norms, associations, etc.) in which the studies occurred. In some circumstances, all that replicators should be free to conclude is that the original materials, years later, or in different contexts, do not produce the same pattern of results, a finding that we find of limited insight by itself.
There is, furthermore, a concern that replications with little theoretical validity can impede research progress. Chandrashekar et al. (2021) used the original materials from Shafir (1993) not only in an attempt at a replication, but also to address some theoretical work around the proposed mechanism of ‘accentuation’ (Wedell, 1997). Because of their faulty stimuli, however, it may be difficult to know what to make of the rest of their theoretical treatment.
In his analysis of choice–reject discrepancies, Shafir (1993) attributed the observed patterns to the well-established principle of compatibility, where the weighting of inputs is enhanced by their compatibility with the response (see, e.g., Kornblum et al., 1990; Proctor and Reeve, 1990; Shafir, 1995; Slovic et al., 1990 for relevant discussions). Wedell (1997) presented data in tension with this interpretation, and argued for an ‘accentuation’ hypothesis, according to which a greater demand for justification in choice compared with rejection leads to accentuation of the differences between alternatives in the choice condition. Cheek and Shafir (2024) document several more choose–reject patterns and discuss the contributions of both compatibility and accentuation, as well as the significant role of simple response errors, in contributing to the emergence of the choose–reject discrepancy.
We hope that this brief reply contributes both to basic research on decision-making and to more careful design and interpretation of replication studies. For now, what is clear is that the choose–reject patterns documented more than 30 years ago are, given the necessary updates, real and replicable.
Data availability statement
The datasets for this article can be found at https://osf.io/cxst6/.
Funding statement
This research received no specific grant funding from any funding agency, commercial, or not-for-profit sectors.
Competing interest
The authors declare none.
Appendix
Additional details about data collection and demographic characteristics of participants are reported for all studies below.
Pilot Survey 1
We recruited 200 participants from Mechanical Turk using CloudResearch (Litman et al., 2017). To be included in analyses, participants had to pass an instructional manipulation check and confirm that they did not respond randomly. A total of 172 participants met these criteria and were included in analyses. This sample included 82 women, 88 men, and 2 nonbinary participants. In terms of race and ethnicity, 22 participants identified as Black, 11 identified as Asian, 10 identified as Latinx, 4 identified as Native American, 124 identified as White, 6 identified as Multiracial, and 2 identified with additional categories (numbers may exceed sample size because participants could select multiple categories). The average age of participants was 38.16 (SD = 11.57) and age ranged from 19 to 72.
Pilot Survey 2
As in Pilot Survey 1, we used CloudResearch to recruit 200 participants from Mechanical Turk and excluded those who failed an instructional manipulation check and/or indicated that they responded randomly, leaving a final sample of 164. This sample included 82 women, 81 men, and 1 nonbinary participant. In terms of race and ethnicity, 22 participants identified as Black, 14 identified as Asian, 14 identified as Latinx, 5 identified as Native American, 116 identified as White, 8 identified as Multiracial, and 2 identified with additional categories (numbers may exceed sample size because participants could select multiple categories). The average age of participants was 38.22 (SD = 11.90) and age ranged from 19 to 70.
Study 1
In Study 1, we aimed to recruit 600 participants through Prolific to ensure that we had more than 2.5 times (see Simonsohn, 2015 for details) the original sample size of Shafir (1993), along with additional power to further detect potentially smaller effect sizes within our budgetary constraints. In total, 600 participants completed the study, of whom 552 met the inclusion criteria (same as previous studies). This sample included 235 women, 299 men, 17 nonbinary participants, and 1 who did not report gender. In terms of race and ethnicity, 47 participants identified as Black, 75 identified as Asian, 60 identified as Latinx, 5 identified as Native American, 376 identified as White, 30 identified as Multiracial, and 2 identified with additional categories (numbers may exceed sample size because participants could select multiple categories). The average age of participants was 36.42 (SD = 13.14) and age ranged from 18 to 93.
Three hundred sixty-six participants reported annual personal incomes ranging from $0K-$50K, 137 reported annual personal incomes ranging from $50K-$100K, and 48 participants reported annual personal incomes above $100K. Two hundred forty-four participants reported annual household incomes ranging from $0K-$50K, 180 reported annual household incomes ranging from $50K-$100K, and 127 reported annual household incomes above $100K. On a 10-point scale, participants reported an average subjective social status of 4.76 (SD = 1.78) and on a scale from 0 (completely liberal) to 100 (completely conservative) participants leaned liberal, with an average rating of 34.18 (SD = 28.05).
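The sample size planning described for Study 1 reduces to simple arithmetic, sketched below in Python. The multiplier of 2.5 and the original sample size of 170 come from the text; the function name and the check against the final samples (552 and 564, as reported in this article) are our own illustration.

```python
import math

ORIGINAL_N = 170   # sample size of the original Custody Problem (from the text)
MULTIPLIER = 2.5   # Simonsohn's (2015) recommended multiple of the original n


def minimum_replication_n(original_n, multiplier=MULTIPLIER):
    """Smallest whole-number sample meeting the 2.5x guideline."""
    return math.ceil(original_n * multiplier)


required = minimum_replication_n(ORIGINAL_N)
print(required)  # 425

# Final analyzed samples after exclusions, as reported in the article.
for label, n in {"Study 1": 552, "Study 2": 564}.items():
    print(label, n, n >= required)
```

Both final samples clear the 425-participant minimum even after exclusions, which is the sense in which recruiting 600 participants provided a cushion for detecting smaller effects.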
Study 2
Following the sample size planning of Study 1, we recruited 600 participants through Prolific, of whom 564 met the inclusion criteria (same as previous studies). This sample included 244 women, 301 men, and 19 nonbinary participants. 48 participants identified as Black, 56 identified as Asian, 45 identified as Latinx, 7 identified as Native American, 413 identified as White, 19 identified as Multiracial, and 2 identified with additional categories (numbers may exceed sample size because participants could select multiple categories). The average age of participants was 38.36 (SD = 14.41) and age ranged from 18 to 80.
A total of 359 participants reported annual personal incomes ranging from $0K to $50K, 147 reported annual personal incomes ranging from $50K to $100K, and 58 participants reported annual personal incomes above $100K. In addition, 206 participants reported annual household incomes ranging from $0K to $50K, 212 reported annual household incomes ranging from $50K to $100K, and 146 reported annual household incomes above $100K. On a 10-point scale, participants reported an average subjective social status of 5.11 (SD = 1.73) and on a scale from 0 (completely liberal) to 100 (completely conservative) participants leaned liberal, with an average rating of 33.27 (SD = 27.09).