Political scientists often rely on individual responses to survey questions to empirically capture important theoretical concepts and to compare individuals from different groups. The ideological left-right (LR) dimension is one such concept. This broad, shared summary of complex political reality emerges, as Benoit and Laver remind us, ‘because people over the years have found them simple and effective ways to communicate their perceptions of [the] similarity and difference[s]’ between political parties, politicians, and voters (Benoit and Laver 2012, 198). Given this, it is natural that researchers frequently use it to develop and test theories of mass political behaviour. Moreover, the LR metaphor in politics is ubiquitous, from the daily conversation of citizens to debates among political elites across borders and around the globe. Hence, questions about parties’, politicians’, and voters’ positions on the LR scale have been regularly employed in many cross-national surveys (for example, the CSES, the European Social Survey, the ISSP, Latinobarometro, the World Values Survey, and numerous national election surveys). As a result, it is not uncommon to find the LR self-placement scale used for making cross-national comparisons.
Indeed, some scholars have directly compared voters' LR self-placement cross-nationally, with the assumption that LR scales are generally an appropriate instrument for cross-national tests (Dassonneville 2021; Freire 2008; Knutsen 1998; Medina 2015; Meyer 2017; Noelle-Neumann 1998). For example, in his study on the stability of ideological orientations of electorates in West European countries, Knutsen (1998, 297) used the population mean of LR self-placements and noted that ‘the mass public in Ireland was the most rightist. […] followed by Germany, Belgium […]’. Similarly, Medina (2015) relied on the country mean of LR self-placement scores to describe which electorate was further to the right or left among European countries.
Nevertheless, other scholars have raised concerns about such cross-national comparisons because of the potential for Differential Item Functioning (DIF) in the LR concept. For instance, some have investigated interpersonal differences in how individuals interpret the left and right metaphor more generally (for example, Bauer et al. 2017; Thorisdottir et al. 2007; Zuell and Scholz 2019); others have focused on developing scaling techniques to make placements of respondents and/or political actors (for example, political parties) more comparable (for example, Lo, Proksch, and Gschwend 2014; Weber 2011). However, despite these efforts, the findings on the comparability of the LR self-placement remain rather inconclusive, varying with the sample and the methodological approach. Moreover, even if we could tell whether the LR self-placement is cross-nationally (in)comparable, little is known about the extent of such cross-national (in)comparability and about how to sample a set of countries that are more comparable.
In this note, we join the above literature by focusing on a specific kind of DIF that may make the LR self-placement cross-nationally incomparable – the DIF that occurs when respondents in different countries systematically differ in how they map the underlying continuous scale of an attitudinal variable onto its ordinal answer categories. Since we are interested in cross-national (in)comparability (rather than interpersonal incomparability), we call this problem ‘cross-national DIF’ (CN-DIF) in this paper. In our definition, this type of response-category CN-DIF occurs when, for example, citizens in Spain assess the health condition of a woman in her fifties ‘feeling chest pain and getting breathless after walking 200 meters’ as a six on an eleven-point healthiness scale, while citizens in France assess the same health condition as a three on the same scale. Of course, there could also be differences in scale use among individuals; however, our interest is the incomparability problem that arises when the response categories are interpreted systematically differently across populations in different countries.
Using anchoring vignettes as a diagnostic tool, we (1) quantify the degree of CN-DIF of a given concept (and the scale that measures it) and (2) identify problematic cases (that is, countries that are relatively incomparable to others) in which respondents use the scale differently from respondents in other countries. We then apply our measure of CN-DIF – $R_{\text{CN-DIF}}$ – to our original surveys in nine European countries as well as to several benchmark studies in political science that utilized anchoring vignettes to assess the cross-national comparability of other important concepts – namely, democracy, political interest, political efficacy, and experts' assessment of the LR positions of parties.
With our proposed measure and an original survey in which we ask respondents to place several hypothetical parties on the traditional eleven-point LR scale, we find that the LR scale suffers relatively little from the kind of CN-DIF we investigate here (that is, cross-national differences in the use of the response scale), in so far as the concept is considered in policy terms and the comparison is made between Western European countries. Moreover, our results are in line with previous findings in identifying heterogeneous entities that scholars should be wary of when making comparisons across groups or when determining their grouping strategies. Overall, our work makes a methodological contribution to the broad literature on comparative political behaviour by offering a diagnostic tool for survey practitioners – particularly those interested in making cross-national comparisons – to empirically capture the extent to which a given theoretical concept suffers from CN-DIF and to identify the problematic cases that cause greater incomparability within a given sample of countries.
Assessing Cross-National DIF Using Anchoring Vignettes
DIF refers to the problem in which individuals in different groups (in our case, countries) provide systematically different answers to survey questions because of artefactual elements in the measurement process. For example, despite the common wording used for LR self-placements in many surveys, respondents may interpret the question and scale in ways that undermine the cross-national comparability of the resulting measures. In general, there are three major sources of such cross-national incomparability in the use of the LR scale.
First, while many people may think of the LR as an aggregate dimension of various policy domains, other factors such as individual partisanship, long-term values, and social position also play a prominent role in predicting LR self-placements (for example, Dassonneville 2021; Freire 2006; Inglehart and Klingemann 1976; Medina 2015). If the importance of these sources differs across countries, cross-national comparability will suffer. Second, even when they think of the LR in policy terms, respondents in different countries may aggregate different kinds of policies into their LR position.Footnote 1 Third, even if respondents think of the concept of LR in a similar fashion (for example, in terms of the same set of policy areas), respondents in different countries may systematically differ in how they map the underlying scale of the concept onto its ordinal answer categories.
We examine this last potential source of cross-national incomparability – the CN-DIF that results from respondents in some countries systematically using the typical response scales differently from respondents in other countries. For instance, the underlying extent of ‘left-ness’ that causes someone in one country to label herself a ‘4’ may be the same level that causes a respondent in a different country to label herself a ‘2’. At the same time, we focus the content of the vignettes on policies. Specifically, we use vignettes to explore the extent of CN-DIF resulting from the differential interpretation of answer categories by priming respondents to think of the LR question explicitly in policy terms with respect to a specific set of policies. For example, suppose voters across European countries define the concept of LR using a similar set of policies. In that case, the remaining source of cross-country incomparability of the LR concept will primarily come from the differential interpretation of the answer categories.
Anchoring vignettes are a useful tool to identify and ameliorate DIF caused by differing interpretations of the ‘cut-points’ defining answer categories (King et al. 2004). The technique utilizes respondents' assessments of one or more vignettes, which are then used to assess the extent to which (groups of) individuals use the scale differently (identification) and to re-scale their self-assessments relative to where they place the corresponding vignettes (correction).Footnote 2 In this note, we use the anchoring vignette technique primarily for diagnostic rather than corrective purposes: to assess the extent to which these data can be used to make reliable comparisons of LR self-placements across countries. Such a diagnostic effort is very much in keeping with the scholarly agenda articulated in King et al. (2004), which clearly anticipated (and implicitly encouraged) this kind of diagnostic use by stating that ‘[…] researchers who are confident that their survey questions are already clearly conceptualized, are well measured, and have no DIF now have the first real opportunity to verify empirically these normally implicit but highly consequential assumptions’ (p. 205). In our diagnostic use of anchoring vignettes, our primary concern is the DIF that is systematically related to the respondent's nationality rather than more general forms of DIF among individuals.
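To make the correction step concrete, the following is a minimal Python sketch of the non-parametric rescaling described by King et al. (2004): a respondent's self-placement is recoded relative to where that same respondent placed the ordered vignettes, so that the corrected scores are comparable even when respondents use the raw scale differently. The function name and the simplified handling of ties are ours for illustration; this is a sketch of the idea, not the exact procedure used in any particular study.

```python
def nonparametric_vignette_correction(self_placement, vignette_placements):
    """Recode a self-placement relative to the respondent's own placements
    of J vignettes that are expected to be ordered (here, left to right).

    Returns an integer on a 1..(2J+1) scale: odd values fall between
    vignettes, even values coincide with a vignette. Returns None when the
    respondent's vignette placements violate the expected (weak) ordering,
    in which case the non-parametric correction is undefined.
    """
    z = list(vignette_placements)
    if any(a > b for a, b in zip(z, z[1:])):  # ordering violation
        return None
    score = 1
    for v in z:
        if self_placement > v:
            score += 2              # strictly to the right of this vignette
        elif self_placement == v:
            return score + 1        # tied with this vignette
        else:
            return score            # strictly to the left of this vignette
    return score                    # to the right of all vignettes


# Example: a self-placement of 5 with vignettes placed at 2, 5, and 8
# maps to 4 on the corrected 7-point (2*3 + 1) scale.
print(nonparametric_vignette_correction(5, [2, 5, 8]))  # -> 4
```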
Constructing Vignettes for the Placement of Parties on the Left-Right Dimension
Our first task is to construct a set of vignettes that describe different LR ideological positions. The vignettes must be designed in a way that encourages respondents to perceive the scale as unidimensional – in our case, a unidimensional aggregate of a specific set of policy dimensions.Footnote 3 To achieve this, we focus on four specific sub-dimensions that largely constitute the LR concept, particularly in Western democracies: regulation of the economy, support for redistribution, the size and scope of government, and attitudes toward cultural diversity. These policy areas are known to be linked with the LR dimension at both the elite and individual levels across Western European countries (for example, Benoit and Laver 2012; Van der Brug and Van Spanje 2009; Wojcik, Cislak, and Schmidt 2021).Footnote 4 We then vary the level of each sub-dimension monotonically such that the levels of each of these dimensions move in the same direction from the leftist vignette to the rightist vignette. In other words, the leftist vignette describes the most leftist positions on all four policy issues; the rightist vignette describes the most rightist positions on all policy areas; and the centrist vignettes describe positions in between these. By explicitly providing these vignettes before asking respondents to place themselves on the same scale, we prime the respondent to think about the LR concept as an aggregator of these dimensions (Hopkins and King 2010). Doing so allows us to concentrate our analysis on the kind of DIF that results from the differential interpretation of ordinal answer categories rather than from alternative interpretations of the whole concept.
For several reasons, we construct the vignettes to describe hypothetical parties (rather than hypothetical individuals). As Bauer et al. (2017) empirically demonstrate, citizens often associate political parties with the concepts of left and right. Indeed, political parties play a critical role in shaping citizens' LR orientations (Inglehart and Klingemann 1976). Doing so also mimics how typical surveys are designed when asking about respondents' self-placement on the LR scale: many cross-national and national election surveys ask about respondents' self-placement alongside questions about how they perceive the positions of other political actors, such as political parties, governments, and prominent political figures. Therefore, similar to the vignettes used in Bakker et al. (2014), we create several hypothetical parties as our vignettes to gauge the way citizens use the LR scale. Implicit in this process is that respondents compare themselves with the parties described in the vignettes, which is exactly the key process expected in the ‘anchoring vignettes’ approach.
Figure 1 presents the instructions and vignettes used in our survey. The vignettes can be arranged on an ordered, unidimensional scale, ranging from the left-wing party (Party A) to the centre (or centre-right) party (Party B) and the right-wing party (Party C). We randomized the order of the vignettes and asked the respondents to evaluate the LR position of the three hypothetical parties, followed by the respondents rating themselves on an eleven-point scale. Our original cross-national survey was fielded in early 2020 in nine European countries, using internet panels of respondi AG (roughly 2,000 responses per country): France, Germany, Hungary, Italy, the Netherlands, Poland, Spain, Sweden, and the UK.Footnote 5 Given that the broader literature suggests discrepancies in the meaning of the LR between Western and Eastern European countries (for example, Tavits and Letki 2009; Wojcik, Cislak, and Schmidt 2021), the two Eastern European countries – Poland and Hungary – were included as a litmus test of the performance of our CN-DIF measure, in the expectation that these countries would be relatively incomparable to the other Western (and Southern) European countries in our sample.
We performed a series of tests that have been carried out in prior research using anchoring vignettes (for example, Bratton 2010; King et al. 2004; Lee, Lin, and Stevenson 2015). We provide the results and relevant discussions in Online Appendix B and summarize them as follows. First, we evaluated the extent to which respondents perceive the scale of interest as unidimensional – that is, the vignette equivalence test – by looking at how many respondents place the vignettes on the same scale in the order we expected. Second, we investigated whether there are systematic differences in vignette placements because, even when respondents place vignettes in the same order, people in some countries may systematically shift all vignettes to the left or right of the scale or use only part of the scale. Finally, we compared respondents' self-placements before and after any DIF was corrected (as described by King et al. 2004). This last test includes both parametric and non-parametric approaches to correcting DIF and compares country rankings before and after correction. The presence of severe CN-DIF would lead to dramatic differences between the raw self-placements and the corrected measures. While these tests help to understand the extent of CN-DIF in the standard measure of the LR self-placement, the answer is not completely straightforward.Footnote 6
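As an illustration of the first of these diagnostics, the short Python sketch below computes, for each country, the share of respondents who place the vignettes in the theoretically expected (weakly increasing) left-to-right order. The column names ('party_a', 'party_b', 'party_c', 'country') are hypothetical placeholders for a respondent-level data frame; this is a sketch, not the authors' replication code.

```python
import pandas as pd

def share_correctly_ordered(df, vignette_cols=("party_a", "party_b", "party_c"),
                            by="country"):
    """Share of respondents, by country, whose vignette placements are
    weakly increasing in the expected left-to-right order (ties allowed)."""
    placements = df[list(vignette_cols)]
    # Column-wise differences; a weak ordering requires all of them to be >= 0.
    ordered = (placements.diff(axis=1).iloc[:, 1:] >= 0).all(axis=1)
    return ordered.groupby(df[by]).mean()
```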
We thus reanalysed several benchmark studies to compare the results. We chose these benchmark studies because they examined DIF in survey measures of other important concepts in political science – namely, King et al.'s (2004) study on political efficacy; Bratton's (2010) research on the assessment of democracy; Bakker et al.'s (2014) work on experts' LR placement of political parties; and Lee, Lin, and Stevenson's (2015, 2016) studies on political interest. All results from the above-mentioned tests, along with re-analyses of the benchmark studies, are reported in Online Appendix B. To summarize, our diagnostic tests suggest that, in general, the LR scale suffers relatively little from CN-DIF and is clearly less problematic than the case of political efficacy in King et al. (2004).
We understand that these diagnoses may not be sufficient to answer whether the extent of CN-DIF we reveal in these analyses is low enough to assure researchers that the LR self-placement is truly cross-nationally comparable. The answer to this question may only be suggestive and relative rather than definitive. Nevertheless, there are ways to obtain better insights and clearer standards for such an evaluation. In the following section, we propose a measure indicating the extent of CN-DIF and compare our results with other benchmark studies beyond the obviously high-CN-DIF case of political efficacy.
Assessing the Extent of Systematic Cross-National Variation in Vignette Placement
Measuring CN-DIF
To approximate CN-DIF, we take a ‘parsing variance’ approach. A typical example of this approach is the estimation of the intraclass correlation coefficient (ICC), which measures the ratio of between-group variance to the total variance. Greater values of the ICC indicate that a large portion of the total variance is attributable to between-group differences. In our case, theoretically, the total variance of vignette placements for a sample of individuals grouped by country and vignette can be decomposed into the variance between vignettes (on average), the average variance between countries (within vignettes), and the remaining variance across individuals within countries and vignettes. When CN-DIF exists, a large portion of the total variance is expected to come from the variation across countries. We can formally compute these quantities to get a sense of the level of CN-DIF. A simple way to compute these quantities is to estimate a multi-level model for vignette placements in which

$$y_{kji} = \alpha + u_k + u_j + u_{jk} + e_{kji},$$

where $y_{kji}$ denotes the placement of vignette $j$ in country $k$ by respondent $i$. Here, $\alpha$ represents the constant, $u_k$ represents the random intercepts for $k = 1, \ldots, K$ countries, $u_j$ represents the random intercepts for each of $j = 1, \ldots, J$ vignettes, $u_{jk}$ represents the random intercepts for the $J \times K$ country-vignette combinations, and $e_{kji}$ represents the residual that captures the random effect on placements attributable to unmeasured factors idiosyncratic to the individual-country-vignette combination.
Assuming that each of the random-effect terms is distributed independently normal with zero mean, each will contribute a variance term to the likelihood function, and estimates of these variance terms can be used to produce direct measures of the proportion of variance attributable to vignettes vs. countries (as well as estimates of the uncertainty around this proportion). We take the proportion of the variance attributable to country (as against the total defined by the variances attributable to country, vignettes, and country-vignette combinations) as a measure of CN-DIF. The ratio is calculated as:

$$R_{\text{CN-DIF}} = \frac{\sigma^2_{\text{country}}}{\sigma^2_{\text{country}} + \sigma^2_{\text{vignette}} + \sigma^2_{\text{country} \times \text{vignette}}}.$$
As a proportion, $R_{\text{CN-DIF}}$ ranges from 0 to 1. If the proportion of variance attributable to country is near zero, little of the variation in the vignette placements can be attributed to unmeasured factors relevant to the respondents' nationalities. In contrast, if the proportion is near 1, the variation in vignette placements is much more closely related to respondents' nationalities than to differences between vignettes. We estimate the necessary quantities using the individual-level data.Footnote 8 Table 1 presents the estimates of $R_{\text{CN-DIF}}$, the proportion of the variance attributable to between-country differences, for each of the benchmark studies and for our own data.
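For researchers who want to compute this quantity on their own data, the Python sketch below fits the crossed random-intercept model with statsmodels and forms the variance ratio defined above. The column names ('placement', 'country', 'vignette') are illustrative assumptions, the single-group device is one way of specifying fully crossed random effects through the variance-components interface, and the exact attribute names for extracting the estimates may vary across statsmodels versions; this is a sketch of the approach, not the authors' replication materials.

```python
import statsmodels.formula.api as smf

def estimate_r_cn_dif(df):
    """R_CN-DIF: share of the (country + vignette + country*vignette)
    variance in vignette placements that is attributable to country.

    Expects one row per respondent-vignette pair with columns
    'placement', 'country', and 'vignette' (illustrative names).
    """
    data = df.copy()
    data["whole_sample"] = 1  # single group, so all random effects are crossed
    vc = {
        "country": "0 + C(country)",
        "vignette": "0 + C(vignette)",
        "country_x_vignette": "0 + C(country):C(vignette)",
    }
    # re_formula='0' drops the per-group random intercept; only the
    # variance components above remain in the random part of the model.
    model = smf.mixedlm("placement ~ 1", data=data, groups="whole_sample",
                        vc_formula=vc, re_formula="0")
    fit = model.fit()
    # fit.vcomp holds the estimated variance components, in the order given
    # by model.exog_vc.names (attribute names assumed; check your version).
    sigma2 = dict(zip(model.exog_vc.names, fit.vcomp))
    total = sigma2["country"] + sigma2["vignette"] + sigma2["country_x_vignette"]
    return sigma2["country"] / total
```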
Our results for the benchmark studies are largely consistent with what the authors concluded in their original works (reported in the third column). For instance, the proportion of variance attributable to country relative to vignette is around 0.9 in King et al.'s study of political efficacy, which is much greater than what we observe in the other studies; this is in line with King et al.'s conclusion that political efficacy is not comparable across countries, at least between China and Mexico. Meanwhile, the lower levels of variance associated with country for the works on democracy, experts' LR placement of parties, and political interest indicate that the concepts studied in these works do not suffer from serious CN-DIF. Again, this is consistent with what the authors of these studies initially concluded. For instance, Bratton (2010) claims that ‘[the study's] result provides a preliminary rebuttal against the cynical claim that the original “D-word” formulation is completely incomparable’ (p. 112).
When it comes to our own study, $R_{\text{CN-DIF}}$ is 0.112. On this measure, our LR policy scale clearly looks much more like the concepts without significant CN-DIF (for example, political interest and democracy) than like those with it (for example, political efficacy). Moreover, compared with a similar study examining whether experts' LR placements of parties are comparable across countries (Bakker et al. 2014), our results suggest that average voters perform only slightly worse than experts. We take these results as evidence against the idea that CN-DIF might be a serious problem for comparing LR self-placements across countries. Nevertheless, as an aggregate measure (that is, at the level of a concept), $R_{\text{CN-DIF}}$ is largely determined by which countries are included in a study. In the next section, we measure a country-specific $R_{\text{CN-DIF}}$ to help researchers identify the countries that might be less comparable to others.
Identifying Problematic Cases That Behave Differently From Others
The specific issue of CN-DIF addressed in this study arises from the possibility that respondents from different countries systematically use a response scale differently. If this is the case, we expect a greater $R_{\text{CN-DIF}}$ score when grouping countries that are very different from each other to examine a particular concept than when studying the same concept in a group of homogeneous countries. In essence, comparing $R_{\text{CN-DIF}}$ scores from different groupings of countries enables us to compute a country-specific score that indicates whether a country is suitable for inclusion when comparing a concept of interest across countries. This score can offer useful guidance to researchers about the dangers of making comparisons with specific countries.
To calculate a country-specific measure of CN-DIF for each country in each study, we estimate the same random-effect model (as shown in Table 1) for each pair of countries included in the dataset. We then average the scores from all country-pairs that include a given country. The resulting score for a specific country is thus the average variance in vignette placements explained by country (rather than by vignettes) across all possible pairs of countries containing that country, as sketched below. When a country deviates substantially from the others in how respondents place vignettes, we expect to observe a greater score. We compute the country-specific CN-DIF scores and 95 per cent confidence intervals by bootstrapping the individual-level data and repeating the estimation process 1,000 times. The results for our study of LR placement, as well as those for Bratton (2010), Bakker et al. (2014), and Lee, Lin, and Stevenson (2016), are illustrated in Fig. 2.Footnote 9
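A minimal sketch of this pairwise-averaging step, reusing the estimate_r_cn_dif() function from the previous sketch (and, again, an illustrative 'country' column), is given below; the 95 per cent confidence intervals are then obtained by resampling respondents with replacement within countries and repeating the whole computation (1,000 replicates in our application).

```python
import itertools
import numpy as np

def country_specific_cn_dif(df):
    """For every pair of countries, estimate R_CN-DIF on that pair alone,
    then average, for each country, the scores of all pairs containing it.
    Larger values flag countries whose respondents use the scale differently."""
    countries = df["country"].unique()
    pair_scores = {c: [] for c in countries}
    for a, b in itertools.combinations(countries, 2):
        pair = df[df["country"].isin([a, b])]
        score = estimate_r_cn_dif(pair)  # from the sketch above
        pair_scores[a].append(score)
        pair_scores[b].append(score)
    return {c: float(np.mean(s)) for c, s in pair_scores.items()}
```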
Figure 2 provides some guidance regarding countries that should be included and those that may be better excluded. For instance, the results from Bratton's data (upper-right panel) suggest that when studying the perception of democracy or similar concepts cross-nationally in Africa, researchers may want to exclude countries like Botswana, Madagascar, and Malawi to ensure that their inferences or conclusions do not suffer from biases due to potential CN-DIF. Likewise, when comparing political interest cross-nationally (Lee, Lin, and Stevenson 2016, bottom-left panel), researchers may opt against directly comparing the raw reported levels of political interest from Asian countries such as China, Japan, and Korea with those from North American and European countries. The problematic cases identified here largely overlap with the ones the benchmark studies pointed out as countries needing special attention. For instance, Bratton (2010) mentions Botswana and Malawi as countries ‘whose [corrected] ranks change radically’ (p. 112). Lee, Lin, and Stevenson (2016) also specify that China, Korea, and Japan are countries with ‘relatively high scores for the low interest case’, although they contend that there is not enough CN-DIF to substantially undermine their general conclusions.
When it comes to LR party placements made by political experts, our results indicate that there is little CN-DIF, in the sense that no country's score is significantly different from the others'. At the same time, the extremely wide confidence intervals for Greece, Latvia, Lithuania, and Slovenia warrant further attention.Footnote 10 Two of these countries – Greece and Latvia – are described as being less comparable in Bakker et al.'s original work (p. 5), based on their pair-wise comparison of the mean placements of the vignettes.
Finally, the results regarding citizens' LR self-placements suggest that we should be wary of comparing Eastern and Western European countries on LR policy ideology. In our data, respondents in Hungary and Poland seem to use the LR scale very differently from their Western European counterparts. These findings could be explained by several empirical features, such as the tendency for leftist parties in post-communist countries to lean economically conservative while rightist parties tend to be more socially liberal (for example, Kitschelt 1992; Tavits and Letki 2009; Vachudova 2008), and the rise of right-wing populist parties in recent years, which have taken over the traditional voting base of leftist parties in Poland and Hungary (Berman and Snegovaya 2019).Footnote 11 Although it is not a new insight that there are political and cultural differences between Western and Eastern European countries, our CN-DIF measure effectively highlights these differences by providing a quantified description of the extent to which each of the countries in the sample is comparable to the others.
Conclusion and Discussion
A general message from this study is that when voters are primed to consider the LR in terms of the usual policy debates prevalent in Western societies, they attribute a similar meaning to the ordinal categories on which they are asked to record their responses. That is, a person in Spain seems to think of the meaning of a ‘2’ on this LR policy scale similarly to a person in Germany. While this conclusion may seem narrow in scope, it is directly useful for researchers who are specifically interested in measuring the policy content rather than the general notion of LR. Our study suggests that priming respondents to interpret the LR question in relation to specific policy content can be an effective strategy for creating cross-national comparability of the resulting measures, at least within Western democracies. At the same time, this may provide empirical grounds for previous studies that directly compared voters' LR self-placements cross-nationally (for example, Dassonneville 2021; Knutsen 1998; Medina 2015). Beyond this narrow conclusion, this study should also count against blanket critiques of LR self-placement scales that assume voters must use these scales differently across countries.
In addition to our general finding, this work contributes to the field of comparative political behaviour by providing quantitative tools to assess the extent of CN-DIF. Specifically, the measure of CN-DIF we propose, $R_{\text{CN-DIF}}$, can assist future research in identifying the extent of cross-national comparability for a particular concept within a sample of countries. Our country-specific metric further indicates the extent to which a specific country contributes to the incomparability of the concept within the sample of countries. As demonstrated, the cross-national comparability of citizens' LR self-placement may suffer when attempting to compare Western European countries to Eastern European ones. While the primary goal of previous works was to reveal the presence of cross-national incomparability in LR placements (for example, Bauer et al. 2017; Lo, Proksch, and Gschwend 2014; Zuell and Scholz 2019), our goal was to quantify the degree of cross-national incomparability and to identify potentially problematic cases using our CN-DIF measure. Finally, our application of $R_{\text{CN-DIF}}$ to the benchmark studies found that our quantification of CN-DIF yields conclusions similar to those of the original articles, which employed different analytic approaches. We believe this validates the effectiveness of our proposed CN-DIF measure.
Supplementary material
The supplementary material for this article can be found at https://doi.org/10.1017/S0007123423000236.
Data availability statement
Replication Data for this article can be found in Harvard Dataverse at: https://doi.org/10.7910/DVN/LC4QVF.
Acknowledgements
Both authors contributed equally to this work, and the authors' ordering reflects the principle of rotation. The authors thank Randy Stevenson, Ryan Bakker, and the anonymous reviewers for their thoughtful feedback and suggestions.
Financial support
The authors acknowledge financial support from the German Research Foundation (DFG) via Collaborative Research Center (SFB) 884 ‘The Political Economy of Reforms’ (project C1) at the University of Mannheim.
Competing interests
The authors declare no conflicts of interest in this research.