Introduction
Theoretical constructs are at the heart of second language (L2) research. Well-known examples include L2 motivation (Alamer et al., 2023; Dörnyei & Ryan, 2015), L2 emotions (Dewaele & Li, 2020; Nakamura et al., 2021; Pritzker et al., 2019), and L2 anxiety (Alamer & Lee, 2021; Horwitz, 2010). Often, constructs are not directly empirically observable. To make them accessible to empirical research, researchers develop and validate scales and inventories (Dörnyei & Dewaele, 2022; Iwaniec, 2019). Almost by default, L2 scholars rely on the common factor model (also known as the reflective measurement model) to validate their scales, and they mainly employ confirmatory factor analysis (CFA; Jöreskog, 1969) or the newly introduced method, exploratory structural equation modeling (ESEM; Asparouhov & Muthén, 2009; also see Alamer & Marsh, 2022; Marsh & Alamer, 2024) for that purpose. Following the common factor model, scale items are regarded as error-prone measurements of their latent variable, which means that the items are assumed to share a common cause that is responsible for their covariance structure.
However, not all constructs follow this definition. Instead, constructs can also emerge from elements/parts, i.e., the items define and form the construct and not the other way around. In such instances, the construct has the character of a collection or an inventory. Therefore, these types of constructs are called emergent variables (Henseler & Schuberth, 2020; Henseler, 2021). The emergent variable (also called a component, formative construct, or composite construct) “is an abstraction that results from the combined effects of all of the particular measures” (Cole et al., 1993, p. 175). An example from psychology is mother’s availability to interact with and monitor a particular child. This construct merely groups the following three variables, namely, the number of children in a family, illness of the mother, and hours of maternal employment. It can be argued that these variables are distinct in meaning and not interchangeable, which makes them elements, thereby aligning more with the emergent variable perspective. In the following, we argue that L2 research also deals with constructs that are made up of elements. Potential examples are L2 achievement (Papi & Khajavy, 2021; Sparks & Alamer, 2022, 2023), first language skills (Sparks, 2023), and the use of strategies such as language learning strategies (Oxford & Griffiths, 2016), vocabulary learning strategies (Alamer et al., in press), and metacognitive reading strategies (Alamer & Alsagoafi, 2023; Mokhtari & Reichard, 2002). Because emergent variables are made up of their elements but do not cause them, the implied assumptions of the common factor model seem to be invalid.
Therefore, using CFA is hardly suitable for empirically studying emergent variables. So, which method should be used instead?
A recently developed analytical method, namely confirmatory composite analysis (CCA) (Schuberth et al., 2018), allows researchers to empirically investigate emergent variables. CCA is analogous to CFA, with the crucial difference being that CCA builds on the composite model rather than the common factor model (Schamberger et al., 2023). The composite model more accurately captures the characteristics of emergent variables. Apart from this difference, CCA follows the same steps as CFA: model specification, model identification, model parameter estimation, and model assessment. A crucial development in CCA is the recently introduced Henseler–Ogasawara (H–O) specification of composites (Schuberth, 2023; Yu et al., 2023), which makes it possible to conduct a CCA using conventional structural equation modeling (SEM) software packages such as Amos (Arbuckle, 2020), the R package lavaan (Rosseel, 2012), and Mplus (Muthén & Muthén, 1998–2017). Consequently, researchers can gain all the benefits they are accustomed to in SEM with latent variables, e.g., dealing with missing values (e.g., Allison, 2003; Muthén et al., 1987), gaining access to well-established model fit measures (Schermelleh-Engel et al., 2003), and constraining parameters.
The remainder of this paper is structured as follows. In the next section, we argue that, in several cases, L2 researchers could be dealing with emergent variables rather than latent variables. As a contextual example, we refer to the Strategy Inventory for Language Learning (SILL; Oxford, 2011), arguing that CFA is limited when it comes to assessing the SILL because the common factor model underlying CFA does not fit the characteristics of the SILL. As a more suitable approach, we argue that CCA more accurately captures the conceptual definition of the SILL. In addition, we provide guidelines on how to use CCA. Subsequently, we demonstrate the use of CCA by applying it to the Metacognitive Awareness of Reading Strategies Inventory (MARSI; Mokhtari & Reichard, 2002). We conclude our paper with a discussion and then indicate avenues for future research.
Confirmatory factor analysis (CFA) and its underlying assumptions
CFA is a de facto standard technique for empirically validating scales reflecting latent variables in L2 and education research (e.g., Alamer, Reference Alamer2022; Schreiber et al., Reference Schreiber, Nora, Stage, Barlow and King2006; Shao et al., Reference Shao, Shirvan and Alamer2022). It builds on the common factor model to describe the relationship between scale items and a construct. Specifically, the common factor model is grounded in classical test theory (e.g., Lord & Novick, Reference Lord and Novick2008); therefore, it assumes that the items are measurement error-prone manifestations of the construct, i.e., the latent variable causes the items (Fukuta et al., Reference Fukuta, Nishimura and Tamura2023). Consequently, correlations between items are expected as they are assumed to share a common cause, i.e., the latent variable that they purport to measure. For this reason, latent variables are often regarded as ontological entities: “If something does not exist, then one cannot measure it” (Borsboom et al., Reference Borsboom, Mellenbergh and Van Heerden2004, p. 1061). Conventionally, the common factor model assumes that the latent variable is solely responsible for the items’ covariance structure, i.e., the items are unidimensional. Unidimensionality in CFA implies that each item loads on one latent variable only, i.e., all cross-loadings are fixed to zero (e.g., Alamer & Marsh, Reference Alamer and Marsh2022; Marsh & Alamer, Reference Marsh and Alamer2024). Thus, the common factor model assumes, at least in theory, that items from a homogenous pool reflecting the latent variable can be interchanged or dropped without altering the construct’s meaning (e.g., Bollen & Bauldry, Reference Bollen and Bauldry2011). Table 1 summarizes the characteristics of the common factor model.
In the L2 research context, CFA and the common factor model have proven to be useful tools to empirically validate questionnaires that intend to measure phenomena that are not directly observable, i.e., scales that measure latent variables (Alamer & Marsh, 2022; Marsh & Alamer, 2024). Typical examples of latent variables from L2 research are L2 motivation, anxiety, enjoyment, and boredom. However, as we will explain in the following section, the use of CFA to empirically validate questionnaires intended to evaluate phenomena that are defined by a set of elements/parts, i.e., emergent variables, is limited.
Example from the literature illustrating the limitation of CFA in studying emergent variables
To illustrate the limitations of CFA and the common factor model in studying emergent variables, we focus on language learning strategies. Following Cohen (Reference Cohen2014, p. 7), language learning strategies can be defined as “[t]houghts and actions, consciously chosen and operationalized by language learners, to assist them in carrying out a multiplicity of tasks from the very onset of learning to the most advanced levels of target-language performance.” Similarly, learning strategies are “specific actions taken by the learner to make learning easier, faster, more enjoyable, more self-directed, more effective, and more transferable to new situations” (Oxford, Reference Oxford1990, p. 8) and “actions chosen by learners for the purpose of language learning” (Griffiths, Reference Griffiths2018, p. 88).
Although various classes of learning strategies have been proposed over the recent decades, the three most commonly studied classes are cognitive, affective, and social learning strategies (Oxford & Griffiths, Reference Oxford and Griffiths2016). To limit our focus, we discuss only cognitive strategies, which are defined as strategies that language students use to help them understand, transform, and apply their language knowledge (Oxford, Reference Oxford1992). To evaluate the degree to which a learner uses cognitive learning strategies, researchers frequently employ the SILL (Oxford, Reference Oxford1990; Oxford & Griffiths, Reference Oxford and Griffiths2016). In particular, they use the following SILL items to evaluate L2 learners’ use of cognitive strategies (rated on a Likert scale from 1: Never or almost never true of me, to 5: Always or almost always true of me; Oxford, Reference Oxford2011, pp. 102–136). In L2 English learning, the items would be:
- Item 1: I connect the sound of a new English word and an image or picture of the word to help me remember the word.
- Item 2: I use the English words I know in different ways.
- Item 3: I find the meaning of an English word by dividing it into parts that I understand.
- Item 4: I use new English words in a sentence so I can remember them.
- Item 5: I try to find patterns (grammar) in English.
- Item 6: I try not to translate word for word.
To judge the suitability of the common factor model and CFA for modeling and assessing the use of cognitive learning strategies, one should ask the following questions (Bollen & Bauldry, Reference Bollen and Bauldry2011; see also “Model specification” in Table 2): Does a change in the construct lead to a change in all items, i.e., does an increase in the use of cognitive strategies entail an increase in all the items, from 1 to 6? Can the items in principle be interchanged or removed without altering the meaning of the construct? Do we expect high correlations between the items because they share a common cause? For example, does a respondent who answers a certain way to Item 3 “I find the meaning of an English word by dividing it into parts that I understand” respond in a similar way to Item 6 “I try not to translate word for word”?
Considering the questions posed above, we answer all of them with a no. This puts the appropriateness of the common factor model and the application of CFA into question for their ability to model and assess the use of cognitive learning strategies. In contrast, we argue that the use of cognitive learning strategies is an emergent rather than a latent variable, which is composed of elements/parts. Specifically, each item of the SILL represents particular information about a cognitive strategy and how a given learner uses it to master the L2. Together, these items determine how much a learner uses cognitive learning strategies. Consequently, items are not interchangeable because each item is unique and represents a different cognitive strategy. Additionally, removing an item from the construct would probably alter its meaning because the dropped strategy (i.e., the item) cannot be recovered conceptually by any other item in the inventory. Furthermore, substantial correlations between items are not necessarily expected because learners can apply strategies differently. For instance, they might use English words they know in different ways (i.e., Item 2) but not divide the English word into its constituent parts to help determine its meaning (i.e., Item 3). Moreover, the SILL used to evaluate a learner’s use of language learning strategies has often not been empirically supported by the results of previous empirical studies using CFA. For example, various studies observed fit measures indicating unacceptable model fit and/or low factor loading estimates (Hsiao & Oxford, Reference Hsiao and Oxford2002; Paige et al., Reference Paige, Cohen and Shively2004; Tragant et al., Reference Tragant, Thompson and Victori2013). In response to this counterevidence, previous studies have proposed that some of the constructs or items should be removed (Habók & Magyar, Reference Habók and Magyar2018; Hsiao & Oxford, Reference Hsiao and Oxford2002; Yeh, Reference Yeh2014). 
Overall, it seems that the common factor model does not adequately support the conceptual definition of the use of cognitive learning strategies and thus the SILL. From a theoretical perspective, we argue that the use of cognitive learning strategies should rather be identified as an emergent variable; thus, the SILL should be evaluated via CCA and the composite model.
Confirmatory composite analysis (CCA)
CCA was proposed as a tool to empirically assess composite models (Henseler et al., 2014; Schuberth et al., 2018) and introduced to various research fields including business (Henseler & Schuberth, 2020), information systems (Hubona et al., 2021), tourism and hospitality (Liu et al., 2022), and human development (Schamberger et al., 2023). The first application of CCA in the L2 domain was conducted by Alamer and Alsagoafi (2023), who empirically tested the validity of the Revised Metacognitive Awareness of Reading Strategies Inventory (MARSI-R; Mokhtari et al., 2018). In their study, the authors compared the results of CCA and CFA when examining the validity of MARSI-R. They found support for CCA as model fit indices were acceptable in the CCA but not in the CFA, and item weights functioned in the expected direction. Given these recent findings, it appears that the L2 field warrants guidelines and practical tutorials for using CCA (Alamer et al., in press). In the following sub-sections, we present the four steps of CCA, i.e., model specification, model identification, model estimation, and model assessment.
Model specification
To study emergent variables, the composite model can be used (Henseler & Schuberth, 2020). At the heart of the composite model is an emergent variable and not a latent variable. First and foremost, an emergent variable is a weighted linear combination of elements, i.e., it is a composite. Hence, the composite model assumes that the construct is fully defined by its elements. Consequently, an emergent variable is not assumed to exist independent of its elements. This contrasts with the latent variable in the common factor model, which is measured and therefore assumed to exist independent of its measures (Borsboom et al., 2004). However, as Henseler and Schuberth (2023) noted, if the elements exist, so does the emergent variable. Since each element plays a constituent role in forming the construct, omitting an element will alter the construct’s meaning in the composite model. An additional and important property of an emergent variable is that it accounts for the covariances between its elements and other variables in the model. This property is expressed by the axiom of unity (Henseler & Schuberth, 2021a). Thus, an emergent variable conveys all the information shared between its elements and other variables of the model (Dijkstra, 2013, 2017). Hence, an emergent variable acts as a whole and not as a mere loose collection of elements (Henseler & Schuberth, 2021b). Table 1 summarizes the characteristics of the composite model. Note that composites are usually depicted by hexagons (e.g., Grace & Bollen, 2008). However, most SEM software packages with a graphical interface have not implemented this graphical representation yet.
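The idea that an emergent variable is a weighted linear combination of its elements can be illustrated with a minimal Python sketch. The responses and weights below are hypothetical and chosen by hand purely for illustration; in CCA the weights would be estimated from the data.

```python
def composite_scores(rows, weights):
    """Return one composite score per respondent as a weighted sum of elements."""
    return [sum(w * x for w, x in zip(weights, row)) for row in rows]

# Hypothetical responses of four learners to three strategy items (1-5 scale).
responses = [
    [4, 2, 5],
    [3, 3, 4],
    [1, 2, 2],
    [5, 4, 3],
]
# Hypothetical weights (an assumption for this sketch, not estimates).
weights = [0.5, 0.3, 0.2]

full_scores = composite_scores(responses, weights)

# Omitting the second element yields a different composite for every learner,
# which mirrors the claim that dropping an element alters the construct.
reduced_scores = composite_scores([[r[0], r[2]] for r in responses], [0.5, 0.2])

print(full_scores)
print(reduced_scores)
```

Because the composite is defined entirely by its elements and weights, there is no residual or error term in this calculation.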
Initially, the proposition was to express emergent variables in CCA by means of weights because this is highly intuitive (Schuberth et al., 2018). However, this prevents researchers from conducting CCA in SEM, which limits their ability to benefit from SEM’s advantages, such as obtaining fit measures and dealing with missing values (Schuberth, 2023). This is because an emergent variable expressed by weights will always be modeled as a dependent variable. For this reason, it is not possible to specify covariances between an emergent variable and other variables, which is an essential requirement for conducting CCA. Therefore, taking such an approach, researchers could only specify covariances between the emergent variable’s disturbance term and other variables in the model. However, since an emergent variable is assumed to be fully determined by its elements, the variance of this disturbance term must be constrained to zero. Besides the fact that covariances with a constrained disturbance term cannot be specified, this clearly contradicts specifying covariances between emergent variables.
To overcome this limitation, in this study, we use the H–O specification (Henseler & Schuberth, 2021b; Schuberth, 2023), in particular its refined version (Yu et al., 2023). In the H–O specification, not only a single composite, but as many composites as elements are formed from a set of elements, i.e., one emergent and several excrescent variables. The emergent variable depicts the construct of interest, whereas the excrescent variables have no further meaning. They are merely formed to span the space of the elements together with the emergent variable. This approach resembles a principal component analysis (Hotelling, 1933) in which as many principal components as variables are extracted. For a more technical description, we refer the reader to Schuberth (2023) and Schamberger et al. (2023).
Figure 1 presents an example of the H–O specification in the SEM software Amos. In this example, the emergent variable is formed by five elements; consequently, four excrescent variables (ex1 to ex4) are specified. Notably, the elements are assumed to be free from random measurement error. Moreover, the emergent variable must be connected to at least one other variable of the model as indicated by the double-headed arrow, as we will elaborate in the next subsection about model identification. This additionally highlights that emergent variables are context-specific, i.e., their meaning also depends on the model’s other variables (Yu et al., 2021). Since Amos does not allow for drawing hexagons, the emergent and excrescent variables are displayed as ovals in Figure 1.
To facilitate the application of CCA in the R environment (R Core Team, 2022) using the lavaan package (Rosseel, 2012), the R function specifyHO can be used (Footnote 1). In doing so, the user must specify the model’s emergent variables in lavaan syntax using the ‘<~’ operator. This model is then passed to the specifyHO function to obtain lavaan model syntax in which emergent variables are specified in compliance with the H–O specification. Finally, the obtained model syntax can be used as input for the sem function of the lavaan package to conduct CCA.
Model identification
To ensure that the model parameters are identified, constraints must be imposed on the parameters. This involves determining the variances of the emergent and excrescent variables. For this purpose, we set one composite loading for each emergent and excrescent variable to one. In this regard, no element can serve as a scaling variable more than once. In our example model, depicted in Figure 1, each element is used only once as a scaling indicator, i.e., only one of its composite loadings is constrained to 1. For example, the composite loading of Element 1 on the emergent variable is constrained to 1. Similarly, the composite loading of Element 2 on the excrescent variable ex1 is constrained to 1. Further, the emergent variable must be uncorrelated with the excrescent variables, in this case ex1 to ex4. Also, the emergent variable must be related to at least one other variable in the model other than its elements, e.g., to another observed, latent, or emergent variable in the model. In our illustrative model, this is indicated by the double-headed arrow. In contrast, the excrescent variables are only allowed to correlate with one another, as Figure 1 shows. Consequently, the emergent variable fully accounts for the covariances between the elements and other variables in the model. In other words, all information between the elements and other variables in the model is conveyed by the emergent variable (Dijkstra, 2017). Further, one needs to ensure that the excrescent variables span the remaining space of the elements that is not spanned by the emergent variable (Schuberth, 2023). To this end, we fix all composite loadings of each excrescent variable to zero except for two, namely the composite loading that is fixed to one to determine the excrescent variable’s variance and one composite loading that is freely estimated.
For example, in Figure 1, the composite loading of Element 1 on the first excrescent variable ex1 is a free model parameter, whereas all other composite loadings on that excrescent variable are constrained to 1 or 0. The composite loadings of the other excrescent variables are fixed in similar fashion. In fixing the composite loadings of the excrescent variables, it must be ensured that no excrescent variables are connected to the exact same elements. Finally, by default, most SEM software applications specify random measurement errors connected to the elements. In this case, the variances of these error terms must be constrained to zero.
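The identification rules above can be summarized as a pattern of fixed and free composite loadings. The following Python sketch builds one such pattern for a single emergent variable. The function name ho_loading_pattern is hypothetical, and the placement for ex2 onward (Element k free and Element k+1 fixed to one on ex_k) is our assumption, extrapolated from the ex1 example in the text; other placements that satisfy the rules are conceivable.

```python
def ho_loading_pattern(n_elements):
    """Build a loading pattern for one emergent variable under the rules above.

    Rows are elements, columns are composites (emergent first, then excrescent).
    Entries: 1 = fixed to one (scaling), 0 = fixed to zero, "free" = estimated.
    """
    if n_elements < 2:
        raise ValueError("an emergent variable needs at least two elements")
    names = ["emergent"] + [f"ex{k}" for k in range(1, n_elements)]
    pattern = [[0] * n_elements for _ in range(n_elements)]

    # Emergent variable: Element 1 is the scaling indicator, the rest are free.
    pattern[0][0] = 1
    for row in range(1, n_elements):
        pattern[row][0] = "free"

    # Excrescent variable ex_k (assumed layout): Element k is freely estimated,
    # Element k+1 is the scaling indicator, all other loadings stay at zero.
    for k in range(1, n_elements):
        pattern[k - 1][k] = "free"
        pattern[k][k] = 1
    return names, pattern

names, pattern = ho_loading_pattern(5)
print(names)
for element, row in enumerate(pattern, start=1):
    print(f"Element {element}:", row)
```

With five elements, each element serves exactly once as a scaling indicator, and no two excrescent variables are connected to the same pair of elements.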
Model estimation
Once identification of the model parameters has been ensured, they can be estimated. The H–O specification allows us to draw on different kinds of SEM estimators such as maximum likelihood (ML) (Jöreskog, 1970) or generalized least squares (GLS) (Browne, 1974). As a result, researchers applying CCA can gain all the benefits that they are accustomed to having with SEM, e.g., fixing parameters and gaining access to well-established model fit indices (Kline, 2015, Chapter 12).
A supposed disadvantage of the H–O specification is that weight estimates are not obtained by default because the relationships between the emergent and excrescent variables and their components are expressed by composite loadings instead of weights. However, as shown in Schuberth (2023) and Schamberger et al. (2023), the weight estimates can be retrieved from the inverted composite loading matrix. As most SEM software applications allow users to specify new parameters, this feature can be exploited to obtain the (standardized) weight estimates. For an explanation of how the weights can be obtained from the composite loadings, we refer the reader to Schuberth (2023) and Yu et al. (2023). Further, the specifyHO function offers the option of determining weights.
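To make the link between composite loadings and weights concrete, the sketch below inverts a small composite loading matrix in plain Python. It assumes the convention that the elements equal the loading matrix times the vector of composites, so that each row of the inverse holds the (unstandardized) weights of one composite. The loading values are hypothetical, and obtaining standardized weights would additionally require the variances of the elements and composites, which this sketch does not cover.

```python
def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    # Augment the matrix with the identity matrix.
    aug = [list(map(float, row)) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

# Hypothetical loading matrix for three elements (rows) and three composites
# (columns: emergent, ex1, ex2), following the fixed/free pattern in the text.
loadings = [
    [1.0, -0.5, 0.0],
    [0.8, 1.0, -0.4],
    [0.6, 0.0, 1.0],
]

weights = invert(loadings)
emergent_weights = weights[0]  # first row: weights forming the emergent variable
print(emergent_weights)
```

The first row of the inverse gives the weights of the emergent variable; the remaining rows belong to the excrescent variables and carry no substantive meaning.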
Model assessment
In CCA’s final step, the model is assessed, and its parameter estimates are interpreted. This involves assessing the overall model fit and assessing the emergent variables (Henseler & Schuberth, 2020; Schuberth et al., 2018). As in CFA, overall model assessment is crucial in CCA and typically involves considering the outcomes of the exact model fit test and various fit indices. If the estimated model’s fit is found to be unacceptable, then the elements forming the emergent variable probably act not as a new whole, but rather as merely a loose collection of parts. Consequently, researchers are urged to consider the elements individually or to modify their models.
To assess overall model fit in CCA, researchers can, in principle, draw on all that is known through CFA and SEM. This includes the chi-square test to assess the exact overall model fit (Jöreskog, 1967). However, because testing the exact overall model fit has been criticized as unrealistic (e.g., Bollen, 1989, Chapter 7), various fit indices have been proposed to gauge model fit. These include the standardized root mean square residual (SRMR; Bentler, 1995), the comparative fit index (CFI; Bentler, 1990), the Tucker-Lewis index (TLI; Tucker & Lewis, 1973), and the root mean square error of approximation (RMSEA; Steiger, 2016). Although existing studies have indicated that fit indices can detect misspecified composite models (Schuberth et al., 2018, 2022), future research still has to reassess their cut-off values for composite models.
Besides the overall model fit assessment, parameter estimates should be investigated. In this context, the composite loading and weight estimates are of particular interest. The emergent variables’ composite loadings are the covariances between an element and the corresponding emergent variable. Therefore, they show an element’s absolute contribution to the emergent variable (Cenfetelli & Bassellier, Reference Cenfetelli and Bassellier2009). Further, the composite loadings provide information on the orientation of an emergent variable. Specifically, the scaling indicator, i.e., the element whose loading was constrained to 1, determines the orientation of the emergent variable. If it eventually appears that the other elements forming the particular emergent variable show negative composite loadings—even if they are expected to correlate positively with that emergent variable—the researcher should either reconsider the scaling variable or fix the loading of the scaling variable to -1 instead of 1, to ensure the correct orientation of the emergent variable. In addition, the magnitude and significance of the composite loading estimates should be assessed, e.g., by considering the outcome of the z-test or confidence intervals. Furthermore, researchers who are interested in the composition of an emergent variable or want to calculate emergent variables’ scores should consider the weight estimates. Note that weight estimates are affected by multicollinearity, i.e., correlations among the elements, which can lead to differences in the signs of the composite loading and weight estimates. Finally, researchers should take criterion validity into account by considering concurrent and/or predictive validity (e.g., Piedmont, Reference Piedmont and Michalos2014). This is done by examining the extent to which an emergent variable correlates with a criterion variable.
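The remark that weights and composite loadings can differ in sign when elements are correlated can be checked with a small calculation. The sketch below takes a hypothetical covariance matrix of two highly correlated elements and hypothetical weights, and computes the standardized composite loadings as the correlations between each element and the resulting composite. All numbers are assumptions made for illustration only.

```python
from math import sqrt

def standardized_loadings(cov, weights):
    """Correlation of each element with the composite formed by the weights."""
    n = len(weights)
    # Covariance of element j with the composite: sum_k cov[j][k] * w_k.
    cov_with_composite = [sum(cov[j][k] * weights[k] for k in range(n))
                          for j in range(n)]
    # Variance of the composite: w' S w.
    composite_var = sum(weights[j] * cov_with_composite[j] for j in range(n))
    return [cov_with_composite[j] / sqrt(cov[j][j] * composite_var)
            for j in range(n)]

# Two standardized elements with a hypothetical correlation of 0.9.
cov = [[1.0, 0.9],
       [0.9, 1.0]]
# Hypothetical weights: the second element receives a negative weight.
weights = [1.0, -0.5]

loadings = standardized_loadings(cov, weights)
print(loadings)
```

In this constructed case the second element has a negative weight but a positive loading, which is why the text recommends inspecting both sets of estimates before drawing conclusions about an element.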
Against this background, we emphasize that CCA is not a replacement for CFA. While CFA is based on the common factor model to empirically validate latent variables and their measures, CCA is based on the composite model to empirically validate emergent variables and their elements. Consequently, the two techniques make different assumptions about the type of construct and serve different purposes. Therefore, their parameter estimates should not be compared as they have different conceptual meanings.
Illustrative example
In this section, we present an illustrative example from second language learning research to demonstrate the application of CCA following the steps presented in Table 2. Specifically, we consider the Metacognitive Awareness of Reading Strategies Inventory (MARSI), which assesses learners’ awareness and use of reading strategies while reading academic texts (Mokhtari & Reichard, Reference Mokhtari and Reichard2002). Originally, this inventory consisted of 30 strategy statements belonging to one of the following three strategy classes: (i) global reading strategies (GRS), (ii) problem-solving strategies (PSS), and (iii) support reading strategies (SRS). Because the fit of the common factor solution was not satisfactory, the MARSI was revised to result in a shortened version, i.e., the MARSI-R (Mokhtari et al., Reference Mokhtari, Dimitrov and Reichard2018), which has five items per construct, where each item refers to a different reading strategy.
The dataset we used in our illustrative example was collected and studied by Ondé et al. (2022) and is publicly available (Footnote 2). It consists of 548 valid student responses to the MARSI-R, including a variable measuring self-reported reading level, referred to as READ. The students were enrolled in compulsory secondary education at various educational centers in Barcelona and Madrid (Spain). For more detail on data collection and the sample, we refer the reader to Ondé et al.’s (2022) original study. We conducted our CCA in the statistical programming environment R (R Core Team, 2022) using the lavaan package (Rosseel, 2012, version 0.6–13) and the semTools package (Jorgensen et al., 2022, version 0.5–6) (Footnote 3). The semTools package was used to calculate the confidence intervals of the weight estimates.
Model specification
As explained in the previous section, the use of a strategy class can be considered an emergent variable. Considering the items of the MARSI-R, we argue that the various items determine the use of the three strategy classes, i.e., they define the three constructs instead of measuring them. Consequently, removing an item would most likely alter the meaning of the constructs. Therefore, we employed the composite model and CCA to empirically validate the use of the three strategy classes, namely GRS, PSS, and SRS. The use of each strategy class was modeled as an emergent variable composed of the corresponding five items from the MARSI-R. Additionally, we added the READ variable to assess criterion validity. Figure 2 shows the specified model. To guide practitioners using SEM software with a graphical interface, this figure presents the specification in Amos (Arbuckle, Reference Arbuckle2020). To specify the model in lavaan, researchers can use the user-written R function specifyHO.
Model identification
To ensure that the parameters are identified, we applied the rules presented in the previous section. As Figure 2 shows, the composite loadings were constrained appropriately, and each item served only once as a scaling indicator. Further, the excrescent variables were correlated only with the excrescent variables of their block and not with other variables in the model. Finally, each emergent variable was connected to at least one other variable besides its elements. In our case, the three emergent variables, i.e., GRS, PSS, and SRS, and the READ variable were allowed to covary.
Model estimation
The items of the dataset showed a mild degree of non-normality, i.e., skewness ranging from -1.57 to 0.13 and excess kurtosis ranging from -1.34 to 1.40. To account for the non-normality in the items, we used the maximum likelihood estimator with robust standard errors to estimate the model parameters, including a Satorra-Bentler scaled test statistic (MLM; Satorra & Bentler, 1994) as implemented in the R package lavaan. The model estimation terminated normally.
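For readers who want to see what the two normality diagnostics capture, the following minimal Python sketch computes skewness and excess kurtosis from central moments. It is an illustration only: the function name shape_statistics and the item responses are hypothetical, and the moment-based formulas used here are an assumption, since the exact estimator behind the reported ranges is not stated.

```python
def shape_statistics(values):
    """Return (skewness, excess kurtosis) based on central moments.

    Assumption: simple moment estimators (dividing by n); statistical
    packages may apply small-sample corrections that differ slightly.
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance (second moment)
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0  # 0 for a normal distribution
    return skewness, excess_kurtosis


# Hypothetical responses to a single 5-point item (not from the dataset)
item = [5, 4, 5, 3, 5, 4, 2, 5, 4, 5]
skew, kurt = shape_statistics(item)
print(f"skewness = {skew:.2f}, excess kurtosis = {kurt:.2f}")
```

Values near zero on both statistics would be consistent with normality; a left-skewed item such as the hypothetical one above yields a negative skewness.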
Model assessment
To assess the model, we followed our guidelines as given in Table 2. We first considered the overall model fit. The chi-square test rejected the null hypothesis of exact fit (χ² = 156.97, df = 72, p < 0.01). As a supplement, we considered various indices to judge model fit. The SRMR equaled 0.042, indicating a good model fit. Similarly, the robust RMSEA equaled 0.051, with a 90% confidence interval ranging from 0.040 to 0.062, thus also indicating a good model fit. Finally, the robust CFI and TLI equaled 0.934 and 0.891, respectively. As a result, we regarded the fit of the composite model to be acceptable.
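To make the link between the chi-square statistic and the RMSEA concrete, the following Python sketch applies the conventional RMSEA point-estimate formula to the values reported above. It is a rough illustration under stated assumptions: the function name rmsea is hypothetical, the N - 1 convention in the denominator is an assumption (some software uses N), and the formula omits the scaling correction that enters a robust RMSEA. The result (about 0.046) is therefore not expected to reproduce the robust value of 0.051.

```python
import math


def rmsea(chi_square, df, n):
    """Conventional (non-robust) RMSEA point estimate.

    Assumption: uses N - 1 in the denominator; a robust RMSEA additionally
    involves a scaling correction that is not applied here.
    """
    return math.sqrt(max(chi_square - df, 0.0) / (df * (n - 1)))


# Values reported in the text: chi-square = 156.97, df = 72, N = 548
print(round(rmsea(156.97, 72, 548), 3))
```

The max(..., 0) term reflects that the index is set to zero whenever the chi-square statistic does not exceed its degrees of freedom.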
Table 3 shows the standardized weight estimates, including their 95% confidence intervals. As this table demonstrates, all standardized weights were positive, i.e., all the elements contributed positively to forming their corresponding construct. Regarding the confidence intervals of the standardized weights, none contained zero except the standardized weight of PSS1, which indicates that PSS1 did not contribute significantly to PSS. However, following the guidelines in Table 2, in the next step we inspected the estimated standardized composite loadings. Results revealed that the standardized composite loading of PSS1 on PSS was both sizable and significant. Thus, we decided to keep PSS1 in order not to risk altering the meaning of the emergent variable PSS (Benitez et al., 2020). Similarly, all other elements showed a positive and significant composite loading with their corresponding emergent variable. In addition, we report the correlations among the three emergent variables GRS, PSS, and SRS, which were within a reasonable range, i.e., r (PSS, GRS) = 0.571 (95% CI [0.509, 0.633]), r (GRS, SRS) = 0.576 (95% CI [0.513, 0.639]), and r (PSS, SRS) = 0.568 (95% CI [0.504, 0.632]).
Note (Table 3): λstd = standardized composite loadings; wstd = standardized composite weights.
Finally, we examined the criterion validity of the emergent variables by considering the extent to which the three emergent variables GRS, PSS, and SRS correlated with students’ self-perception of their reading level (READ). From a theoretical point of view, this measure was expected to be correlated positively with GRS, PSS, and SRS (e.g., Mokhtari et al., 2018). With respect to our results, correlations between READ and GRS, PSS, and SRS were all positive and significant: r (READ, GRS) = 0.337 (95% CI [0.257, 0.418]), r (READ, PSS) = 0.328 (95% CI [0.254, 0.403]), and r (READ, SRS) = 0.247 (95% CI [0.164, 0.330]). Consequently, we find no violation of criterion validity. Overall, our results are in line with our hypothesis that GRS, PSS, and SRS behave as emergent variables.
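As a rough plausibility check on intervals of this kind, the following Python sketch computes a Fisher z confidence interval for a correlation. This is not the procedure used in the analysis: the intervals reported above stem from the estimated model, whereas the Fisher approach assumes a plain bivariate correlation between observed variables. The function name fisher_ci is hypothetical, and the sample size of 548 is taken from the dataset description.

```python
import math


def fisher_ci(r, n, z_crit=1.959964):
    """Approximate 95% CI for a correlation via the Fisher z transformation.

    Assumption: r is an ordinary correlation based on n observations; this
    is only a cross-check, not the model-based interval from the analysis.
    """
    z = math.atanh(r)               # transform r to the z scale
    se = 1.0 / math.sqrt(n - 3)     # standard error on the z scale
    lower = math.tanh(z - z_crit * se)
    upper = math.tanh(z + z_crit * se)
    return lower, upper


# Reported correlation between READ and GRS, with N = 548
lower, upper = fisher_ci(0.337, 548)
print(f"[{lower:.3f}, {upper:.3f}]")
print("excludes zero:", lower > 0 or upper < 0)
```

An interval that lies entirely above zero, as in this case, is consistent with a positive and significant correlation.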
Discussion
Researchers in the L2 and education domain frequently use questionnaires and inventories to collect data about their constructs of interest (Dörnyei & Dewaele, 2022). To validate such tools, L2 and education researchers regularly rely on CFA (or more recently, ESEM), which is based on the common factor model (Alamer, 2022; Alamer et al., 2023; Marsh & Alamer, 2024). Although CFA and the common factor model have proven to be useful in empirically validating questionnaires intended to measure latent variables, as explained in this paper, this approach has limited use for empirically validating inventories in which items make up the constructs, so-called emergent variables. This is because CFA assesses the factorial structure implied by the existence of a latent variable. However, this ignores important characteristics of emergent variables, which are not measured but composed of their constituting elements. A more suitable method for assessing emergent variables is CCA, which is based on the composite model and which our study has introduced into the education and language learning domains (see Footnote 4).
To demonstrate the application of CCA, we made use of an illustrative example. For this purpose, we used a publicly available dataset and considered the MARSI-R, which is an inventory designed to evaluate the perceived use of three reading strategy classes, i.e., global reading strategies, problem-solving reading strategies, and support reading strategies in L2 learning. Each of the 15 items captures the use of a specific strategy from one of the three classes, i.e., each item is unique and not interchangeable. Therefore, and in contrast to previous studies, we argue that the use of each strategy class is determined and not measured by its items. Consequently, we modeled the use of the three strategy classes by means of the composite model, which we assessed via CCA. Our results show that the model fit indices were within an acceptable range. Further, all composite loading estimates were both positive and significant, indicating that each strategy contributes in absolute terms to the use of its strategy class (Cenfetelli & Bassellier, 2009). Similarly, all weights except one were both significant and positive, showing that nearly every strategy makes a unique contribution to the use of the strategy class to which it is assigned.
To perform our analysis, we mainly used the lavaan R package (Rosseel, 2012) and complemented the analysis with semTools (Jorgensen et al., 2022). We deliberately opted for R and its packages as they are widely used and available free of charge. In addition, lavaan allows for specification of new parameters, which is essential for obtaining the (standardized) weight estimates. For this purpose, the user-written function specifyHO was developed for the readers of this paper and can be used freely. Further, recent lavaan versions (0.6-13 and above) show good convergence behavior in comparison to other SEM software packages. However, CCA can also be conducted using commercial SEM software such as Amos (Arbuckle, 2020), as our visual representations of the composite model illustrate. For software tutorials on CCA, we refer the reader to www.confirmatorycompositeanalysis.com.
Finally, researchers may feel tempted to compare CCA and CFA results. However, it is important to note that the two techniques serve different purposes and therefore the decision whether to use CFA or CCA should be based on theoretical arguments. Due to conceptual differences between CCA and CFA, researchers should not compare their parameter estimates, such as comparing composite loading values with factor loadings, as they have different conceptual meanings.
Extensions to CCA
Although we used CCA in our study, there are various possible extensions to it. For instance, latent variables can be included in the analysis, i.e., a CCA and a CFA can be conducted jointly. In such a case, we have a confirmatory composite factor analysis (CCFA; Hubona et al., 2021), which can be particularly valuable for researchers who study both latent and emergent variables simultaneously and who want to follow the two-step approach known from SEM (Anderson & Gerbing, 1988). Specifically, in the first step, a CCFA is conducted to assess the composite and common factor models, and in the second step, the emergent and latent variables are embedded in a structural model together with their items. For example, past research has shown that motivation has an impact on the use of GRS, PSS, and SRS (e.g., Alamer & Alsagoafi, 2023). To analyze such a situation, in the first step, a CCFA can be conducted to assess the composite and common factor models used to model the four constructs. As shown in Figure 3, motivation is modeled as a latent variable, while the use of each strategy class is modeled as an emergent variable. If no evidence against the validity of the composite and common factor models becomes apparent, a second step follows in which the substantive theory is assessed, i.e., the emergent and latent variables are embedded in a structural model, as shown in Figure 4.
Finally, various inventories for evaluating the use of strategies have been empirically validated using CFA. Since the outcome was often not satisfactory, ad hoc modifications were applied, e.g., by removing items or allowing multiple measurement errors to be correlated. These actions are often not theoretically justifiable. For instance, the MARSI, which originally consisted of 30 items, was reduced to the MARSI-R consisting of 15 items (Mokhtari et al., 2018) following data-driven ad hoc modifications. This might have resulted in important strategies being sacrificed. Therefore, we suggest that future research should re-evaluate such inventories using CCA (e.g., Alamer & Alsagoafi, 2023).
Conclusion
Past L2 researchers have mainly used CFA to assess their inventories, including those that evaluate emergent variables, i.e., constructs that are composed of elements/parts. As we have explained, for emergent variables CFA should not be the method of choice because it is based on the common factor model that does not align with the definition of emergent variables. The characteristics of emergent variables are captured more accurately in the composite model. As this paper proposes, CCA can be used to assess composite models.
Originally, PLS-PM and approaches to generalized canonical correlation analysis were proposed for model parameter estimation in CCA (Henseler et al., 2014; Schuberth et al., 2018). However, researchers faced various limitations, e.g., it is not possible to impose parameter constraints or deal with missing values using the full-information maximum likelihood (FIML) method, and there is only limited access to well-known fit indices (Schuberth et al., 2022). To overcome such limitations, in this study, we relied on the recently proposed H–O specification that allows researchers to conduct CCA using conventional SEM software applications such as Amos and Mplus (Schuberth, 2023; Yu et al., 2023). In this way, researchers conducting a CCA can gain all the benefits that they are accustomed to when using SEM with latent variables. A supposed disadvantage of the H–O specification is that the weight estimates are not obtained by default because the relationships between the emergent and excrescent variables and their elements are expressed by composite loadings. However, as Schuberth (2023) and Yu et al. (2023) showed, the weight estimates can be retrieved from the inverted composite loading matrix. Also, using the user-written function specifyHO makes it easy to obtain weight estimates automatically.
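The idea of retrieving weights from the inverted composite loading matrix can be illustrated with a small numerical sketch in Python. It assumes a single block in which the items are expressed as loadings times the emergent and excrescent variables, so that inverting the square loading matrix yields the weights, with the emergent variable's weights in the first row. The two-item block, the loading values, and the function name invert are hypothetical and chosen for readability; the sketch returns unstandardized weights and does not reproduce what specifyHO does internally.

```python
def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    # Augment the matrix with the identity matrix
    aug = [
        [float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
        for i, row in enumerate(matrix)
    ]
    for col in range(n):
        # Pick the row with the largest pivot for numerical stability
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("loading matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * c for v, c in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


# Hypothetical loading matrix for a block of two items:
# rows = items, columns = (emergent variable, excrescent variable)
loadings = [
    [0.9, 0.4],
    [0.8, -0.6],
]
weights = invert(loadings)
# First row: weights that form the emergent variable from the two items
print([round(w, 3) for w in weights[0]])
```

In an actual analysis, the loading matrix of each block would have as many rows and columns as the block has items (five per strategy class in the illustrative example), and the same inversion logic would apply.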
We have demonstrated the use of CCA by means of an illustrative example. Specifically, we considered the MARSI-R, which is an inventory for evaluating the use of different reading strategies in reading academic texts. To facilitate the application of CCA, we used the R open-source software (R Core Team, 2022) and a publicly available dataset (Ondé et al., 2022). The R code used for the analysis, including our user-written R function specifyHO, is freely available. In addition, we have presented model illustrations using Amos to show the reader how to specify models in software that offers a graphical interface. In this way, we hope that future research will benefit from CCA to the greatest extent possible and that researchers will consider revisiting the validity of their inventories by applying CCA.
Acknowledgments
This work was supported by the Deanship of Scientific Research, Vice Presidency for Graduate Studies and Scientific Research, King Faisal University, Saudi Arabia [Grant No. 5589].
Jörg Henseler acknowledges a financial interest in the composite-based SEM software ADANCO and its distributor, Composite Modeling. Moreover, he gratefully acknowledges financial support from FCT Fundação para a Ciência e a Tecnologia (Portugal), national funding through a research grant from the Information Management Research Center – MagIC/NOVA IMS (UIDB/04152/2020).
Moreover, we thank Daniel Ondé and colleagues for giving us permission to use their dataset in our illustrative example. We also thank Alexandra Elbakyan for her efforts in making science accessible.