1 Introduction
Cognitive diagnostic assessments (CDAs) have developed rapidly over the past several decades, and they are widely used in educational and psychological research (de la Torre, Reference de la Torre2009, Reference de la Torre2011; de la Torre & Douglas, Reference de la Torre and Douglas2004; DiBello et al., Reference DiBello, Roussos, Stout, Rao and Sinharay2007; Haberman & von Davier, Reference Haberman, von Davier, Rao and Sinharay2007; Henson et al., Reference Henson, Templin and Willse2009; Junker & Sijtsma, Reference Junker and Sijtsma2001; Rupp et al., Reference Rupp, Templin and Henson2010; Templin & Henson, Reference Templin and Henson2006; von Davier, Reference von Davier2014a). The primary motivation for the development of CDAs is to ascertain whether or not a student has mastered some fine-grained skills or attributes that are required to solve a particular item. More specifically, not only can CDAs be used to analyze in detail the strengths and weaknesses of students in the areas they are learning, but they can also provide powerful tools to help teachers improve classroom instruction.
There is a wide variety of cognitive diagnostic models (CDMs) available in the published CDA literature (DiBello et al., Reference DiBello, Roussos, Stout, Rao and Sinharay2007; Rupp & Templin, Reference Rupp and Templin2008b), and many of these are built on strong cognitive assumptions about the processes involved in problem-solving. These CDMs can be broadly classified into three different types: compensatory, non-compensatory, and general models. Compensatory models are based on the assumption of attribute compensation, which means that although the examinee may not have mastered all the attributes involved in an item, they are still more likely to score well on that item if they have mastered some of its attributes. This is because the attributes that the examinee has mastered can “compensate” for the other attributes that they have not mastered. The best-known compensatory models are the deterministic inputs, noisy “or” gate (DINO) model (Templin & Henson, Reference Templin and Henson2006) and the linear logistic model (LLM; Maris, Reference Maris1999). In contrast, non-compensatory models are constructed under the assumption of attribute conjunction, which means that under the assumption of an ideal response, an examinee can score on an item only after mastering all of the attributes involved in that item; otherwise, he or she will not be able to answer the item correctly. Widely used non-compensatory (conjunctive) models are the deterministic inputs, noisy “and” gate (DINA) model (Haertel, Reference Haertel1989; Junker & Sijtsma, Reference Junker and Sijtsma2001; Macready & Dayton, Reference Macready and Dayton1977) and the reduced reparameterized unified model (rRUM; Hartz, Reference Hartz2002). Some general CDM frameworks have also been established that include a variety of widely applied CDMs, such as the log-linear CDM (LCDM; Henson et al., Reference Henson, Templin and Willse2009), the generalized DINA (GDINA; de la Torre, Reference de la Torre2011) model, and the general diagnostic model (von Davier, Reference von Davier2008). Although DINA, DINO, rRUM, and LLM were developed from different application backgrounds, they can in fact be viewed as special cases of the LCDM by restricting certain parameters to zero in its saturated version. Henson et al. (Reference Henson, Templin and Willse2009) detailed how the LCDM can be transformed into traditional models such as DINA, DINO, rRUM, and LLM through parameter restrictions. Additionally, Ma and de la Torre (Reference Ma and de la Torre2016) elucidated that the LCDM and GDINA models are equivalent in their saturated forms.
Parameter estimation is the basis of model applications, and it is a prerequisite for interpretation of complicated data in the field of educational psychology. Several strategies have been developed to estimate the parameters of CDMs. Algorithms based on maximum likelihood have been widely used to estimate CDMs in the frequentist framework. Examples using a marginal maximum likelihood method to estimate the parameters of several CDMs via an Expectation-Maximization (EM) algorithm (Dempster et al., Reference Dempster, Laird and Rubin1977) can be found in the literature (de la Torre, Reference de la Torre2009, Reference de la Torre2011; Ma & de la Torre, Reference Ma and de la Torre2016; Ma & Guo, Reference Ma and Guo2019; Maris, Reference Maris1999). Some available R packages, such as “CDM” (George et al., Reference George, Robitzsch, Kiefer, Groß and Ünlü2016) and “GDINA” (Ma & de la Torre, Reference Ma and de la Torre2020), have been developed to estimate CDM parameters. However, algorithms based on maximum likelihood have some disadvantages, as elaborated by Yamaguchi and Templin (Reference Yamaguchi and Templin2022); for example, a maximum likelihood algorithm may converge to a local maximum. Accordingly, it is challenging to discern whether parameter estimates are obtained from a global maximum, even if multiple starting values are used to evaluate their optimality. In addition, calculation of the variability (standard errors) of parameter estimates depends on asymptotic theory in the likelihood framework, and an asymptotic distribution with parameter restrictions may not be correct when small sample sizes are involved.
In parallel with maximum likelihood-based methods, Bayesian statistical methods have also gained widespread attention for inferring various types of CDM parameters (e.g., Chung, Reference Chung2019; Culpepper, Reference Culpepper2015, Reference Culpepper2019, Culpepper & Hudson, Reference Culpepper and Hudson2018; DeCarlo, Reference DeCarlo2012; de la Torre & Douglas, Reference de la Torre and Douglas2004; Henson et al., Reference Henson, Templin and Willse2009; Jiang & Carter, Reference Jiang and Carter2019; Liu, Reference Liu2022; Liu et al., Reference Liu, Andersson and Skrondal2020; Zhan et al., Reference Zhan, Jiao, Man and Wang2019). More specifically, de la Torre and Douglas (Reference de la Torre and Douglas2004) implemented a Metropolis–Hastings (MH) algorithm for estimating the higher-order DINA model parameters. Henson et al. (Reference Henson, Templin and Willse2009) also adopted the MH algorithm to estimate LCDM parameters. Liu et al. (Reference Liu, Andersson and Skrondal2020) and Liu (Reference Liu2022) applied the Metropolis–Hastings Robbins–Monro (MH-RM) algorithm (Cai, Reference Cai2010) to estimate CDM parameters. With the help of conjugate prior distributions, Culpepper (Reference Culpepper2015) proposed a Gibbs sampling algorithm to estimate the parameters of the DINA model; the corresponding R package “dina” was developed by Culpepper (Reference Culpepper2015). On the basis of the work of Culpepper (Reference Culpepper2015), a new No-U-Turn Gibbs sampler was proposed by da Silva et al. (Reference da Silva, de Oliveira, von Davier and Bazán2018) to estimate the parameters of the DINA model. In addition, the Gibbs sampling algorithm has also been used for updating the $\text {Q}$ -matrix in CDMs (Chung, Reference Chung2019; Culpepper, Reference Culpepper2019; Culpepper & Hudson, Reference Culpepper and Hudson2018). DeCarlo (Reference DeCarlo2012) used the OpenBUGS software (Thomas et al., Reference Thomas, O’Hara, Ligges and Sturtz2006) to estimate reparameterized DINA model parameters. Zhan et al. (Reference Zhan, Jiao, Man and Wang2019) published a tutorial on estimating various types of CDMs using the R package “R2jags” (Su & Yajima, Reference Su and Yajima2015), which is associated with the JAGS program (Plummer, Reference Plummer2003). Jiang and Carter (Reference Jiang and Carter2019) estimated the parameters of the LCDM by means of the Hamiltonian Monte Carlo (HMC) algorithm (Neal, Reference Neal and Brooks2011) in the Stan program (Carpenter et al., Reference Carpenter, Gelman, Hoffman, Lee, Goodrich, Betancourt, Brubaker, Guo, Li and Riddell2017). However, the computationally intensive nature of Markov chain Monte Carlo (MCMC) estimation of CDM parameters presents a major hurdle to the widespread empirical application of Bayesian approaches in education when faced with large samples, numerous items, numerous attributes, and complex models (Yamaguchi & Okada, Reference Yamaguchi and Okada2020a; Oka et al., Reference Oka, Saso and Okada2023).
Researchers have recently become interested in the variational inference (VI) method as a more flexible and less computationally intensive alternative to traditional Bayesian statistical methods (Bishop, Reference Bishop2006; Blei et al., Reference Blei, Kucukelbir and McAuliffe2017; Cho et al., Reference Cho, Wang, Zhang and Xu2021; Grimmer, Reference Grimmer2011; Jaakkola & Jordan, Reference Jaakkola and Jordan2000; Jeon et al., Reference Jeon, Rijmen and Rabe-Hesketh2017; Oka & Okada, Reference Oka and Okada2023; Rijmen et al., Reference Rijmen, Jeon, Rabe-Hesketh and van der Linden2016; Urban & Bauer, Reference Urban and Bauer2021; Yamaguchi, Reference Yamaguchi2020; Yamaguchi & Martinez, Reference Yamaguchi and Martinez2023; Yamaguchi & Okada, Reference Yamaguchi and Okada2020a, Reference Yamaguchi and Okada2020b). Compared to the traditional MCMC methods, the VI method is a deterministic approximation approach that is based on posterior density factorization. This method accomplishes its goal of rapidly and efficiently dealing with large amounts of complex educational psychology data (e.g., large numbers of samples, items, and attributes) by transforming the statistical inference problem of the posterior density into an optimization problem. In view of their many benefits, VI algorithms have been developed to estimate a variety of psychological models such as item response theory models (Rijmen et al., Reference Rijmen, Jeon, Rabe-Hesketh and van der Linden2016; Urban & Bauer, Reference Urban and Bauer2021), generalized linear mixed models (Jeon et al., Reference Jeon, Rijmen and Rabe-Hesketh2017), and CDMs (Oka et al., Reference Oka, Saso and Okada2023; Oka & Okada, Reference Oka and Okada2023; Yamaguchi, Reference Yamaguchi2020; Yamaguchi & Martinez, Reference Yamaguchi and Martinez2023; Yamaguchi & Okada, Reference Yamaguchi and Okada2020a, Reference Yamaguchi and Okada2020b).
Recently, Yamaguchi and Okada (Reference Yamaguchi and Okada2020b) introduced a VI method specifically tailored for the DINA model, marking a significant advancement in this field. This method was derived based on the optimal variational posteriors for each model parameter. Subsequently, Yamaguchi (Reference Yamaguchi2020) further extended VB inference applications by developing an algorithm for the multiple-choice DINA (MC-DINA) model. This extension to the MC-DINA model demonstrated the flexibility and computational efficiency of VB methods. Later, Yamaguchi and Okada (Reference Yamaguchi and Okada2020a) developed a VB inference algorithm for saturated CDMs. They ingeniously introduced a G-matrix, reformulating existing generalized CDMs, typically parameterized by attribute parameters, into a Bernoulli mixture model. This reformulation facilitated conditionally conjugate priors for model parameters, simplifying the derivation process and enhancing algorithmic efficiency. Oka et al. (Reference Oka, Saso and Okada2023) sustained this trajectory of innovation by developing a VB algorithm for a polytomous-attribute saturated CDM. Their work, building on the foundational research of Yamaguchi and Okada (Reference Yamaguchi and Okada2020a), not only advanced the field but also incorporated a parallel computing configuration. This significantly improved the computational efficiency of the VB algorithm, demonstrating its evolving capability to handle more complex CDM structures. Simultaneously, Oka and Okada (Reference Oka and Okada2023) tackled scalability challenges in CDMs by developing an estimation algorithm for the Q-matrix of the DINA model. Their approach, integrating stochastic optimization with VI in an iterative algorithm, showcased the adaptability and robustness of VB methods in dealing with large-scale CDMs. This series of developments highlights the ongoing progress and effectiveness of VB methods in the estimation of diverse models within the CDM framework.
To date, no VB algorithms have been developed to directly estimate the item parameters in the LCDM with a logit link function. This is largely due to the challenges in directly deriving the conditional posterior density of these item parameters. Although Yamaguchi and Okada (Reference Yamaguchi and Okada2020a) proposed the variational EM (VEM) algorithm to estimate the saturated LCDM, they actually used the least-squares transformation method (de la Torre, Reference de la Torre2011) to convert the estimates of the item response probabilities of the item-specific attribute-mastery pattern parameters, obtained through the VEM algorithm, into the corresponding item parameters of the LCDM. Furthermore, Yamaguchi and Templin (Reference Yamaguchi and Templin2022) employed a one-to-one mapping within the Bayesian framework to equivalently transform the item response probability parameters, obtained through the Gibbs sampling algorithm, into item parameters of the LCDM. This paper bridges this gap by proposing a novel variational Bayesian EM-maximization (VBEM-M) algorithm for estimating the saturated LCDM. Briefly, we obtain a tight lower bound on the likelihood function of the LCDM using a Taylor expansion (Jaakkola & Jordan, Reference Jaakkola and Jordan2000), in which the item parameters appear in a quadratic form. This allows for the existence of a conjugate prior distribution, enabling the implementation of the VI method. Consequently, the VI algorithm can be executed in the LCDM by deriving a specific posterior distribution for the item parameters, originating from the Gaussian prior distribution that serves as the conjugate prior for item parameters.
We highlight the advantages of the VBEM-M algorithm over the other algorithms from the following perspectives. First, our VBEM-M algorithm overcomes the problem that, when the VI method is implemented for the saturated LCDM formulation, the conditional variational posteriors of the parameters to be derived do not belong to the same distributional family as their priors. Second, the VBEM-M algorithm can directly estimate the item parameters and latent attribute-mastery patterns (also called “attribute profiles”) simultaneously, unlike Yamaguchi and Okada’s (Reference Yamaguchi and Okada2020a) VEM algorithm, which requires a two-step process to obtain the item parameter estimates. Third, the VBEM-M algorithm can obtain more stable and accurate estimates than an EM algorithm, especially under high-dimensional and small-sample-size conditions. Finally, our VBEM-M algorithm offers considerable benefits in computing time compared to the time-consuming traditional MCMC algorithms. This is because we use the VI method to transform the posterior inference problem into an optimization problem.
The rest of this paper is organized as follows. Section 2 presents the LCDM and its special case, the DINA model; Section 3 introduces the specific implementation of the VBEM-M algorithm for estimating the LCDM. Section 4 presents three simulation studies that evaluate the performance of the VBEM-M algorithm in parameter recovery across different simulation conditions and compares the performance of the VBEM-M, VB, MCMC, and EM algorithms. Section 5 uses two empirical examples to demonstrate the model estimation results of these four algorithms. Finally, some concluding remarks are presented in Section 6.
2 Cognitive diagnostic models
2.1 Log-linear cognitive diagnostic model
In this study, we focused on the LCDM. This is because it is a general model that contains a large number of models that have been previously discussed, such as DINA, DINO, rRUM, and LLM (Henson et al., Reference Henson, Templin and Willse2009). More importantly, the LCDM provides a parameterization that not only characterizes the differences between the various models but also supports more complex data structures (Henson et al., Reference Henson, Templin and Willse2009). In fact, any possible set of constraints on the saturated-form LCDM can be used to define a model for the item responses within the framework of cognitive theory. Moreover, the general parametric form makes the relationships between compensatory and non-compensatory models easier to understand. A brief introduction to the LCDM is given below.
First, we define several indices that will be important throughout this paper. Each examinee is denoted by $i~(i=1,\cdots ,N)$ , each item by $j~(j=1,\cdots ,J)$ , each attribute by $k~(k=1,\cdots ,K)$ , and the latent class corresponding to an attribute profile is denoted by $l~(l=1,\cdots ,L)$ . We consider the latent attribute $\alpha _{ik}$ to be a binary variable, where the absence or presence of the corresponding attribute is represented by the values 0 and 1, respectively. $\boldsymbol {\alpha }_{i}=(\alpha _{i1},\cdots ,\alpha _{ik},\cdots ,\alpha _{iK})^{\text {T}}$ is a vector of K-dimensional latent attribute profiles for the ith examinee. In light of the categorical nature of the latent classes, $\boldsymbol {\alpha }_{i}$ belongs to one of $L=2^K$ latent attribute profiles. Defining $\boldsymbol {\widetilde {\alpha }}_{l}=(\widetilde {\alpha }_{l1},\cdots ,\widetilde {\alpha }_{lk},\cdots ,\widetilde {\alpha }_{lK})^{\text {T}}$ as the attribute profile for examinees of class l, where $\widetilde {\alpha }_{lk}$ is 1 if the examinees of class l acquire skill k and 0 otherwise, will be useful in the following.
$\widetilde {\textbf {A}}=(\widetilde {\boldsymbol {\alpha }}_{1},\cdots ,\widetilde {\boldsymbol {\alpha }}_{l},\cdots ,\widetilde {\boldsymbol {\alpha }}_{L})^{\text {T}}$ denotes a matrix of $L \times K$ dimensions containing all the attribute profiles. The $\text {Q}$ -matrix (Tatsuoka, Reference Tatsuoka1983) is a $J \times K$ matrix used to describe the relationship between attributes and items, where $\boldsymbol {q}_j^{\text {T}}=(q_{j1},\cdots ,q_{jk},\cdots ,q_{jK})$ is the jth row of the $\text {Q}$ -matrix and $q_{jk}\in \{0,1\}$ ; that is, $\textbf {Q}=(\boldsymbol {q}_1,\cdots ,\boldsymbol {q}_j,\cdots ,\boldsymbol {q}_J)^{\text {T}}$ : $q_{jk}=1$ if attribute k is required by item j, and $q_{jk}=0$ otherwise. Next, a binary latent indicator variable $\boldsymbol {z}_i=[z_{i1},\cdots ,z_{il},\cdots ,z_{iL}]^{\text {T}}$ is introduced, which satisfies $\sum _{l=1}^{L}z_{il}=1$ , where $z_{il}=1$ denotes the ith examinee belonging to the lth attribute profile (i.e., $\boldsymbol {\alpha }_{i}=\widetilde {\boldsymbol {\alpha }}_{l}$ ). Let $x_{ij}$ be the observed item response for the ith examinee to the jth item: $x_{ij}=1$ if the ith examinee gives the correct answer for the jth item, and it is 0 otherwise. The corresponding item response matrix for all examinees answering all items is $\textbf {X}=(\boldsymbol {x}_{1},\cdots ,\boldsymbol {x}_{i},\cdots ,\boldsymbol {x}_{N})^{\text {T}}$ , where $\boldsymbol {x}_{i}=(x_{i1},\cdots ,x_{ij},\cdots ,x_{iJ})^{\text {T}}$ , $i=1,\cdots ,N$ . Then, the probability of a correct response for the LCDM can be expressed as
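$$P(x_{ij}=1\mid \boldsymbol {\alpha }_{i}=\widetilde {\boldsymbol {\alpha }}_{l})=\frac {\exp \left (\eta _{j}+\boldsymbol {\lambda }_j^{\text {T}}\textbf {h}(\widetilde {\boldsymbol {\alpha }}_{l},\boldsymbol {q}_{j})\right )}{1+\exp \left (\eta _{j}+\boldsymbol {\lambda }_j^{\text {T}}\textbf {h}(\widetilde {\boldsymbol {\alpha }}_{l},\boldsymbol {q}_{j})\right )},$$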
where $\eta _{j}$ is the intercept parameter, and $\exp (\eta _{j})/(1+\exp (\eta _{j}))$ indicates the probability that an examinee answers correctly on item j if he or she has not mastered any of the attributes examined on that item. $\boldsymbol {\lambda }_j=(\lambda _{j1},\cdots ,\lambda _{jK},\lambda _{j12},\cdots ,\lambda _{j12\cdots (K-1)K})^{\text {T}}$ is the slope parameter vector, a $D \times 1$ vector with $D=2^K-1$ . $\textbf {h}(\widetilde {\boldsymbol {\alpha }}_{l},\boldsymbol {q}_{j})$ represents a set of linear combinations of $\widetilde {\boldsymbol {\alpha }}_{l}$ and $\boldsymbol {q}_{j}$ :
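$$\boldsymbol {\lambda }_j^{\text {T}}\textbf {h}(\widetilde {\boldsymbol {\alpha }}_{l},\boldsymbol {q}_{j})=\sum _{k=1}^{K}\lambda _{jk}\widetilde {\alpha }_{lk}q_{jk}+\sum _{k=1}^{K-1}\sum _{k'=k+1}^{K}\lambda _{jkk'}\widetilde {\alpha }_{lk}\widetilde {\alpha }_{lk'}q_{jk}q_{jk'}+\cdots +\lambda _{j12\cdots (K-1)K}\prod _{k=1}^{K}\widetilde {\alpha }_{lk}q_{jk}.$$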
Combining the latent variable $\boldsymbol {z}_i$ and Eq. (2), the LCDM can be rewritten as
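$$P(x_{ij}=1\mid \boldsymbol {z}_{i},\eta _j,\boldsymbol {\lambda }_j,\boldsymbol {q}_j)=\prod _{l=1}^{L}\left [\frac {\exp \left (\eta _{j}+\boldsymbol {\lambda }_j^{\text {T}}\textbf {h}(\widetilde {\boldsymbol {\alpha }}_{l},\boldsymbol {q}_{j})\right )}{1+\exp \left (\eta _{j}+\boldsymbol {\lambda }_j^{\text {T}}\textbf {h}(\widetilde {\boldsymbol {\alpha }}_{l},\boldsymbol {q}_{j})\right )}\right ]^{z_{il}}.$$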
2.2 DINA model
The DINA model, a special case of the LCDM, has a relatively straightforward structure and has been widely adopted in cognitive diagnostic assessments; specialized software packages are also available for a number of estimation techniques grounded in the model. Therefore, we provide a short overview of the traditional DINA model and its interconversion with the LCDM. Two item parameters are introduced in the traditional DINA model for each item j: $s_j$ is the slipping parameter and $g_j$ is the guessing parameter, and the probability of a correct response can be written as
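$$P(x_{ij}=1\mid \gamma _{ij},s_j,g_j)=(1-s_j)^{\gamma _{ij}}\,g_j^{\,1-\gamma _{ij}},\qquad \gamma _{ij}=\prod _{k=1}^{K}\alpha _{ik}^{\,q_{jk}},$$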
where $\gamma _{ij}$ is the ideal response: $\gamma _{ij}=1$ indicates that examinee i possesses all the attributes required by item j, and $\gamma _{ij}=0$ otherwise. The parameters $s_j$ and $g_j$ can be formally defined by
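$$s_j=P(x_{ij}=0\mid \gamma _{ij}=1),\qquad g_j=P(x_{ij}=1\mid \gamma _{ij}=0).$$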
Since the estimation approach presented in this work is based on the LCDM, we must first convert the DINA model to LCDM format. Our next topic is the connection between the DINA model and the LCDM and how they may be converted back and forth.
Let $\widetilde {\boldsymbol {K}}_j=\{k: \text {attribute}~ k ~\text {is measured by item}~ j\}$ denote an indicator set of attributes investigated by item j and $K^*_j$ denote the number of investigated attributes. Then, the DINA model can be rewritten in the form:
where
For simplicity, we denote $\lambda _{j\tilde {K}_{j1}\tilde {K}_{j2}\cdots \tilde {K}_{jK_j^*}}$ as $\lambda _j$ and the DINA model is equivalent to the following form:
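$$P(x_{ij}=1\mid \widetilde {\boldsymbol {\alpha }}_{l})=\frac {\exp \left (\eta _j+\lambda _j\prod _{k\in \widetilde {\boldsymbol {K}}_j}\widetilde {\alpha }_{lk}\right )}{1+\exp \left (\eta _j+\lambda _j\prod _{k\in \widetilde {\boldsymbol {K}}_j}\widetilde {\alpha }_{lk}\right )},\qquad \text {with}\quad \eta _j=\log \frac {g_j}{1-g_j}\quad \text {and}\quad \eta _j+\lambda _j=\log \frac {1-s_j}{s_j}.$$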
While this study focuses mostly on the LCDM, various variants of the LCDM, such as the DINO model, LLM, and saturated LCDM, are also discussed. We will therefore not go into great depth here; instead, the reader should refer to the Supplementary Material for the necessary information.
3 Variational Bayesian EM-maximization algorithm for the LCDM
3.1 Variational Bayesian EM algorithm
Since it is straightforward to convert an approximate conditional posterior distribution problem into an optimization problem using VI methods, these techniques see extensive application in inferring Bayesian models in the area of machine learning (Beal, Reference Beal2003; Bishop, Reference Bishop2006; Jordan et al., Reference Jordan, Ghahramani, Jaakkola and Saul1999). Next, we briefly outline the implementation process of the variational Bayesian EM (VBEM) algorithm (Beal, Reference Beal2003). Assume that the observed dataset $\boldsymbol {y}=(\boldsymbol {y}_1,\cdots ,\boldsymbol {y}_i,\cdots ,\boldsymbol {y}_N)$ is produced by model $\mathcal {M}$ , where model $\mathcal {M}$ consists of the latent variables $\boldsymbol {\zeta }=(\boldsymbol {\zeta }_1,\cdots ,\boldsymbol {\zeta }_{i},\cdots ,\boldsymbol {\zeta }_{N}$ ) and model parameters $\boldsymbol {\theta }$ . Next, we specify a variational density family $\mathcal {Q}$ over the unknown variables $\boldsymbol {\zeta }$ and $\boldsymbol {\theta }$ . The purpose of this is to establish the optimal approximation $q(\boldsymbol {\zeta },\boldsymbol {\theta })\in \mathcal {Q}$ to their posterior distribution using this specified variational density (i.e., $q(\boldsymbol {\zeta },\boldsymbol {\theta })\rightsquigarrow p(\boldsymbol {\zeta },\boldsymbol {\theta }|\boldsymbol {y})$ ). Next, we introduce the concept of the evidence lower bound (ELBO), which is critical for determining the optimal $q(\boldsymbol {\zeta },\boldsymbol {\theta })$ . Let $p(\boldsymbol {y}|\mathcal {M})$ be a marginal density of the model $\mathcal {M}$ ; the ELBO can then be represented as a lower bound of the logarithm marginal density $\log p(\boldsymbol {y}|\mathcal {M})$ :
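$$\log p(\boldsymbol {y}|\mathcal {M})=\log \int \!\!\int p(\boldsymbol {y},\boldsymbol {\zeta },\boldsymbol {\theta }|\mathcal {M})\,\mathrm {d}\boldsymbol {\zeta }\,\mathrm {d}\boldsymbol {\theta }\geq \int \!\!\int q(\boldsymbol {\zeta },\boldsymbol {\theta })\log \frac {p(\boldsymbol {y},\boldsymbol {\zeta },\boldsymbol {\theta }|\mathcal {M})}{q(\boldsymbol {\zeta },\boldsymbol {\theta })}\,\mathrm {d}\boldsymbol {\zeta }\,\mathrm {d}\boldsymbol {\theta }=\underline {\mathcal {L}}\left (q(\boldsymbol {\zeta },\boldsymbol {\theta })\right ),$$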
where $\underline {\mathcal {L}}\left (q(\boldsymbol {\zeta },\boldsymbol {\theta })\right )$ is denoted as the ELBO, which is a function of the free distribution $q(\boldsymbol {\zeta },\boldsymbol {\theta })$ . We need to maximize $\underline {\mathcal {L}}\left (q(\boldsymbol {\zeta },\boldsymbol {\theta })\right )$ with respect to $q(\boldsymbol {\zeta },\boldsymbol {\theta })$ so that it tends more closely to $ \log p(\boldsymbol {y}|\mathcal {M})$ . Blei et al. (Reference Blei, Kucukelbir and McAuliffe2017) presented a formula connecting $ \log p(\boldsymbol {y}|\mathcal {M})$ with the ELBO and the Kullback–Leibler (KL) divergence:
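$$\log p(\boldsymbol {y}|\mathcal {M})=\underline {\mathcal {L}}\left (q(\boldsymbol {\zeta },\boldsymbol {\theta })\right )+\text {KL}\left (q(\boldsymbol {\zeta },\boldsymbol {\theta })\,\|\,p(\boldsymbol {\zeta },\boldsymbol {\theta }|\boldsymbol {y})\right ).$$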
Since $\log p(\boldsymbol {y}|\mathcal {M})$ is a constant with respect to $q(\boldsymbol {\zeta },\boldsymbol {\theta })$ , maximizing the ELBO is actually equivalent to minimizing the KL divergence. Specifically, the optimal $q(\boldsymbol {\zeta },\boldsymbol {\theta })$ in the variational density family $\mathcal {Q}$ is the density that minimizes the KL divergence between itself and the posterior distribution $p(\boldsymbol {\zeta },\boldsymbol {\theta }|\boldsymbol {y})$ . To further simplify the variational density $q(\boldsymbol {\zeta },\boldsymbol {\theta })$ , we assume that it satisfies mean-field theory. Mean-field theory has been widely used in variational Bayesian inference (Beal, Reference Beal2003; Blei et al., Reference Blei, Kucukelbir and McAuliffe2017; Jordan et al., Reference Jordan, Ghahramani, Jaakkola and Saul1999; Wand et al., Reference Wand, Ormerod, Padoan and Frühwirth2011). Under mean-field theory, the unknown variables are assumed to be mutually independent, each governed by a separate factor in the variational density, allowing the variational density $q(\boldsymbol {\zeta },\boldsymbol {\theta })$ to be decomposed into $q(\boldsymbol {\zeta })q(\boldsymbol {\theta })$ . An iterative optimization procedure is implemented by seeking to maximize the mean-field variational density of a parameter of interest while fixing the others. The VB algorithm can be divided into the following two steps:
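VBE step:
$$q^{(t+1)}(\boldsymbol {\zeta }_i)=\frac {1}{\mathcal {Z}_{\boldsymbol {\zeta }_i}}\exp \left \{\text {E}_{q^{(t)}(\boldsymbol {\theta })}\left [\log p(\boldsymbol {y}_i,\boldsymbol {\zeta }_i\mid \boldsymbol {\theta })\right ]\right \},\quad i=1,\cdots ,N;$$
VBM step:
$$q^{(t+1)}(\boldsymbol {\theta })=\frac {1}{\mathcal {Z}_{\boldsymbol {\theta }}}\,p(\boldsymbol {\theta })\exp \left \{\sum _{i=1}^{N}\text {E}_{q^{(t+1)}(\boldsymbol {\zeta }_i)}\left [\log p(\boldsymbol {y}_i,\boldsymbol {\zeta }_i\mid \boldsymbol {\theta })\right ]\right \},$$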
where $\mathcal {Z}_{\boldsymbol {\zeta }_i}$ and $\mathcal {Z}_{\boldsymbol {\theta }}$ are the normalizing constants. To sum up, the variational density for the latent variable is updated in the VBE step, while the variational density for the model parameters is updated in the VBM step. Therefore, the prerequisite to be able to implement the VBEM algorithm is that the posterior distribution of all parameters, either latent variables or model parameters, should have a closed form. The VEM algorithm proposed by Yamaguchi and Okada (Reference Yamaguchi and Okada2020a, Reference Yamaguchi and Okada2020b) in educational psychometric research is essentially identical to the VBEM algorithm provided by Beal (Reference Beal2003), with the only differences being in nomenclature.
3.2 Variational methods in Bayesian logistic regression
As mentioned above, implementing the VBEM algorithm requires a closed form for the posterior distributions of each parameter. Therefore, the VBEM algorithm cannot be directly applied to the LCDM based on the logit link function. To overcome this challenge, we adopt Jaakkola and Jordan’s (Reference Jaakkola and Jordan2000) variational Bayesian method for logistic regression models to estimate the more complex LCDM in the cognitive diagnostic framework. Specifically, their method uses a Taylor expansion on the logistic function to obtain a tight lower bound, facilitating parameter representation in a Gaussian distribution form that is easily implementable for VI. Next, we will provide the mathematical expression that Jaakkola and Jordan (Reference Jaakkola and Jordan2000) used for performing the first-order Taylor expansion and the specific derivation of the tight lower bound.
Consider the logistic function $\sigma (\omega )=1/\left (1+\exp (-\omega )\right )$ . The corresponding log logistic function can be derived as
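$$\log \sigma (\omega )=-\log \left (1+\exp (-\omega )\right )=\frac {\omega }{2}-\log \left (\exp \left (\frac {\omega }{2}\right )+\exp \left (-\frac {\omega }{2}\right )\right ).$$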
Denote that
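$$f(\omega )=-\log \left (\exp \left (\frac {\omega }{2}\right )+\exp \left (-\frac {\omega }{2}\right )\right ),\qquad \text {so that}\quad \log \sigma (\omega )=\frac {\omega }{2}+f(\omega ).$$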
By calculating the second derivative, we can determine that $f(\omega )$ is a convex function about the variable $\omega ^2$ . Therefore, any tangent line of $f(\omega )$ can serve as its lower bound, as it will always be less than or equal to $f(\omega )$ . A tight lower bound function for $f(\omega )$ can be obtained by executing a first-order Taylor expansion on the function $f(\omega )$ in terms of the variable $\omega ^2$ at the point $\xi ^2$ ,
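$$f(\omega )\geq f(\xi )-\lambda (\xi )\left (\omega ^2-\xi ^2\right ),\qquad \text {where}\quad \lambda (\xi )=\frac {1}{4\xi }\tanh \left (\frac {\xi }{2}\right )=\frac {1}{2\xi }\left [\sigma (\xi )-\frac {1}{2}\right ].$$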
According to Eqs.(12) and (13), we can derive a tight lower bound of $\sigma (\omega )$ with the specific form as
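$$\sigma (\omega )\geq \sigma (\xi )\exp \left \{\frac {\omega -\xi }{2}-\lambda (\xi )\left (\omega ^2-\xi ^2\right )\right \},$$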
which results in a quadratic form on $\omega $ .
Regarding the LCDM, which also employs a logistic form, $\omega $ represents a set of linear combinations. These combinations involve unknown item parameters, an individual’s latent attribute vector, and the known $\text {Q}$ -matrix within the LCDM framework (for further details, please refer to Eqs. (1) and (2)). Based on Eq. (14), we can derive a quadratic form for the item parameter. Consequently, the VI algorithm can be implemented in the LCDM by deriving a specific posterior distribution for item parameters using the Gaussian prior distribution, which serves as the conjugate prior for these parameters. In the next subsection, we will focus on elucidating the process of deriving the tight lower bound in the LCDM using Eq. (14).
3.3 Tight lower bound for the LCDM
In this section, the goal is to derive the tight lower bound for the LCDM as outlined above. Before presenting the implementation of our VBEM-M algorithm, we first transform the item response data so that the tight lower bound of the likelihood function is easier to obtain. The item response data $x_{ij}\in \{0,1\}$ are transformed into $y_{ij}\in \{-1,1\}$ via the equation $y_{ij}=2x_{ij}-1$ . Let $\boldsymbol {\lambda }^*_j=(\eta _j,\boldsymbol {\lambda }_j^{\text {T}})^{\text {T}}$ and $\boldsymbol {h}_{jl}^*=(1,h(\boldsymbol {\tilde {\alpha }}_{l},\boldsymbol {q}_j))$ ; the item response probability of $y_{ij}$ is then given by
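$$p(y_{ij}\mid z_{il}=1,\boldsymbol {\lambda }^*_j)=\left [\frac {\exp \left (\boldsymbol {\lambda }_j^{*\text {T}}\boldsymbol {h}^*_{jl}\right )}{1+\exp \left (\boldsymbol {\lambda }_j^{*\text {T}}\boldsymbol {h}^*_{jl}\right )}\right ]^{(1+y_{ij})/2}\left [\frac {1}{1+\exp \left (\boldsymbol {\lambda }_j^{*\text {T}}\boldsymbol {h}^*_{jl}\right )}\right ]^{(1-y_{ij})/2}.$$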
Recalling the logistic function form, the item response probability of $y_{ij}$ can then be rewritten as follows:
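$$p(y_{ij}\mid z_{il}=1,\boldsymbol {\lambda }^*_j)=\sigma \left (y_{ij}\,\boldsymbol {\lambda }_j^{*\text {T}}\boldsymbol {h}^*_{jl}\right )=\frac {1}{1+\exp \left (-y_{ij}\,\boldsymbol {\lambda }_j^{*\text {T}}\boldsymbol {h}^*_{jl}\right )}.$$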
Therefore, the likelihood based on the introduced latent variable z can be represented by
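$$p(\textbf {Y}\mid \boldsymbol {z},\boldsymbol {\lambda }^*)=\prod _{i=1}^{N}\prod _{j=1}^{J}\prod _{l=1}^{L}\left [\sigma \left (y_{ij}\,\boldsymbol {\lambda }_j^{*\text {T}}\boldsymbol {h}^*_{jl}\right )\right ]^{z_{il}}.$$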
According to Eq. (14), the tight lower bound function for $\sigma (y_{ij}\boldsymbol {\lambda }_j^{*\text {T}}\boldsymbol {h}^*_{jl})$ is determined by performing a first-order Taylor expansion with respect to the variable $(y_{ij}\boldsymbol {\lambda }_j^{*\text {T}}\boldsymbol {h}^*_{jl})^2$ at the point $\xi _{ijl}^2$ . Therefore, a tight lower bound of the likelihood for the LCDM can be derived by:
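$$p(\textbf {Y}\mid \boldsymbol {z},\boldsymbol {\lambda }^*)\geq \prod _{i=1}^{N}\prod _{j=1}^{J}\prod _{l=1}^{L}\left [\sigma (\xi _{ijl})\exp \left \{\frac {y_{ij}\,\boldsymbol {\lambda }_j^{*\text {T}}\boldsymbol {h}^*_{jl}-\xi _{ijl}}{2}-\lambda (\xi _{ijl})\left (\left (\boldsymbol {\lambda }_j^{*\text {T}}\boldsymbol {h}^*_{jl}\right )^2-\xi _{ijl}^2\right )\right \}\right ]^{z_{il}},$$
where the fact that $y_{ij}^2=1$ has been used.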
Given that the tight lower bound for the likelihood function is of exponential form, using the multivariate normal distribution as a conjugate prior distribution for $\boldsymbol {\lambda }^*$ will yield a closed-form posterior distribution. Due to these considerations, in the subsequent computations, we implement the VBEM-M algorithm using the tight lower bound of the likelihood function rather than the original likelihood function. Moreover, it is important to highlight that a new local parameter, $\xi _{ijl}$ , has been introduced at this stage. Determining the optimal value for $\xi _{ijl}$ is an essential part of our analysis. In this paper, we implement a maximization process to ascertain the most suitable value for $\xi _{ijl}$ . The detailed methodology behind this process will be elaborated in the following subsection.
3.4 Fully Bayesian representation of the joint posterior distribution
In the fully Bayesian framework, statistical inference relies on the selection of the prior distribution. The posterior distribution can be derived by combining the prior distribution (prior information) with the likelihood function (sample information). Prior distributions from the following Bayesian hierarchical structures will be considered in this study:
where $\textbf {I}_{D+1}$ is a $(D+1)$ -dimensional identity matrix. Parameter c is a truncation parameter. Some literature restricts the main effect terms of $\boldsymbol {\lambda }$ to non-negative values (Zhan et al., Reference Zhan, Jiao, Man and Wang2019). To address this, a truncation parameter c is introduced to adjust the range of values for the prior parameter $\lambda _{0,main}$ . For example, when c is set to $-\infty $ , there is no restriction on $\lambda _{0,main}$ , while setting $c=0$ restricts $\lambda _{0,main}$ to non-negative values. In practice, users can adjust the value of c to restrict the range of $\lambda _{0,main}$ according to their specific requirements. Let $\boldsymbol {\Omega }=(\boldsymbol {\delta }_0,\mu _{\eta },\sigma ^2_{\eta },\mu _{\lambda },\sigma ^2_{\lambda })$ , the joint posterior distribution of $(\textbf {Y},\boldsymbol {z},\boldsymbol {\pi },\boldsymbol {\lambda }^*,\boldsymbol {\lambda }^*_{0} | \textbf {Q},\boldsymbol {\Omega },\widetilde {\textbf {A}})$ based on the tight lower bound can be represented by
where $const$ denotes a constant. The logarithm of $\underline {p(\textbf {Y},\boldsymbol {z},\boldsymbol {\pi },\boldsymbol {\lambda }^*,\boldsymbol {\lambda }^*_{0} | \textbf {Q},\boldsymbol {\Omega },\widetilde {\textbf {A}})} $ can be further expressed as
3.5 Implementation of VBEM-M algorithm for LCDM
Assuming that the joint variational density of $(\boldsymbol {z},\boldsymbol {\pi },\boldsymbol {\lambda }^*,\boldsymbol {\lambda }^*_0)$ for the LCDM satisfies mean-field theory, the following equation holds:
Let $\varTheta =(\boldsymbol {z},\boldsymbol {\pi },\boldsymbol {\lambda }^*,\boldsymbol {\lambda }^*_0)$ ; in terms of Eqs. (9) and (21), the ELBO $\underline {\mathcal {L}}(q(\varTheta ))$ can then be derived as
where $\underline {\mathcal {L}}^*(q(\Theta ),\boldsymbol {\xi })$ is a tight lower bound of $\underline {\mathcal {L}}(q(\Theta ))$ . Next, we maximize $\underline {\mathcal {L}}^*(q(\Theta ),\boldsymbol {\xi })$ to obtain estimates of latent variables $\boldsymbol {z}$ , model parameters $(\boldsymbol {\pi }, \boldsymbol {\lambda }^*)$ , hyperparameters $\boldsymbol {\lambda }_0^*$ , and local point parameter $\boldsymbol {\xi }$ . Specifically, there are three steps to the implementation process:
-
(a) VBE step: update variational density for latent variable;
-
(b) VBM step: update variational densities for model parameters and hyperparameters;
-
(c) M step: update local point parameter $\boldsymbol {\xi }$ by maximizing $\underline {\mathcal {L}}^*(q(\Theta ),\boldsymbol {\xi })$ .
In the following text, $q^*(\cdot )$ denotes the optimal variational posterior in each iteration. To keep things simple, we only present the core formulation for updating. The specifics can be found in the Supplementary Material. The estimation procedure of the VBEM-M algorithm is shown in Table 1. In Table 1 and subsequent tables, all parameters are estimated using their posterior means. In addition, the specific implementation process of each step for the VBEM-M algorithm is shown in Figure 1.
(a) VBE step. In this step, we update the variational density of $\boldsymbol {z}_i$ for each i, where $i=1,\cdots ,N$ . $q^*(\boldsymbol {z}_i)$ is derived to be a categorical distribution with parameter $\boldsymbol {\pi }^*_i$ . That is,
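$$q^*(\boldsymbol {z}_i)=\prod _{l=1}^{L}\left (\pi ^*_{il}\right )^{z_{il}},$$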
where
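$$\pi ^*_{il}\propto \exp \left \{\text {E}_{q(\boldsymbol {\pi })}[\log \pi _l]+\sum _{j=1}^{J}\left [\log \sigma (\xi _{ijl})+\frac {y_{ij}\,\text {E}_{q(\boldsymbol {\lambda }^*_j)}[\boldsymbol {\lambda }^*_j]^{\text {T}}\boldsymbol {h}^*_{jl}-\xi _{ijl}}{2}-\lambda (\xi _{ijl})\left (\boldsymbol {h}^{*\text {T}}_{jl}\,\text {E}_{q(\boldsymbol {\lambda }^*_j)}[\boldsymbol {\lambda }^*_j\boldsymbol {\lambda }^{*\text {T}}_j]\,\boldsymbol {h}^*_{jl}-\xi _{ijl}^2\right )\right ]\right \},$$
with the proportionality constant chosen so that $\sum _{l=1}^{L}\pi ^*_{il}=1$ . This is the generic mean-field update implied by the tight lower bound in Section 3.3; the full derivation is given in the Supplementary Material.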
(b) VBM step. In this step, we update the variational density for $\boldsymbol {\pi }$ , $\boldsymbol {\lambda }^*_j\,(j=1,\cdots ,J)$ , $\eta _0$ , $\lambda _{0,main}$ and $\lambda _{0,inter}.$
(b1) Update the variational density for $\boldsymbol {\pi }$
$q^*(\boldsymbol {\pi })$ is derived to be a Dirichlet distribution with parameter $\boldsymbol {\delta }^*$ . That is,
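$$q^*(\boldsymbol {\pi })=\text {Dirichlet}\left (\boldsymbol {\pi }\mid \boldsymbol {\delta }^*\right )\propto \prod _{l=1}^{L}\pi _l^{\,\delta ^*_l-1},$$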
where
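$$\boldsymbol {\delta }^*=\boldsymbol {\delta }_0+\sum _{i=1}^{N}\text {E}_{q(\boldsymbol {z}_i)}[\boldsymbol {z}_i],\qquad \text {i.e.,}\quad \delta ^*_l=\delta _{0l}+\sum _{i=1}^{N}\pi ^*_{il},\quad l=1,\cdots ,L.$$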
(b2) Update the variational density for $\boldsymbol {\lambda }^*_j$
$q(\boldsymbol {\lambda }^*_j)$ is proportional to a multivariate normal distribution with mean vector $\boldsymbol {m}_{j}^*$ and covariance $\boldsymbol {V}_j^*$ . That is,
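$$q^*(\boldsymbol {\lambda }^*_j)\propto \exp \left \{-\frac {1}{2}\left (\boldsymbol {\lambda }^*_j-\boldsymbol {m}^*_j\right )^{\text {T}}\boldsymbol {V}_j^{*-1}\left (\boldsymbol {\lambda }^*_j-\boldsymbol {m}^*_j\right )\right \},$$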
where
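$$\boldsymbol {V}_j^{*-1}=\boldsymbol {\Sigma }_0^{-1}+2\sum _{i=1}^{N}\sum _{l=1}^{L}\pi ^*_{il}\,\lambda (\xi _{ijl})\,\boldsymbol {h}^*_{jl}\boldsymbol {h}^{*\text {T}}_{jl},\qquad \boldsymbol {m}_j^{*}=\boldsymbol {V}_j^{*}\left (\boldsymbol {\Sigma }_0^{-1}\,\text {E}_{q(\boldsymbol {\lambda }^*_0)}[\boldsymbol {\lambda }^*_0]+\frac {1}{2}\sum _{i=1}^{N}\sum _{l=1}^{L}\pi ^*_{il}\,y_{ij}\,\boldsymbol {h}^*_{jl}\right ),$$
in which $\boldsymbol {\Sigma }_0$ denotes the prior covariance of $\boldsymbol {\lambda }^*_j$ specified in Section 3.4. This is the variational Bayesian logistic-regression update of Jaakkola and Jordan (Reference Jaakkola and Jordan2000), weighted by the responsibilities $\pi ^*_{il}$ ; the exact expressions, including the expectations of the hyperparameters, are given in the Supplementary Material.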
(b3) Update the variational density for $\mathbf {\eta _0}$
$q^*(\eta _0)$ is proportional to a normal distribution with mean $\mu _{\eta _0}^*$ and variance $\sigma _{\eta _0}^{*2}$ . That is,
where
where $\text {E}_{q(\boldsymbol {\lambda }_j^*)}[\boldsymbol {\lambda }_j^*]_{\eta }$ is the corresponding expected value of the element $\eta _j$ in the vector $\boldsymbol {\lambda }^*$ .
(b4) Update the variational density for $\mathbf {\lambda _{0,main}}$
$q^*(\lambda _{0,main})$ is proportional to a truncated normal distribution with mean $\mu _{\lambda _{0,main}}^*$ and variance $\sigma _{\lambda _{0,main}}^{*2}$ . Specifically,
where
where $J^*_{main}$ denotes the number of all main effect terms, $J_d=\{j: \lambda _{jd}\neq 0 \}$ , and $\text {E}_{q(\boldsymbol {\lambda }_j^*)}[\boldsymbol {\lambda }_j^*]_{\lambda _d}$ is the corresponding expected value of the element $\lambda _{jd}$ in the vector $\boldsymbol {\lambda }^*$ .
(b5) Update the variational density for $\mathbf {\lambda _{0,inter}}$
$q^*(\lambda _{0,inter})$ is proportional to a truncated normal distribution with mean $\mu _{\lambda _{0,inter}}^*$ and variance $\sigma _{\lambda _{0,inter}}^{*2}$ . Specifically,
where
where $J^*_{inter}$ denotes the number of all interaction terms.
(c) M step. In this step, we update the local point parameter $\xi _{ijl}\,~(i=1,\cdots ,N;\;j=1,\cdots ,J;\;l=1,\cdots ,L)$ . To obtain the optimal $\xi _{ijl}$ , we need to maximize $\underline {\mathcal {L}}^*(q(\Theta ),\boldsymbol {\xi })$ by computing the derivative of $\xi _{ijl}$ to zero:
Therefore, we have
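$$\left (\xi _{ijl}^{\text {new}}\right )^2=\text {E}_{q(\boldsymbol {\lambda }^*_j)}\left [\left (\boldsymbol {\lambda }_j^{*\text {T}}\boldsymbol {h}^*_{jl}\right )^2\right ]=\boldsymbol {h}^{*\text {T}}_{jl}\left (\boldsymbol {V}^*_j+\boldsymbol {m}^*_j\boldsymbol {m}^{*\text {T}}_j\right )\boldsymbol {h}^*_{jl},$$
where, once again, the fact that $y_{ij}^2=1$ has been used.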
Considering the aforementioned presentation of the VBEM-M algorithm, it is clear that we need to compute a large number of expectations using categorical, Dirichlet, normal, multivariate normal, and truncated normal distributions. Some formulae for calculating these expectations are as follows:
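$$\text {E}_{q(\boldsymbol {z}_i)}[z_{il}]=\pi ^*_{il},\qquad \text {E}_{q(\boldsymbol {\pi })}[\log \pi _l]=\psi (\delta ^*_l)-\psi \left (\sum \nolimits _{l'=1}^{L}\delta ^*_{l'}\right ),$$
$$\text {E}_{q(\boldsymbol {\lambda }^*_j)}[\boldsymbol {\lambda }^*_j]=\boldsymbol {m}^*_j,\qquad \text {E}_{q(\boldsymbol {\lambda }^*_j)}[\boldsymbol {\lambda }^*_j\boldsymbol {\lambda }^{*\text {T}}_j]=\boldsymbol {V}^*_j+\boldsymbol {m}^*_j\boldsymbol {m}^{*\text {T}}_j,$$
$$\text {E}[\lambda ]=\mu +\sigma \,\frac {\phi \left ((c-\mu )/\sigma \right )}{1-\Phi \left ((c-\mu )/\sigma \right )}\quad \text {for }\lambda \sim \text {N}(\mu ,\sigma ^2)\text { truncated below at }c,$$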
where $\psi (\cdot )$ is $\psi (x)=\frac {\mathrm {d}}{\mathrm {d}x}\log \Gamma (x)$ , $\Gamma (x)=\int _0^{\infty } t^{(x-1)}\exp (-t)\, dt $ , $\phi (\cdot )$ is the density function of a standard normal distribution, and $\Phi (\cdot )$ is the cumulative distribution function of a standard normal distribution.
4 Simulation study
In the following simulation studies, we address three primary concerns: first, the performance of the VBEM-M algorithm under various conditions for the DINA model; second, how the performance of the VBEM-M algorithm for the DINA model compares with Yamaguchi and Okada’s (Reference Yamaguchi and Okada2020b) VB method, the MCMC algorithms within the full Bayesian framework, and the EM algorithm in the frequentist framework under different simulation settings; third, how the performance of the VBEM-M algorithm compares with the VB, MCMC, and EM algorithms for the saturated LCDM under different simulation conditions. The Supplementary Material showcases the performance of the VBEM-M algorithm for the DINA model under different initial values and in other widely used CDMs, including the DINO model and LLM.
4.1 Data generation
Item response data $x_{ij}$ are generated from a Bernoulli distribution with probability of correct response $P(x_{ij}=1|\boldsymbol {\alpha }_i,\boldsymbol {\lambda }^{*},\boldsymbol {q}_j)$ . The true values of the item parameters under the DINA model are set at four different noise levels to investigate the relationship between noise and parameter recovery. For each item, the following scenarios are considered. (a1) Low noise level (LNL): $s_j=g_j=0.1$ , with corresponding true values $\eta _j= -2.1972$ , $\lambda _j=4.3944$ . (a2) High noise level (HNL): $s_j=g_j=0.2$ , with corresponding true values $\eta _j= -1.3863$ , $\lambda _j=2.7726$ . (a3) Slipping higher than guessing (SHG): $s_j=0.2$ , $g_j=0.1$ , with corresponding true values $\eta _j= -2.1972$ , $\lambda _j=3.5835$ . (a4) Guessing higher than slipping (GHS): $s_j=0.1$ , $g_j=0.2$ , with corresponding true values $\eta _j= -1.3863$ , $\lambda _j=3.5835$ .
To generate the attribute-mastery patterns, we used the same procedure as Chiu and Douglas (Reference Chiu and Douglas2013), which takes into account the correlations among the attributes. Specifically, $\boldsymbol {\alpha }^*_{i}=(\alpha ^*_{i1},\cdots ,\alpha ^*_{ik},\cdots ,\alpha ^*_{iK})^{\text {T}}$ are generated from a multivariate normal distribution; that is, $\boldsymbol {\alpha }^*_i\sim \text {N}(\boldsymbol {0}_K,\boldsymbol {\Sigma }_{K\times K})$ , where $\boldsymbol {0_K}=(0,\dots ,0)^{\text {T}}_{K\times 1}$ and
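$$\boldsymbol {\Sigma }_{K\times K}=\begin{pmatrix}1&\sigma &\cdots &\sigma \\ \sigma &1&\cdots &\sigma \\ \vdots &\vdots &\ddots &\vdots \\ \sigma &\sigma &\cdots &1\end{pmatrix},$$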
where the off-diagonal elements of $\Sigma _{K\times K}$ are $\sigma $ . As $\sigma $ increases from 0 to 1, the correlation between attributes also increases from 0 to maximum. The relationships between the attribute profiles $\boldsymbol {\alpha }_i$ and $\boldsymbol {\alpha }^*_i$ can be expressed as $\alpha _{ik}=1$ if $\alpha _{ik}^*>0$ and $\alpha _{ik}=0$ otherwise. Although the $\text {Q}$ -matrices are created randomly, they still conform to the identifiability constraints outlined by Chen et al. (Reference Chen, Liu, Xu and Ying2015, Reference Chen, Culpepper, Chen and Douglas2017), Liu and Andersson (Reference Liu, Andersson and Skrondal2020), and Xu and Shang (Reference Xu and Shang2018). We present the $\text {Q}$ -matrices used in these simulations in the Supplementary Material.
4.2 Prior distributions
The prior parameter $\boldsymbol {\delta }_0$ is set as $\boldsymbol {\delta }_0=\boldsymbol {1}_L$ (Culpepper, Reference Culpepper2015; Zhan et al., Reference Zhan, Jiao, Man and Wang2019), where $\boldsymbol {1}_L$ denotes a L-dimensional vector with all elements equal to 1. The hyperparameters are chosen as follows: $\mu _{\eta _0}=-2$ , $\mu _{\lambda _{main}}=\mu _{\lambda _{inter}}=0$ , and $\sigma ^2_{\eta _0}=\sigma ^2_{\lambda _{main}}=\sigma ^2_{\lambda _{inter}}=10$ .
4.3 Estimation software
We implemented four different approaches, namely, the VBEM-M algorithm, VB algorithm, MCMC sampling algorithm, and EM algorithm, using the R programming language (R Core Team, 2017) on a desktop computer equipped with Intel (R) Core (TM) i5-10400 CPU @ 2.90GHz, 16GB RAM. To enhance the computational efficiency of the VBEM-M method, we utilized two R packages, “Rcpp” (Eddelbuettel & Francois, Reference Eddelbuettel and Francois2011) and “RcppArmadillo” (Eddelbuettel & Sanderson, Reference Eddelbuettel and Sanderson2014), to call the C++ programming language. The R code of our VBEM-M algorithm can be found in the Supplementary Material. We used the R package “variationalDCM” (Hijikata et al., Reference Hijikata, Oka, Yamaguchi and Okada2023) to implement the VB method. The MCMC sampling algorithms were implemented separately using the R packages “dina” (Culpepper & Balamuta, Reference Culpepper and Balamuta2019), which is integrated with the C++ program, and “R2jags” (Su & Yajima, Reference Su and Yajima2015) which is associated with the JAGS program (Plummer, Reference Plummer2003). The EM algorithm was implemented using the R packages “GDINA” (Ma & de la Torre, Reference Ma and de la Torre2020) and “CDM” (George et al., Reference George, Robitzsch, Kiefer, Groß and Ünlü2016), respectively.
4.4 Convergence diagnosis
The VBEM-M algorithm was considered to have converged if the absolute difference between two consecutive iterations was less than $e_0=10^{-4}$ , or if the number of iterations T had reached 2,000. When the R packages “dina” and “R2jags” were used to implement the MCMC sampling algorithms for the DINA model, the chain length was set to 2,000 and the first 1,000 iterations were discarded as a “burn-in” period, since Culpepper (Reference Culpepper2015) demonstrated that the chain converges within 750 iterations. For the saturated LCDM, we chose a chain length of 10,000, with a burn-in of 5,000. For the EM algorithm, when employing the R package “GDINA,” the convergence criterion is that the maximum absolute change in item success probabilities between consecutive iterations is smaller than $e_0=10^{-4}$ or that T exceeds 2,000. In addition, when using the R package “CDM,” iteration ends when the maximal change in parameter estimates falls below $e_0=0.001$ .
4.5 Evaluation Criteria
For item parameters and class membership probability parameters, we assess the accuracy of parameter estimation using bias and root mean square error (RMSE). For attribute parameters, we adopt the following two evaluation indices: the pattern-wise agreement rate (PAR), which indicates the rates of correct classification for attribute patterns, and is formulated as
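$$\text {PAR}=\frac {\sum _{i=1}^{N}I\left (\hat {\boldsymbol {\alpha }}_i=\boldsymbol {\alpha }_i\right )}{N},$$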
and the attribute-wise agreement rate (AAR), which signifies the rates of correct classification for individual attributes, and is defined as
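$$\text {AAR}=\frac {\sum _{i=1}^{N}\sum _{k=1}^{K}I\left (\hat {\alpha }_{ik}=\alpha _{ik}\right )}{NK},$$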
where $\boldsymbol {\alpha }_i$ is the true value of the ith student’s attribute profile and $\hat {\boldsymbol {\alpha }}_i$ is the estimated value of $\boldsymbol {\alpha }_i$ . $\hat {\alpha }_{ik}$ is the estimated value of $\alpha _{ik}$ for the specific attribute k.
4.6 Simulation study 1
In this simulation study, we explored the performance of the VBEM-M algorithm under various simulation conditions. The test length was set to $J=30$ , the number of attributes to $K=5$ , and the corresponding $\text {Q}$ -matrix is shown in the Supplementary Material. The following manipulated conditions were considered: (A) number of examinees $N=1,000$ and 2,000; (B) correlation among attributes $\sigma =0$ , 0.3 and 0.7; and (C) noise levels LNL, HNL, SHG, and GHS. Fully crossing different levels of these three factors yields 24 conditions (2 sample sizes $\times $ 3 correlations $\times $ 4 noise levels). There were 100 replications for each simulation condition. The recovery results of parameters are displayed in Tables 2 and 3 and Figure 3.
Note: The values outside parentheses represent the RMSE, while the values inside the parentheses indicate the bias. These reflect the average RMSE, bias and SD for all intercept parameters $\boldsymbol {\eta }$ , slope parameters $\boldsymbol {\lambda }$ , and class membership probability parameters $\boldsymbol {\pi }$ .
Note: AAR1 represents the correct classification rate for the first attribute, AAR2 for the second attribute, AAR3 for the third attribute, AAR4 for the fourth attribute, and AAR5 for the fifth attribute. PAR stands for the pattern-wise agreement rate.
The following conclusions can be drawn from Tables 2 and 3. (1) Given the correlation and noise levels, when the number of examinees is increased from 1,000 to 2,000, the average RMSE, the average bias, and the standard deviation (SD) for $\boldsymbol {\eta }$ , $\boldsymbol {\lambda }$ , and $\boldsymbol {\pi }$ show decreasing trends. For example, when the correlation among attributes is 0.3 and the LNL is applied, increasing the number of examinees from 1,000 to 2,000 results in the average bias of $\boldsymbol {\eta }$ decreasing from -0.0140 to -0.0077, and the average bias of $\boldsymbol {\lambda }$ decreasing from 0.0307 to 0.0133. The average RMSE of $\boldsymbol {\eta }$ decreases from 0.1369 to 0.0981, the average RMSE of $\boldsymbol {\lambda }$ from 0.2337 to 0.1669, and the average RMSE of $\boldsymbol {\pi }$ from 0.0022 to 0.0016. The SD of $\boldsymbol {\eta }$ decreases from 0.0937 to 0.0664, the SD of $\boldsymbol {\lambda }$ decreases from 0.1617 to 0.1152, and the SD of $\boldsymbol {\pi }$ decreases from 0.0051 to 0.0037. (2) When the number of examinees and the noise level are given, with increasing $\sigma $ , the average RMSE for $\boldsymbol {\eta }$ increases only slightly. This indicates that $\boldsymbol {\eta }$ is less impacted by the correlation between attributes. $\boldsymbol {\lambda }$ is substantially more impacted by $\sigma $ ; specifically, the average bias and RMSE for $\boldsymbol {\lambda }$ tend to decrease markedly as $\sigma $ increases. Meanwhile, the average RMSE for $\boldsymbol {\pi }$ also tends to decrease as $\sigma $ increases. For example, when the number of examinees is fixed at 1,000 and the LNL noise level is applied, the average biases of $\boldsymbol {\eta }$ are -0.0118, -0.0140, and -0.0104 for $\sigma =0$ , 0.3, and 0.7, respectively, and the corresponding average RMSE rises only from 0.1351 to 0.1388 as $\sigma $ increases. The changes in the bias and RMSE of $\boldsymbol {\eta }$ are thus slight. However, the decreases in bias and RMSE are markedly greater for $\boldsymbol {\lambda }$ , with the average bias of $\boldsymbol {\lambda }$ decreasing from 0.0365 to 0.0279 and the corresponding average RMSE decreasing from 0.2560 to 0.2216. For $\boldsymbol {\pi }$ , the average bias remains at 0.0000 in all conditions, while the average RMSE exhibits the largest change in the HNL condition, decreasing from 0.0058 to 0.0048. (3) The accuracy of attribute profile recovery is highest under the LNL condition because the noise is the lowest. For example, with a fixed number of examinees at 1,000 and a correlation of $\sigma =0$ , the PAR is 0.9025 under the LNL condition and only 0.6736 under the HNL condition. Under the LNL condition, the AAR values for the five attributes exceed 0.9667 across various sample sizes and levels of attribute correlation. Moreover, the accuracy of attribute profile recovery tends to improve as $\sigma $ increases.
In Figure 3, for illustration, we only show the recovery results for the LNL and HNL based on the sample size $N=1,000$ . On each item, the bias of $\boldsymbol {\eta }$ is almost the same for the LNL and the HNL. Furthermore, when the correlation between attributes is strengthened ( $\sigma $ from 0 to 0.7), there is no difference between the bias and RMSE of $\boldsymbol {\eta }$ in the LNL (HNL). It was also found that, for both low and high levels of noise, the RMSE of $\boldsymbol {\eta }$ is lower when the items evaluate more attributes. At low noise levels, for instance, the RMSE of $\boldsymbol {\eta }$ for the first item, which evaluates one attribute, is greater than that for the eleventh item, which evaluates the first three attributes together. For $\boldsymbol {\lambda }$ , although the bias of $\boldsymbol {\lambda }$ differs on each item at low and high noise levels, the bias values are all close to 0. Similarly, for both low and high levels of noise, the RMSE of $\boldsymbol {\lambda }$ is lower when the attributes measured by the items are more highly correlated. This is because, as the attribute correlation increases, more accurate estimates of $\boldsymbol {\alpha }$ are obtained, which in turn enhances the accuracy of the $\boldsymbol {\lambda }$ estimates. This also provides empirical guidance for our later practical research. That is, when designing items, one should aim for tests in which the measured attributes are more highly correlated, as this increases the accuracy of parameter estimation.
Additionally, we assess the performance of the VBEM-M algorithm under different initial values (please see the Supplementary Material for details), and the results showed that our VBEM-M algorithm is not affected by the different initial values.
4.7 Simulation study 2
The purpose of this simulation study is to compare the proposed method with Yamaguchi and Okada’s (Reference Yamaguchi and Okada2020b) VB method, the MCMC sampling algorithms, and the EM algorithm in terms of parameter accuracy for the DINA model. Specifically, the R package “variationalDCM” was used to implement Yamaguchi and Okada’s (Reference Yamaguchi and Okada2020b) VB method, while the R packages “dina” and “R2jags” were used to implement the MCMC sampling algorithms. The EM algorithm was implemented using the R packages “GDINA” and “CDM”.
The simulation design is as follows: the test length was fixed at $J=30$ , and the number of attributes was set to $K=5$ . The varying conditions of the simulation are as follows: (D) The number of examinees $N=200$ , 500, 1,000, and 2,000; (E) correlation among attributes $\sigma =0$ , 0.3, and 0.7; and (F) LNL and HNL conditions. Fully crossing different levels of these three factors yields 24 conditions (4 sample sizes $\times $ 3 correlations $\times $ 2 noise levels). Each simulation condition was replicated 100 times. The recovery results for item parameters and attribute profiles for all six methods are shown in Tables 4 and 5. Due to the space limit, we only present the results with the correlation $\sigma =0.3$ in Tables 4 and 5; the other two correlation cases ( $\sigma =0$ and $\sigma =0.7$ ) are given in the Supplementary Material. Figure 2 depicts the boxplots of the bias and RMSE for $\boldsymbol {\eta }$ , $\boldsymbol {\lambda }$ , and $\boldsymbol {\pi }$ estimated by the six methods with $\sigma =0.3$ under the LNL condition. Table 6 shows the computation time for these six methods under the same conditions. Here, the displayed computation time is the average time across 100 replications.
Note: The values outside the parentheses represent the RMSE, while the values inside the parentheses indicate the bias. Here, RMSE and Bias denote the average RMSE and Bias, respectively, for all intercept parameters $\boldsymbol {\eta }$ , all slope parameters $\boldsymbol {\lambda }$ and all class membership probability parameters $\boldsymbol {\pi }$ .
Note: AAR1 represents the correct classification rate for the first attribute, AAR2 for the second attribute, AAR3 for the third attribute, AAR4 for the fourth attribute, and AAR5 for the fifth attribute. PAR stands for the pattern-wise agreement rate.
In Tables 4 and 5, as well as in the subsequent simulation studies, the RMSE and bias mentioned are the average RMSE and average bias. From Tables 4 and 5, we can draw the following conclusions: (1) The VBEM-M algorithm consistently outperforms the other five methods in terms of achieving lower RMSE values for item parameters $\boldsymbol {\eta }$ and $\boldsymbol {\lambda }$ under all four sample sizes, regardless of LNL or HNL condition. (2) For the EM algorithm, both the EM-GDINA and EM-CDM methods have higher bias and RMSE for item parameters $\boldsymbol {\eta }$ and $\boldsymbol {\lambda }$ than the other four methods, especially for a small sample size of $N=200$ , under both LNL and HNL conditions. (3) With the same sample size and noise level, both MCMC methods (MCMC-dina and MCMC-R2jags) show similar estimation accuracy, as do the two EM methods (EM-GDINA and EM-CDM). (4) For the parameter $\boldsymbol {\pi }$ , the estimated bias and RMSE of the six methods are basically the same under identical simulation conditions, with no significant differences. (5) In terms of the accuracy of attribute profile recovery, the results of the six methods are essentially the same under each simulation condition.
From Table 6, we can see that the VBEM-M algorithm is highly efficient in terms of computation time. It is faster than the VB method across most simulation conditions, and this speed advantage becomes more noticeable as the sample size increases. Overall, the computation speed of the VBEM-M algorithm is second only to the two EM algorithms, i.e., EM-GDINA and EM-CDM. The two MCMC-based methods, MCMC-dina and MCMC-R2jags, have longer computation times than the other four methods. Additionally, MCMC-dina is faster than MCMC-R2jags due to its use of the “Rcpp” and “RcppArmadillo” packages, which are built on the C++ programming language.
4.8 Simulation study 3
This simulation study aims to evaluate the effectiveness of the VBEM-M algorithm on the saturated LCDM by comparing it with Yamaguchi and Okada’s (Reference Yamaguchi and Okada2020a) VB method, the MCMC sampling algorithms, and the EM algorithm. Specifically, the R package “variationalDCM” was used to implement Yamaguchi and Okada’s (Reference Yamaguchi and Okada2020a) VB method, the R package “R2jags” was used to implement the MCMC sampling algorithms, and the EM algorithm was implemented using the R package “GDINA”.
This simulation was designed with an attribute number of $K=3$ and a test length of $J=18$ . In the saturated LCDM, each item’s $\boldsymbol {\lambda }_j^*$ is an 8-dimensional vector ( $2^3=8$ ). The true values of $\boldsymbol {\lambda }^*$ are shown in Table 7. We conducted simulations across different sample sizes (N=1,000, 2,000) and attribute correlations ( $\sigma =0, 0.3, 0.7$ ), resulting in six different conditions. Each condition was replicated 100 times. Notably, an additional calculation procedure was needed for Yamaguchi and Okada’s (Reference Yamaguchi and Okada2020a) VB method, as the R package “variationalDCM” only reports the correct response probabilities for different attribute mastery patterns. We transformed these probabilities into LCDM parameters by solving a linear system of equations (Liu & Johnson, Reference Liu, Johnson, von Davier and Lee2019; Yamaguchi & Templin, Reference Yamaguchi and Templin2022). The parameter recovery results for the $\sigma =0.3$ condition are displayed in Tables 8 and 9. The estimation results for $\sigma =0$ and 0.7 are available in the Supplementary Material.
Note: The values outside the parentheses represent the RMSE, while the values inside the parentheses indicate the bias. Here, RMSE and Bias denote the average RMSE and Bias, respectively, for all intercept parameters $\boldsymbol {\eta }$ , all main effect slope parameters $\boldsymbol {\lambda }_{main}$ , all interaction slope parameters $\boldsymbol {\lambda }_{inter}$ and all class membership probability parameters $\boldsymbol {\pi }$ .
Note: AAR1 represents the correct classification rate for the first attribute, AAR2 for the second, and AAR3 for the third. PAR stands for the pattern-wise agreement rate.
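As noted above, the pattern-wise correct-response probabilities returned by “variationalDCM” were converted into LCDM parameters by solving a linear system of equations. A minimal sketch of this conversion for $K=3$ is given below; the probabilities shown are illustrative values rather than estimates from the study.

```r
## Convert pattern-wise correct-response probabilities into saturated-LCDM
## item parameters for K = 3 by inverting the 8 x 8 design matrix.
K <- 3
patterns <- expand.grid(rep(list(0:1), K))   # the 2^3 = 8 attribute patterns
colnames(patterns) <- c("a1", "a2", "a3")

# Full design matrix: intercept, main effects, and all interactions
M <- model.matrix(~ a1 * a2 * a3, data = patterns)

# Illustrative correct-response probabilities, one per attribute pattern,
# ordered as the rows of "patterns" (assumed values for this example)
p_item <- c(0.10, 0.25, 0.25, 0.55, 0.25, 0.55, 0.55, 0.90)

# logit(p) = M %*% lambda*, so lambda* = M^{-1} logit(p)
lambda_star <- solve(M, qlogis(p_item))
round(lambda_star, 3)   # eta, lambda_1..lambda_3, lambda_12, ..., lambda_123
```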
For a more detailed analysis, we split the parameter $\boldsymbol {\lambda }$ into two parts: $\boldsymbol {\lambda }_{main}$ (i.e., $\boldsymbol {\lambda }_1,\boldsymbol {\lambda }_2,\boldsymbol {\lambda }_3$ ) and $\boldsymbol {\lambda }_{inter}$ (i.e., $\boldsymbol {\lambda }_{12},\boldsymbol {\lambda }_{13},\boldsymbol {\lambda }_{23},\boldsymbol {\lambda }_{123}$ ), which represent the main effects and the interactions, respectively. From the results, we can draw the following conclusions: (1) As the number of examinees increases, the RMSE for the item parameters decreases for all algorithms. (2) The proposed VBEM-M algorithm performs better than the other algorithms on all item parameters across all conditions, especially on the interactions. Specifically, for the parameters $\boldsymbol {\eta }$ and $\boldsymbol {\pi }$ , VBEM-M has a slight advantage over the other algorithms, whereas it shows a clear advantage in estimating $\boldsymbol {\lambda }_{main}$ and $\boldsymbol {\lambda }_{inter}$ , particularly $\boldsymbol {\lambda }_{inter}$ . By contrast, the EM algorithm performs poorly with small sample sizes; for the $\boldsymbol {\lambda }_{inter}$ parameters, its RMSE exceeds 2 when $N=200$ . (3) Compared to the other algorithms, VBEM-M performs markedly better with small sample sizes ( $N=200, 500$ ), with noticeably lower RMSE. (4) It is worth noting that, across all algorithms, the interaction terms are estimated less accurately than the main effects even though their true values are smaller, which suggests that estimating interaction effects is the most challenging aspect of the saturated LCDM. (5) As for the accuracy of attribute profiles, there is no obvious difference among the algorithms, although VBEM-M still shows slightly higher accuracy than the others.
Table 10 shows the average computation time across 100 replications for the four algorithms under the $\sigma =0.3$ condition. The results indicate that our algorithm is the most computationally efficient of the four. Additionally, an interesting observation is that the EM algorithm takes the longest time when the sample size is small $(N = 200)$ , suggesting that the EM algorithm converges more slowly with smaller sample sizes.
5 Empirical example
5.1 Empirical Example 1
In this example, a fraction subtraction test dataset (de la Torre & Douglas, Reference de la Torre and Douglas2004; Tatsuoka, Reference Tatsuoka, Frederiksen, Glaser, Lesgold and Shafto1990, Reference Tatsuoka2002) was analyzed using the DINA model. The VBEM-M algorithm, the VB algorithm (implemented in the “variationalDCM” package), the MCMC sampling technique (implemented in the “dina” package), and the EM algorithm (implemented in the “GDINA” package) were used for parameter estimation. The test involves 2,144 middle school students responding to 15 fraction subtraction items that measure five attributes: subtract basic fractions, reduce and simplify, separate whole from fraction, borrow from whole, and convert whole to fraction; 536 of the 2,144 students were chosen for this study (Zhang et al., Reference Zhang, Zhang, Lu and Tao2020). The corresponding $\text {Q}$ -matrix, parameter estimates, and standard deviations (SDs) are shown in Table 11.
Note: The values outside the parentheses represent the posterior means of the parameters, while the values inside the parentheses indicate the standard deviation.
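For readers who wish to reproduce the DINA fit described above, a minimal sketch of the EM estimation is given below. It assumes the fraction subtraction data distributed with the “CDM” package; the object names, and whether they correspond exactly to the 536-examinee, 15-item subset analyzed here, are assumptions to verify against the installed package.

```r
## Sketch of the EM fit for Empirical Example 1 (DINA model).
## Dataset object names are assumptions; see ?CDM::data.fraction1.
library(CDM)
library(GDINA)

X <- CDM::data.fraction1$data       # fraction subtraction responses (assumed)
Q <- CDM::data.fraction1$q.matrix   # Q-matrix with five attributes (assumed)

fit_dina <- GDINA::GDINA(dat = X, Q = Q, model = "DINA", verbose = 0)
coef(fit_dina, what = "gs")   # guessing and slipping parameters per item
personparm(fit_dina)          # EAP estimates of the attribute profiles
```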
To facilitate the following item analysis, we transformed the estimates of the intercept and interaction parameters into the traditional estimates of the slipping and guessing parameters, as shown in Table 11. A comparison of the parameter estimates across the four algorithms can be found in the Supplementary Material. Based on Table 11, the five items with the lowest estimated slipping parameters are items 3, 8, 9, 10, and 7, in that order, with estimates of 0.0395, 0.0480, 0.0652, 0.0664, and 0.0773, respectively. These five items are therefore less prone to slipping than the other ten items. The five items with the highest estimated guessing parameters are items 2, 10, 8, 5, and 13, in that order, with estimates of 0.2035, 0.1658, 0.1417, 0.1307, and 0.1293, respectively. Moreover, items 3, 8, and 10 combine low slipping parameters with relatively high guessing parameters, indicating that these items are more likely to be answered correctly by guessing. Item 1 shows an interesting pattern: because $g_1$ is very small and $s_1$ is relatively large, students who have not mastered the first attribute are unlikely to answer correctly by guessing (the probability of a correct response is below 0.0200), and even students who have mastered the first attribute answer correctly with a probability of only about 0.7000 because of slipping.
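The transformation used above is the standard one for the DINA special case of the LCDM. Writing $\eta _j$ for item $j$’s intercept and $\lambda _j$ for its highest-order interaction coefficient (the only non-zero slope under the DINA constraint), the guessing and slipping parameters follow as

$$ g_j=\frac{\exp (\eta _j)}{1+\exp (\eta _j)}, \qquad 1-s_j=\frac{\exp (\eta _j+\lambda _j)}{1+\exp (\eta _j+\lambda _j)}, $$

so that $g_j$ is the correct-response probability for examinees lacking at least one required attribute and $1-s_j$ is that probability for examinees who have mastered all required attributes.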
Based on the results in Table S11 in the Supplementary Material, we examined the relationship between the VBEM-M algorithm and the other three algorithms by analyzing the correlations of the parameters s and g across algorithms. The correlations between the s estimates from VBEM-M and those from the VB, MCMC, and EM algorithms are 0.9984, 0.9979, and 0.9989, respectively; the corresponding correlations for the g estimates are 0.9488, 0.9552, and 0.8632. These findings suggest that the VBEM-M estimates align closely with those of all three algorithms for s, whereas for g they align more closely with the VB and MCMC estimates than with the EM estimates. In addition, the estimated mixing proportions of the attribute-mastery patterns, $\hat {\pi }_l$ for $l = 1, \ldots , 2^5 = 32$ , are presented in Figure S2 in the Supplementary Material. These estimates are highly consistent across the VBEM-M algorithm, the VB algorithm, the MCMC sampling technique, and the EM algorithm. A total of 67% of the examinees were classified into the following four attribute profiles: (1,1,1,0,0), (1,1,1,1,0), (1,1,1,0,1), and (1,1,1,1,1), suggesting that a majority of students have mastered the first three attributes. The computation times for the VBEM-M, VB, MCMC, and EM algorithms were 0.1651 s, 0.1661 s, 11.3820 s, and 0.2870 s, respectively.
5.2 Empirical Example 2
In this section, we analyze the Examination for the Certificate of Proficiency in English (ECPE) dataset based on the LCDM. The ECPE has been widely used in previous LCDM research (e.g., Liu & Johnson, Reference Liu, Johnson, von Davier and Lee2019; Templin & Bradshaw, Reference Templin and Bradshaw2014; Templin & Hoffman, Reference Templin and Hoffman2013; von Davier, Reference von Davier2014b), and it comprises dichotomous (0-1) responses from 2,922 examinees to 28 items. Three attributes are measured: morphosyntactic rules, cohesive rules, and lexical rules. Nine of the 28 items measure two attributes, and the remaining items measure one. The VBEM-M algorithm, the VB algorithm (implemented in the “variationalDCM” package), the MCMC algorithm (implemented in the “R2jags” package), and the EM algorithm (implemented in the “GDINA” package) were used for parameter estimation of the LCDM. However, due to space limitations, we present only the estimation results of the VBEM-M method in Tables 12 and 13. The results of the other algorithms can be found in the Supplementary Material.
Note: The values outside the parentheses represent the posterior means of the parameters, while the values inside the parentheses indicate the standard deviation.
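A minimal sketch of the corresponding EM fit is shown below. It assumes the ECPE data distributed with the “CDM” package (the object names and the leading examinee-id column are assumptions to verify) and uses the logit-link GDINA model, which shares the saturated LCDM parameterization.

```r
## Sketch of the EM fit for Empirical Example 2 (saturated LCDM on the ECPE).
## Dataset object names are assumptions; see ?CDM::data.ecpe.
library(CDM)
library(GDINA)

X <- CDM::data.ecpe$data[, -1]   # drop the examinee id column (assumed layout)
Q <- CDM::data.ecpe$q.matrix     # 28 x 3 Q-matrix

fit_lcdm <- GDINA::GDINA(dat = X, Q = Q, model = "logitGDINA", verbose = 0)
coef(fit_lcdm, what = "delta")       # intercepts, main effects, interactions
personparm(fit_lcdm, what = "MAP")   # attribute-profile classifications
```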
The results of the VBEM-M algorithm were most similar to those of the VB algorithm and the MCMC algorithm; see the Supplementary Material for details. From Table 12, we found that the estimates of the interaction terms are relatively small compared to the main effects, indicating that the main effects have a greater influence on the probability of a correct response. Additionally, most of the interaction effects are positive, suggesting that interactions between skills tend to increase the probability of a correct response. Furthermore, from the estimates of $\boldsymbol {\pi }$ in Table 13, we can observe that the most prevalent attribute mastery patterns are (0, 0, 0), (0, 0, 1), (0, 1, 1), and (1, 1, 1). This suggests a possible linear hierarchy among the skills: mastering cohesive rules requires first mastering lexical rules, and mastering morphosyntactic rules requires first mastering cohesive rules. This finding is consistent with previous research conclusions (Gierl, Cui, et al., Reference Gierl, Cui and Hunka2007; Gierl, Leighton, et al., Reference Gierl, Leighton, Hunka, Leighton and Gierl2007).
6 Discussion
In this paper, we propose the novel VBEM-M algorithm for estimating the parameters of the LCDM, which offers fast execution and excellent estimation accuracy. While Yamaguchi and Okada (Reference Yamaguchi and Okada2020a) introduced a VB method for estimating LCDM parameters, their approach primarily focuses on estimating the probability of correct item responses for specific attribute-mastery patterns, without directly estimating the item parameters. In contrast, our VBEM-M algorithm can simultaneously and directly estimate both attribute-mastery patterns and item parameters.
Since the posterior distributions of the item parameters in the LCDM do not have closed forms, parameter estimation with the classic VBEM algorithm is difficult. To address this problem, our approach replaces the likelihood function of the LCDM with a tight lower bound obtained by a Taylor expansion, and inference is then performed on this bound. Under the bound, the item-parameter terms take an exponential form, which allows a Gaussian distribution to be used as the conjugate prior. Additionally, a new location parameter $\xi $ is introduced when implementing the Taylor expansion, and an extra maximization step is added to the typical VBEM algorithm to seek the optimal local point $\xi $ . Three simulation studies were carried out: the first two focused on the DINA model as a special case of the LCDM, while the third considered the saturated LCDM. The parameter recovery of the VBEM-M algorithm was analyzed under the simulated conditions, and the algorithm was shown to be effective in terms of parameter recovery, execution time, and convergence rate. In addition, the estimation accuracy and computation time of the VB, MCMC, and EM algorithms were investigated in depth.
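For orientation, the general idea behind this step can be illustrated with the standard local lower bound for the logistic function $\sigma (x)=1/(1+e^{-x})$ (a Jaakkola–Jordan-type bound); the expression below is shown as an illustration of this class of bounds rather than as the exact bound derived in this paper:

$$ \sigma (x)\;\ge \;\sigma (\xi )\exp \left\{ \frac{x-\xi }{2}-\tau (\xi )\left(x^{2}-\xi ^{2}\right)\right\} , \qquad \tau (\xi )=\frac{1}{2\xi }\left(\sigma (\xi )-\frac{1}{2}\right). $$

Because $x$ enters the exponent only linearly and quadratically, substituting the LCDM linear predictor for $x$ makes the bounded likelihood Gaussian in the item parameters, which is why a Gaussian prior is conjugate under the bound; the bound is exact at $x=\pm \xi $ , so re-optimizing $\xi $ in an additional maximization step keeps it tight.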
To begin with, the VBEM-M algorithm produces favorable parameter recovery, providing three main benefits. First, the VBEM-M algorithm can be applied across a wide range of sample sizes, and its accuracy improves as the sample size increases. Based on the DINA model, we found that higher attribute correlation does not affect the $\boldsymbol {\eta }$ estimates but improves the accuracy of the $\boldsymbol {\lambda }$ estimates. In addition, the VBEM-M algorithm converges quickly and is not sensitive to the choice of initial values, typically stabilizing within approximately ten iterations across the simulation conditions, which brings considerable efficiency gains.
The second benefit is that the VBEM-M algorithm has a considerable accuracy advantage over the other algorithms, especially when the sample size is small. For instance, in the DINA model with $N=200$ , $K=5$ , and $\sigma =0.3$ under the LNL condition, the RMSEs of $\boldsymbol {\lambda }$ using VBEM-M, VB, MCMC-dina, MCMC-R2jags, EM-GDINA, and EM-CDM are 0.4500, 0.5507, 0.5470, 0.5554, 1.0239, and 1.0238, respectively. Our method shows clear advantages, particularly over the EM algorithms, although this benefit diminishes as the sample size increases. This makes the VBEM-M algorithm especially attractive for the smaller sample sizes that often occur in real-world applications.
Finally, the VBEM-M algorithm stands out for its computational efficiency. While not as fast as the EM algorithms, it still holds an advantage over the other algorithms. For example, for the DINA model with $N=2,000$ , $J=30$ , $K=5$ , and $\sigma =0.3$ , the average computation times across 100 replications are 0.3686 s, 0.5136 s, 93.5225 s, 2061.8450 s, 0.1949 s, and 0.2097 s for VBEM-M, VB, MCMC-dina, MCMC-R2jags, EM-GDINA, and EM-CDM, respectively. Compared with the two EM algorithms, our algorithm is slower by only 0.1737 s and 0.1589 s, respectively, and it outperforms the remaining algorithms. This suggests that the VBEM-M algorithm performs well in terms of computational efficiency.
While the VBEM-M algorithm has its advantages, it also has some limitations. For instance, as mentioned above, it is not as fast as the EM algorithm. In addition, the VBEM-M algorithm is essentially an approximation of the posterior distribution of the parameters; it works well for the DINA model and some LCDM submodels, as shown in the Supplementary Material, but its performance in complex LCDMs with high attribute dimensions (such as a 32-dimensional $\boldsymbol {\lambda }^*$ for $K=5$ ) still needs to be investigated.
In future studies, first, we will explore whether the VBEM-M algorithm can be generalized to other types of CDMs, such as polytomous CDMs and longitudinal CDMs. Second, in this study the $\text {Q}$ -matrix was calibrated in advance; in practice, however, there is a potential for mis-specification (Rupp & Templin, Reference Rupp and Templin2008a). We will therefore modify the VBEM-M algorithm to estimate the $\text {Q}$ -matrix and the model parameters simultaneously. Third, although the VBEM-M algorithm converges quickly, it still runs more slowly than the EM algorithm. We plan to further optimize the code using C++ or Fortran to increase its speed.
Supplementary material
The supplementary material for this article can be found at https://doi.org/10.1017/psy.2024.7.
Data availability statement
Publicly available datasets were analyzed in this study. Both datasets can be obtained from the “CDM” R package: https://cran.r-project.org/web/packages/CDM/index.html.
Funding statement
This research was supported by the general projects of National Social Science Fund of China on Statistics (Grant No. 23BTJ067).
Competing interests
The author has no conflicts of interest to declare that are relevant to the content of this article.