Introduction
Reliability refers to the consistency of test takers’ performance across different conditions, ensuring that the assessment records on which judgments and inferences from test scores are based remain fundamentally consistent (Chapelle et al., Reference Chapelle, Enright and Jamieson2008). Variation in these conditions, such as the testing environment, test tasks, and test scoring, can introduce random measurement error and thereby reduce the reliability of test scores (Brown, Reference Brown2005). Reliability generalization (RG) is a meta-analytic approach for evaluating the variability of reliability coefficients and disentangling the sources of measurement error across studies (Vacha-Haase, Reference Vacha-Haase1998). An RG study estimates the overall score reliability for a given measure and examines reliability coefficients obtained from distinct samples across substantive studies with varying design characteristics to identify the strongest predictors of variability in those coefficients. Additionally, an RG study can identify the typical measurement conditions under which reliability coefficients fluctuate, which can help explain lower or higher score reliability (Yin & Fan, Reference Yin and Fan2000).
A growing body of RG studies has investigated the measurement error variances in diverse psychometric scales and inventories within the domains of psychology (e.g., Núñez-Núñez et al., Reference Núñez-Núñez, Rubio-Aparicio, Marín-Martínez, Sánchez-Meca, López-Pina and López-López2022), language learning and listening assessment (e.g., Aryadoust, Soo & Zhai, Reference Aryadoust, Soo and Zhai2023; Shang, Aryadoust & Hou, Reference Shang, Aryadoust and Hou2024; Zhai & Aryadoust, Reference Zhai and Aryadoust2024), and others (e.g., Hess, McNab & Basoglu, Reference Hess, McNab and Basoglu2014). In addition, a number of predictor variables, such as subject type, number of instrument items, and standard deviation (SD) of the total scores, have been identified in empirical studies across different domains, in alignment with the theoretical underpinnings of each field and the contextual intricacies of the individual studies (e.g., Aryadoust et al., Reference Aryadoust, Soo and Zhai2023; Hess et al., Reference Hess, McNab and Basoglu2014; Núñez-Núñez et al., Reference Núñez-Núñez, Rubio-Aparicio, Marín-Martínez, Sánchez-Meca, López-Pina and López-López2022). While there are several reliability meta-analysis studies of second language (L2) listening tests (e.g., Shang, Reference Shang2024) and of L2 research overall (Plonsky & Derrick, Reference Plonsky and Derrick2016), to our knowledge, no previous study has conducted a comprehensive meta-analysis specifically focusing on the reliability of L2 reading comprehension tests across diverse linguistic and cultural contexts. With regard to L2 assessment, reading comprehension has been regarded as a crucial factor in determining language learners’ proficiency, whether in a first, second, or foreign language (Taylor, Reference Taylor2013). For example, reading is viewed as a necessary skill for academic success (Grabe & Stoller, Reference Grabe and Stoller2020), as it allows students to comprehend and engage with complex texts across various subjects. Furthermore, strong reading comprehension skills are essential for effective communication and critical thinking (Snow, Reference Snow2002). Studies have found that fostering reading skills is crucial for developing higher-level cognitive abilities, including problem-solving and analytical thinking, which are essential for academic and professional success (Medranda-Morales, Mieles & Guevara, Reference Medranda-Morales, Mieles and Guevara2023). To ensure the robustness of assessments of this language skill, it is important to minimize measurement error and maximize the reliability of test scores. Therefore, in this study, we conduct a systematic investigation to identify and examine reliability and its predictors in L2 reading comprehension assessment tools. In the following section, we review the factors that predict reliability in L2 reading comprehension assessment tools.
Predictors of reliability in L2 reading comprehension assessment tools
Based on Weir’s (Reference Weir2005) model of reading test validation, potential sources of variance in test scores (Brown, Reference Brown2005), and recommended coding categories for meta-analyses in general (Lipsey & Wilson, Reference Lipsey and Wilson2001; Wilson, Reference Wilson, Cooper, Hedges and Valentine2019), we identified 21 potential predictor variables that might moderate the commonly reported coefficient alphas for L2 reading comprehension tests across studies. These predictors were organized into three categories: study-related variables, test taker–related variables, and test-related variables.
Study-related predictors
Study design
Based on previous research on the effect of research design on study quality (e.g., Hou & Aryadoust, Reference Hou and Aryadoust2021), it is hypothesized that coefficient alpha, serving as an indicator of study precision, is influenced by different research designs. Research design in the present study refers to the type of research approach, such as experimental or nonexperimental, that was specifically proposed as one of the crucial considerations for L2 research meta-analysis (Oswald & Plonsky, Reference Oswald and Plonsky2010).
Study context
Study context (e.g., English as a second language [ESL], English as a foreign language [EFL]) was included as another potential predictor, consistent with the assumption that L2 learning is contingent on the social and contextual variables of the setting where learning occurs (Gass et al., Reference Gass, Behney and Plonsky2013); including this variable is also a standard approach in earlier meta-analytical research focusing on reliability estimates (Watanabe & Koyama, Reference Watanabe and Koyama2008).
Test scores
Total scores achieved in L2 reading comprehension tests, along with the mean and SD of test scores, constituted the three additional predictors. Given the variance in total scores across different tests, standardization was necessary, requiring the transformation of raw scores into z-scores; this, in turn, required recording the mean and SD of the test scores alongside their total scores. These variables have been empirically investigated in several past RG studies (e.g., Núñez-Núñez et al., Reference Núñez-Núñez, Rubio-Aparicio, Marín-Martínez, Sánchez-Meca, López-Pina and López-López2022).
Sample size
Sample size has been identified as the most commonly used predictor variable in prior RG studies (e.g., Vacha-Haase & Thompson, Reference Vacha-Haase and Thompson2011). Reliability estimates could fluctuate in tandem with sample size; however, whether a larger sample exerts a positive or negative effect on reliability depends on the variability of the sample, for example, how homogeneous the test takers are in terms of language proficiency (Plonsky & Derrick, Reference Plonsky and Derrick2016).
Test taker–related predictors
Variation in reliability coefficients could be due to the characteristics of test participants in studies. Participant heterogeneity, especially in L2 assessment, may arise from individual attributes that are unrelated to language ability (O’Sullivan & Green, Reference O’Sullivan, Green and Taylor2011). Six test-taker characteristics were identified as predictors potentially contributing to reliability coefficient variability in L2 reading performance, including test takers’ age, gender, first language (L1), L2 proficiency level, English learning experience, and educational background.
Age and gender were empirically reported to influence learners’ reading comprehension levels. Comprehension abilities vary with age as children develop literacy skills, improving word-level decoding, vocabulary knowledge, and lexical representations (Peng et al., Reference Peng, Barnes, Wang, Wang, Li, Swanson, Dardick and Tao2018). Gender influences learners’ text preferences and engagement, affecting reading behavior and performance (Lepper, Stang & McElvany, Reference Lepper, Stang and McElvany2021). As for test takers’ L1, empirical evidence confirms that different language distances between L1 and L2 may cause divergent performance of L2 learners in reading assessments (e.g., Melby-Lervåg & Lervåg, Reference Melby-Lervåg and Lervåg2014). Other factors, such as cultural influences on comprehension processes (Verhoeven & Perfetti, Reference Verhoeven and Perfetti2017) and disparities in educational practices among different language communities (Zhu & Aryadoust, Reference Zhu and Aryadoust2020), may also result in variability in reading assessment outcomes. Additionally, test takers’ L2 proficiency level was observed to moderate L2 development and test performance in empirical studies. For example, learners with higher proficiency levels were observed to effectively employ reading strategies, resulting in better reading performance than those with lower proficiency levels (McGrath, Berggren & Mezek, Reference McGrath, Berggren and Mezek2016). Finally, L2 learners’ previous learning experience and educational background could shape their advanced literacy development, as these factors would determine how well L2 learners adjust to the educational practices and literacy expectations in higher-level L2 courses (Grabe & Yamashita, Reference Grabe and Yamashita2022).
Test-related predictors
Test type
Test type, referring to whether the test items come from standardized tests, from institutional tests, or are created by researchers or teachers, may be influenced by macrolevel factors, such as dominant paradigms in language testing and assessment culture, and by microlevel factors, including the test writer’s cultural background, previous experience, and so on (Shin, Reference Shin, Fulcher and Davidson2012). These factors would in turn determine whether the test items capture the specified language abilities of the test takers, and thus whether the obtained test scores precisely reflect the construct being assessed.
Test purpose
Test purpose involves whether the test is used to admit examinees into a university, place them into a class, evaluate their progress, or diagnose their difficulties in learning (Grabe & Yamashita, Reference Grabe and Yamashita2022). High-stakes testing and some achievement assessments involve decisions concerning examinees’ future opportunities and are therefore more constrained by concerns about reliability than low-stakes testing such as diagnostic assessment (Grabe & Yamashita, Reference Grabe and Yamashita2022).
Test piloting
Piloting involves trialing test items to ensure item validity and appropriateness of difficulty level for the target test takers (Fulcher, Reference Fulcher2013). Test piloting, as a key element in the test design phase, could help improve the validity and reliability of instruments before the actual administration (Grabowski & Oh, Reference Grabowski, Oh, Phakiti, De Costa, Plonsky and Starfield2018). Prior studies have identified a correlation between reliability estimates and piloting status, finding that instruments that were not reported as piloted demonstrated higher reliability than those that were piloted (Plonsky & Derrick, Reference Plonsky and Derrick2016; Sudina, Reference Sudina2021, Reference Sudina2023). Given these findings, piloting status of reading comprehension instruments was postulated as a predictor variable to further explore its impact on reliability.
Test time limit
Time constraints in reading assessments serve as a crucial factor in evaluating the development of automaticity and comprehension (Alderson, Reference Alderson2000), and constrained test time may affect test takers’ performance (Weir, Reference Weir2005). Regarding reading performance, empirical studies suggested that the time allocated for test taking can affect test takers’ ability to demonstrate their comprehension skills (Martina, Syafryadin, Rakhmanina & Juwita, Reference Martina, Syafryadin, Rakhmanina and Juwita2020). The impact of time constraints has also been investigated in previous meta-analyses of reliability, which suggested that time constraints are more strongly associated with interrater reliability than with internal consistency (e.g., Plonsky & Derrick, Reference Plonsky and Derrick2016). To further investigate the effect of time constraints on coefficient alpha, we included this variable as a potential predictor.
Test format
Test format, such as multiple choice and other response formats, has been discussed concerning its effects on predicting test takers’ reading performance in previous empirical research (e.g., Lim, Reference Lim2019) and meta-analytical studies (e.g., In’nami & Koizumi, Reference In’nami and Koizumi2009). Given that test format tends to affect test scores unpredictably, it has been conceived as a potential source of variance unrelated to the intended construct (Alderson, Clapham & Wall, Reference Alderson, Clapham and Wall1995). Test format is therefore postulated to influence the reliability of reading test scores, since it can affect the difficulty level of the test, which in turn affects the amount of measurement error in the scores.
Testing mode
Given that reading medium could affect the processing of the text (Alderson, Reference Alderson2000), the effects of paper-and-pen–based and computer-based tests on reading comprehension outcomes have been extensively discussed in empirical studies. Empirical evidence suggested that emergent digital reading in L2 is not simply a binary opposition to print reading but rather an extension of it, influenced by the characteristics of digital reading environments, tasks, and readers (Reiber-Kuijpers, Kral & Meijer, Reference Reiber-Kuijpers, Kral and Meijer2021). These extraneous factors would potentially affect L2 reading test performance when digital reading is applied in reading assessment. Additionally, other factors in testing mode were also observed to influence reading test outcomes, such as computer familiarity (Chan, Bax & Weir, Reference Chan, Bax and Weir2018), mode preference (Khoshsima, Hosseini & Toroujeni, Reference Khoshsima, Hosseini and Toroujeni2017), and digital reading habits (Støle, Mangen & Schwippert, Reference Støle, Mangen and Schwippert2020). Considering this, the testing mode was posited to be a potential source of measurement error, thus possibly affecting reliability of reading comprehension instruments.
Cognitive levels
According to Weir’s cognitive model of reading (Khalifa & Weir, Reference Khalifa and Weir2009), competent reading requires readers to engage in various cognitive skills; accordingly, reading tests should not solely assess knowledge of information but should also evaluate test takers’ mastery of the cognitive processes involved in applying that knowledge. Whether reading tasks tap into local-level or global-level comprehension involves different cognitive processes, thereby affecting reading performance (Liu, Reference Liu2021). If the cognitive processes targeted by reading tests are either too easy or too challenging for the test-taking sample, this misalignment can substantially reduce the reliability of the test scores by introducing additional sources of measurement error (Fulcher, Reference Fulcher2013; Jones, Reference Jones, Fulcher and Davidson2012).
Test length
Two variables, the number of test items and the length of the reading text in reading comprehension tests, have been posited as salient factors that affect reliability estimates of measurement (Fulcher, Reference Fulcher2013). An increase in the number of test items could produce a corresponding increase in reliability, as mathematically predicted by the Spearman-Brown formula and further confirmed by previous RG studies (e.g., Aryadoust et al., Reference Aryadoust, Soo and Zhai2023). Furthermore, text length, a proxy for reading load, could potentially explain reliability variation in reading comprehension measurement outcomes, as longer texts in reading passages impose more cognitive load on examinees and hence affect their reading performance (Green, Ünaldi & Weir, Reference Green, Ünaldi and Weir2010). In this study, text length was operationalized as the number of texts included in the reading comprehension tests.
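To make the expected direction of this relationship concrete, the Spearman-Brown prophecy formula, in its standard textbook form (not drawn from any of the primary studies), predicts the reliability of a lengthened test as follows:

```latex
% Spearman-Brown prophecy formula: predicted reliability \rho^{*} when a test
% with reliability r is lengthened by a factor of k using comparable items.
\rho^{*} = \frac{k\,r}{1 + (k - 1)\,r}
% Worked example: doubling a test (k = 2) with r = .70 gives
% \rho^{*} = (2 \times .70)/(1 + .70) = 1.40/1.70 \approx .82.
```

All else being equal, then, adding comparable items is expected to raise score reliability, which is the pattern probed in the moderator analysis reported below.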
The present study
The present study aims to conduct an RG meta-analysis to estimate the average reliability estimate from studies involving L2 reading comprehension tests and explore potential moderators explaining variation in the reliability coefficients. The following research questions are posed to fulfill the research purposes:
1. What is the average reliability coefficient of L2 reading comprehension tests?
2. What are the potential moderators of the reliability of L2 reading comprehension tests?
We note that meta-analytic studies integrate effect sizes from diverse research contexts, participants, and instruments, operating under the premise that the construct of interest remains consistent across various measurement tools (Lipsey, Reference Lipsey, Cooper, Hedges and Valentine2019). This approach enables a comprehensive assessment of overall effect size and construct reliability. In our study, we aim to apply this principle to reading comprehension tests, assuming that despite the variety of instruments, they all measure the reading comprehension construct, albeit in different contexts and with different test takers. We further note that instrument and test taker characteristics can potentially introduce construct-irrelevant variance and sources of measurement error, as highlighted by Purpura (Reference Purpura2004) and Messick (Reference Messick1995). These characteristics are critical in determining test score reliability, with Thompson (Reference Thompson1994) noting that participant homogeneity or heterogeneity significantly influences reliability outcomes. Recent RG studies, such as those by Sen (Reference Sen2022) and Núñez-Núñez et al. (Reference Núñez-Núñez, Rubio-Aparicio, Marín-Martínez, Sánchez-Meca, López-Pina and López-López2022), have further explored the impact of sample characteristic variables on reliability, underscoring the importance of considering these factors in meta-analytic research on reading comprehension assessments. Thus, our study will synthesize data from multiple research efforts, focusing on how different test taker characteristics influence the reliability of these assessments. This analysis will contribute to the broader discussion on the universal applicability of reading comprehension tests and offer some insights into their reliability across different test taker groups and conditions.
Method
Literature search
The literature search in the present study was conducted assuming that journals represent the major means of dissemination in L2 research rather than book chapters or other publications (Plonsky & Derrick, Reference Plonsky and Derrick2016). Additionally, this study aims to obtain a general overview of reliability reporting practices in L2 reading assessments; therefore, top-tier journals, as a parameter of study quality (Ada, Sharman & Balkundi, Reference Ada, Sharman and Balkundi2012), were selected to extract representative samples in this domain.
Accordingly, we restricted the literature search to 55 top-tier, peer-reviewed journals that publish research on applied linguistics and L2 learning, teaching, and assessment, drawing from Zakaria and Aryadoust’s (Reference Zakaria and Aryadoust2023) study (see Appendix for journal list). This exclusive focus, admittedly, might generate selection and publication bias or file-drawer problem (Field & Gillett, Reference Field and Gillett2010). To alleviate potential bias and maximize possible coverage in the pertinent domains, we also included six major reading journals (see Appendix) and studies included in meta-analyses related to L2 reading and reliability (In’nami et al., Reference In’nami, Hijikata and Koizumi2022; Shin, Reference Shin2020; Zhang & Zhang, Reference Zhang and Zhang2022) in the literature search.
We followed Preferred Reporting Items for Systematic reviews and Meta-Analyses (PRISMA) 2020 guidance (Page et al. Reference Page, McKenzie, Bossuyt, Boutron, Hoffmann, Mulrow, Shamseer, Tetzlaff and Moher2021) and database selection guidelines (In’nami & Koizumi, Reference In’nami and Koizumi2010) to extend the specified journal search to four databases germane to the field: Scopus, the Web of Science, Education Resources Information Center, and Linguistic and Language Behavior Abstracts.
We searched the aforementioned databases with a combination of the key terms “reading test” OR “reading assess*”, “reading comprehen*” OR “reading abilit*”, “second language”, “foreign language”, L2, and “bilingual*”, in light of Shin’s (Reference Shin2020) meta-analysis. No time range was set, but considering comprehensibility, the language was limited exclusively to English. The search procedure was conducted on March 5, 2024, yielding a total of 3,247 articles. After cross-checking titles and abstracts, we removed duplicate results (n = 1,569), publication news (n = 10), and studies not in English (n = 9), leaving 1,659 articles for further scrutiny.
Inclusion and exclusion criteria
We downloaded and reviewed the full text of the 1,659 articles in accordance with the following inclusion and exclusion criteria: (a) the study employed a quantitative methodology; (b) the study involved an L2 reading comprehension test at the passage level; (c) the study collected data from L2 learners; (d) the study encompassed English reading comprehension tests; and (e) the study reported reliability estimates and the sample size of test takers and included more than two items in the test to meet the basic statistical prerequisites.
After removing primary studies that did not meet the five inclusion and exclusion criteria above, as well as studies with duplicate data (n = 4) and unretrievable studies (n = 8) from conference proceedings, we thoroughly examined the full text of the 353 eligible articles with a focus on reliability estimates. A further 240 articles were excluded for the following reasons: (a) studies (n = 139) applied reliability indices other than Cronbach’s alpha; (b) studies (n = 10) reported only ranges of coefficient alpha; (c) studies (n = 11) reported coefficient alpha for the test battery rather than exclusively for the reading comprehension test in the battery; (d) studies (n = 68) provided inducted alpha estimates (i.e., alphas cited from previous administrations rather than computed from the study’s own data); and (e) studies (n = 12) reported coefficient alphas from the pilot test.
Finally, the remaining 113 articles, with 150 reliability coefficient reports, were retained for the subsequent coding and analysis (see Supplementary Material). The PRISMA flowchart (Moher et al., Reference Moher, Liberati, Tetzlaff, Altman and Group2009) in Figure 1 illustrates the literature search and data screening procedure.
The coding scheme
Based on the aforementioned theoretical framework comprising the 21 predictor variables, we designed a coding scheme to capture the characteristics of the primary studies and the potential moderators affecting the reliability coefficients of L2 reading comprehension tests. In developing the coding scheme, we piloted, revised, and refined the codes for the variables through an iterative process. The included primary studies were coded in line with the 21 identified potential predictor variables. Along with the alpha value and five study descriptors, a total of 27 variables for coding were organized into four categories: study-related variables, test taker–related variables, test-related variables, and observed reliability (see Table 1).
In addition, we also coded for the cognitive levels reported in the primary studies. However, 58 out of 150 (38.67%) reliability reports did not document the cognitive levels of the test tasks, which exceeded the acceptable threshold for missing values (10% of the data points) in the statistical analysis. Similarly, the variable time constraint was not reported in 39 out of 150 data points (26%), that is, whether a time limit was imposed on the reading comprehension test. To keep missing values below the 10% threshold and ensure more accurate statistical analysis, we ultimately excluded these two potential predictors.
Study characteristics
Overall, 150 Cronbach’s reliability coefficient reports from 113 primary studies were coded for the subsequent analyses. A majority of the studies were journal articles (n = 107), followed by unpublished dissertations (n = 5) and a book chapter (n = 1). Among the journals, Language Testing has published the largest number of studies (n = 20) on L2 reading comprehension assessment, followed by Reading and Writing (n = 9), Language Learning (n = 6), and System (n = 6). The total sample size across the studies (n = 113) applying Cronbach’s estimates was 70,292 participants. A large proportion of the primary studies (n = 65) involved participants from tertiary education, constituting 57.52% of the total corpus. In terms of research characteristics, only a small proportion of the studies (n = 26) adopted an experimental design, accounting for 23% of all the primary studies. Among the primary studies, 76.11% (n = 86) were conducted in an EFL context, with a majority of participants having Asian languages as their L1s, including Chinese (25.66%), Korean (15.93%), and Farsi (10.62%).
Intercoder reliability
Given the complexity of the data retrieval and coding tasks, a second researcher (a master’s student in applied linguistics) inspected the data extraction procedures and confirmed the precision of the data generation. Intercoder reliability was calculated for 24 randomly selected studies (21.24%), covering 33 (22%) of the reliability estimates in the dataset, and yielded an agreement rate of 96.78%. Discrepancies in intercoder agreement were discussed and resolved by consensus.
Statistical analyses
The statistical analyses include the following steps: (a) data preparation: evaluating normality of data and identifying potential outliers with standardized residuals and Cook’s distances; (b) fitting a multilevel meta-analytic model incorporating three components: sampling error variance, within-study variance, and between-study variance; (c) assessing heterogeneity across studies through Cochran’s Q test and I2 indices; (d) evaluating publication bias with Egger’s regression test and trim-and-fill method; and (e) conducting a moderator analysis with the R2 index to quantify the observed variances that can be explained by the moderator variables.
Data preparation
Following established meta-analytic practices (Lipsey & Wilson, Reference Lipsey and Wilson2001; Cooper, Hedges & Valentine, Reference Cooper, Hedges and Valentine2019), we began with data preparation by checking the normality of the Cronbach’s coefficients. Skewness and kurtosis values within ±2 and ±7, respectively, suggest a normal distribution of reliability coefficients in the dataset (Kline, Reference Kline2016). We also evaluated potential outliers among the Cronbach’s alphas with standardized residuals and Cook’s distances to ensure adequate model fitting. Standardized residuals of a reliability estimate exceeding the 100 × (1 – 0.05/[2 × k])th percentile of a standard normal distribution are considered outliers (Viechtbauer & Cheung, Reference Viechtbauer and Cheung2010). A Bonferroni correction was applied with a significance level of 0.05 (α = 0.05) for the Cronbach’s alphas in the model. In addition, a reliability coefficient with a Cook’s distance surpassing the median plus six times the interquartile range of the Cook’s distances was considered an overly influential data point (Viechtbauer & Cheung, Reference Viechtbauer and Cheung2010).
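As an illustration of these steps, a minimal sketch in R with the metafor package follows; the file name and column names (alpha, n_items, n) are hypothetical placeholders for the coded dataset rather than the authors’ actual variables.

```r
library(metafor)

# Hypothetical coded dataset with one row per reliability report
dat <- read.csv("coded_reliability_data.csv")

# Raw (untransformed) Cronbach's alphas and their sampling variances
dat <- escalc(measure = "ARAW", ai = alpha, mi = n_items, ni = n, data = dat)

# Simple random-effects fit used only for outlier screening
res <- rma(yi, vi, data = dat)

# Standardized residuals: flag values beyond the Bonferroni-adjusted cutoff,
# i.e., the 100 * (1 - 0.05 / (2 * k))th percentile of the standard normal
k      <- nrow(dat)
cutoff <- qnorm(1 - 0.05 / (2 * k))
which(abs(rstandard(res)$z) > cutoff)

# Cook's distances: flag values above the median plus six times the IQR
cd <- cooks.distance(res)
which(cd > median(cd) + 6 * IQR(cd))
```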
Model fitting
We fitted a three-level meta-analytical model, considering the nonindependence of Cronbach’s alphas in the selected studies and the fact that the included studies were conducted in different contexts with a variety of participants, both of which contribute to between-study variability (Borenstein, Hedges, Higgins & Rothstein, Reference Borenstein, Hedges, Higgins and Rothstein2009). Thus, the multilevel model in this study partitioned the variability in reliability coefficients into three components: sampling error variance at level one (participants), within-study variance at level two (reliability estimates), and between-study variance at level three (studies) (Assink & Wibbelink, Reference Assink and Wibbelink2016; Harrer, Cuijpers, Furukawa & Ebert, Reference Harrer, Cuijpers, Furukawa and Ebert2022). No transformation of the reliability coefficients was applied, in accordance with the recommendation of Thompson and Vacha-Haase (Reference Thompson and Vacha-Haase2000).
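A hedged sketch of such a three-level model in metafor is shown below; the nesting identifiers study_id and es_id (alphas nested within studies) are assumed column names, and test = "t" is used here as one way to request t- and F-type tests in the spirit of the adjustment described under the moderator analysis, not necessarily the authors’ exact specification.

```r
# Three-level model: sampling error (level 1), alphas within studies (level 2),
# and between-study differences (level 3); study_id / es_id are assumed IDs.
res3 <- rma.mv(yi, vi,
               random = ~ 1 | study_id / es_id,
               data   = dat,
               method = "REML",
               test   = "t")   # t- and F-type tests rather than z/chi-square
summary(res3)                  # pooled alpha, variance components, 95% CI
```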
Heterogeneity test
We scrutinized heterogeneity through Cochran’s Q test and I2 indices. A statistically significant p value (p < .05) in the Q test indicates that factors beyond sampling error contribute to the variability observed across the primary studies. I2 indices of approximately 25%, 50%, and 75% were considered low, moderate, and large heterogeneity, respectively (Higgins, Thompson & Deeks, Reference Higgins, Thompson and Deeks2003).
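For illustration, these heterogeneity statistics could be pulled from the fitted three-level model as sketched below; the variance-decomposition helper comes from the dmetar package described in Harrer et al.’s guide, and the object names carry over from the earlier sketches as assumptions.

```r
library(dmetar)

res3$QE                            # Cochran's Q test for residual heterogeneity
mlm.variance.distribution(res3)    # share of variance at levels 1, 2, and 3,
                                   # including a multilevel analog of I2
```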
Publication bias
To examine the potential publication bias, we applied Egger’s regression test (Sterne & Egger, Reference Sterne, Egger, Rothstein, Sutton and Borenstein2005), with a significant p value (p < .05) indicating potential publication bias in the included studies. The trim-and-fill method (Duval & Tweedie, Reference Duval and Tweedie2000) was also utilized to help estimate the impact of the potential missing studies on the meta-analysis due to potential publication bias.
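Both procedures operate on conventional single-level models in metafor, so an illustrative sketch might refit a standard random-effects model for this step; this is an assumed workflow rather than the authors’ actual script.

```r
# Egger's regression test and trim-and-fill on a single-level random-effects fit
res_uni <- rma(yi, vi, data = dat)

regtest(res_uni)          # Egger-type regression test: small p suggests asymmetry
tf <- trimfill(res_uni)   # imputes presumed missing studies
tf                        # adjusted pooled estimate after imputation
funnel(tf)                # funnel plot including the filled-in studies
```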
Moderator analysis
We conducted a moderator analysis to examine the degree to which the coded predictors explained the variation in the reliability coefficients. However, missing data on the predictors (Pigott, Reference Pigott, Cooper, Hedges and Valentine2019) and the presence of too many moderators in a meta-analysis (Baker et al., Reference Baker, White, Cappelleri, Kluger and Coleman2009) may result in a higher likelihood of a false-positive result. Thus, following Ihlenfeldt and Rios (Reference Ihlenfeldt and Rios2023), we included in the moderator analysis the 12 predictors with missing values in less than 10% of the reliability estimates. To assess the statistical significance of the moderator variables and explicate the residual heterogeneity, we used an improved F statistic (Knapp and Hartung, Reference Knapp and Hartung2003). The R2 index was employed to quantify the extent to which the observed variance was accounted for by the moderator variables.
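The sketch below illustrates one categorical and one continuous moderator model; the column names piloted and n_items are hypothetical, and the pseudo-R2 helper computes explained variance as the proportional reduction in the summed variance components relative to the intercept-only model, which is one common approximation rather than necessarily the computation used in this study.

```r
# Categorical moderator (test piloting) and continuous moderator (item count)
mod_pilot <- rma.mv(yi, vi, mods = ~ factor(piloted),
                    random = ~ 1 | study_id / es_id,
                    data = dat, method = "REML", test = "t")
mod_items <- rma.mv(yi, vi, mods = ~ n_items,
                    random = ~ 1 | study_id / es_id,
                    data = dat, method = "REML", test = "t")

# Pseudo-R2: % reduction in total variance components vs. the null model res3
pseudo_r2 <- function(null_fit, mod_fit) {
  100 * max(0, sum(null_fit$sigma2) - sum(mod_fit$sigma2)) / sum(null_fit$sigma2)
}

mod_pilot                      # omnibus F test for the piloting moderator
pseudo_r2(res3, mod_pilot)     # % of variance explained by piloting status
pseudo_r2(res3, mod_items)     # % of variance explained by number of items
```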
All statistical analyses were conducted with the metafor (Viechtbauer, Reference Viechtbauer2010) and dmetar packages in RStudio for macOS.
Results
The average reliability coefficient
Normality checks of Cronbach’s alphas (n = 150) indicated a normal distribution (skewness = –0.54, kurtosis = –0.16, M = 0.79, SD = 0.09). In the fitting of the three-level meta-analytical model, no outlier was identified by the examination of the standardized residuals, as none of the coefficient values exceeded ±3.43, the threshold for a standard normal distribution. However, Cook’s distances detected one outlier (Choi, Kim & Boo, Reference Choi, Kim and Boo2003) among the Cronbach’s alphas across studies. A sensitivity analysis was performed to assess the impact of this outlier. The results showed slight differences in the Akaike Information Criterion (AIC) model fit indices (AIC with the outlier = –313.89; AIC without the outlier = –321.97) and negligible variation in the mean (M) reliability estimates and confidence intervals (with the outlier: M = 0.794, standard error [SE] = 0.008, 95% CI [0.776, 0.809], p < .001; without the outlier: M = 0.793, SE = 0.008, 95% CI [0.777, 0.810], p < .001). These results suggest a marginal impact of the identified outlier on the model fitting (Aguinis, Gottfredson & Joo, Reference Aguinis, Gottfredson and Joo2013). Therefore, the raw data were retained for the subsequent analyses without removing the outlier.
A total of 150 Cronbach’s coefficients were included in the calculation of the average reliability coefficients, with observed Cronbach’s alphas ranging from 0.51 to 0.98. The bubble plot in Figure 2 demonstrates the estimated average Cronbach’s value, represented by the red indented line in the upper quadrants of the plot, which is equal to μ = 0.79 (95% CI [0.78, 0.81]). The values of 62 (41.33%) out of 150 reliability coefficients were below the lower bound of the CI.
Heterogeneity test
The results of a heterogeneity test (Cochran’s Q test, τ2, and I2) of Cronbach’s alphas suggested significant heterogeneity across studies (Q [149] = 9200.69, p < .001, τ2 = 0.01, I2 = 98.98%), with 95% prediction interval ranging from 0.65 to 0.95. This result was confirmed by a graphical display of study heterogeneity (GOSH) plot with a normal and symmetric distribution of Cronbach’s coefficients at the top and the deviating dotted graph of the I2 indices below, as displayed in Figure 3. The vertical histogram on the right represents the distribution of the I2 indices, which is skewed with many of the indices falling roughly above 95%, indicating a large proportion of heterogeneity across reliability coefficients.
To investigate the sources of heterogeneity, the three levels of variance in the meta-analytical model were evaluated individually. Figure 4 presents the total variance distribution across the three levels. The variance in the first level, representing sampling error, constitutes a relatively small proportion of approximately 3.33% of the total variance. A larger amount of heterogeneity variance within studies is observed at level 2, accounting for approximately 23%. The most substantial proportion is found at level 3, where between-study heterogeneity accounts for approximately 73.67% of the overall variation. In the presence of conspicuous between-study variation in the reliability coefficients, it is important to investigate moderators to elucidate potential causes of the variation (Baker et al., Reference Baker, White, Cappelleri, Kluger and Coleman2009).
Publication bias
The trim-and-fill funnel plot in Figure 5 reveals an asymmetric dispersion of Cronbach’s coefficients, with a majority of the coefficients located in the upper section of the funnel and some sparsely scattered at the bottom left, suggestive of potential publication bias. Egger’s test substantiated this result with a statistically significant p value (z = –9.32, p < .001). Nonetheless, the trim-and-fill method did not indicate a substantial impact on the meta-analysis: its results suggest a possible absence of four studies which, if added, would adjust the mean Cronbach’s coefficient from 0.794 to 0.799 and increase the variance component from 98.98% to 99.06%.
Moderator analysis
To assess which predictors of interest might moderate the reliability coefficients of reading comprehension tests, we conducted a moderator analysis for 10 categorical predictor variables and 2 continuous predictor variables. Table 2 presents the results of the omnibus test and post hoc tests, where each category is compared against the anchored category, which is the intercept in this dataset. Overall, 5 out of 12 variables were found to be significant predictors of heterogeneity in the Cronbach’s alpha coefficients, explaining 31.53% of the variance observed. Of the 10 categorical moderators, 4 variables (test piloting, study design, test takers’ educational institution, and testing mode) were found to have a significant moderating effect, together explaining 14.77% of the variance observed. Specifically, test piloting was observed to be a statistically significant moderator of the reliability estimates (F [1, 148] = 11.87, p < .001), with an R2 value of 5.92%, indicating that 5.92% of the between-study variation could be attributed to whether a pilot test was conducted prior to the operational administration. Test takers’ educational institution also showed a statistically significant moderating effect (F [3, 145] = 9.65, p = .02), accounting for 4.91% of the variance between studies, which could be ascribed to the different educational backgrounds of test takers. Study design appeared to be another statistically prominent moderator (F [1, 148] = 5.68, p = .02), evidenced by an R2 value of 2.58%, suggesting that 2.58% of the between-study variance could be explained by whether the primary study adopted an experimental or nonexperimental design. The variable testing mode also emerged as a statistically noticeable moderator (F [1, 148] = 3.98, p = .05), accounting for 1.36% of the variance among alpha values; that is, 1.36% of the variance between studies could be attributed to whether the test was computer based or paper based. The other six categorical variables, including study context, text length, test type, test purpose, test format, and test takers’ L1, had weak or no obvious moderating effects.
Note: MC = multiple choice; T/F = true/false.
* p < .05
† p < .001.
We further performed a metaregression for two continuous predictors, sample size and the number of test items, to evaluate their moderating effect on the reliability coefficient outcomes. The sample size variable was observed to have no significant effect on the coefficients (F [1] = 1.36, β = 0, SE = 0, p = .24), which is confirmed by the R2 value (R2 = 0), while the number of test items displayed a noticeable moderating effect (F [1] = 26.23, p < .001), with a relatively small regression coefficient (β = 0.003, SE = 0.001) (see Figure 6). This suggests that there is a positive association between the reliability coefficient outcomes and the total number of test items. Overall, 16.76% of between-study variability could be accounted for by the number of test items, as indicated by the R2 value (R2 = 16.76%).
Discussion
Average reliability and heterogeneity
The first objective of the present study was to investigate the average reliability coefficients obtained in empirical L2 reading comprehension studies. We obtained an average Cronbach’s alpha of 0.79 (95% CI [0.78, 0.81]), which is lower than 0.82 reported in a meta-analysis of all reliability coefficient indices in L2 research based on a larger body of 1,112 Cronbach’s alphas (Plonsky & Derrick, Reference Plonsky and Derrick2016). While this average reliability value is regarded as satisfactory for ability tests (Field, Reference Field2018), within the domain of language assessment and educational research, this value would be considered fairly moderate, according to more stringent benchmarks (Brown, Reference Brown and Kunnan2014; Taber, Reference Taber2018). It should be noted that low-reliability estimates can result in reduced statistical power and attenuated effect size in substantive research, which would undermine the accuracy and robustness of the research findings (Oswald & Plonsky, Reference Oswald and Plonsky2010; Plonsky, Reference Plonsky2013).
It is also noteworthy that 62 out of 150 coefficient alpha values were below the lower bound of the CI, which might attenuate the overall internal consistency of L2 reading comprehension tests included in this RG meta-analysis. Specifically, the lower reliability in these 62 reports may indicate potential issues in the measurement tools or methods they adopted, such as poor item construction or misalignment with the intended construct (Meyer, Reference Meyer2010). The reduced reliability of these studies might stem from factors like the tools’ multidimensionality, the clarity and relevance of measurement items, and whether the instrument comprehensively covers the construct (Jones, Reference Jones, Fulcher and Davidson2012). To facilitate meta-analytical examinations of the potential causes of low reliability, it is suggested that future authors should include the tests and raw data in their publications.
Relatedly, it was found that the alpha coefficients included in the present study were significantly nonuniform. The result of the heterogeneity test exhibited a large amount of variability among Cronbach’s coefficients (I2 = 98.98%), in which study-level variances accounted for a significant share (73.67%). Given the variation in testing contexts, testing conditions, and participants, it is not surprising that the observed reliability coefficients align with this variability and depend on contextual factors, population characteristics, and other study-specific variables. This, therefore, provides further evidence showing that reliability estimates like Cronbach’s alpha are highly susceptible to the specific characteristics and conditions of each study (Zakariya, Reference Zakariya2022).
Moderators of reliability coefficients
To answer the second research question, a moderator analysis was conducted on 10 categorical predictor variables and 2 continuous predictor variables. Five variables (the number of test items, test piloting, test takers’ educational institution, study design, and testing mode) were found to have a significant moderating effect on coefficient alpha values. Two other variables, study context and text length, appeared to marginally contribute to the explained between-study variability of coefficient alphas, notwithstanding their lack of statistical significance. The remaining five variables—test takers’ L1, test type, test purpose, test format, and sample size—had no significant effects on coefficient alphas.
The number of test items emerges as a significant moderator for reliability estimates (F [1] = 26.23, p < .001), explaining the largest amount of variance in coefficient alphas (R2 = 16.76%) across the included primary studies. As expected, the results reveal a positive association between the reliability coefficient outcomes and the total number of test items. This finding is congruent with the prediction of the coefficient alpha formula, suggesting that all other things held constant, an increase in the number of test items will increase reliability coefficient (Taber, Reference Taber2018; Zhai & Aryadoust, Reference Zhai and Aryadoust2024). In the domain of language assessment, as long as the stochastic independence of items (i.e., the independence between-item responses) holds, the increase of test items would increase the reliability of measurements, as each test item contributes to the information about the test taker’s ability (Fulcher, Reference Fulcher2013). Previous RG studies for psychometric scales or questionnaires have also found a positive relationship between reliability coefficients and the number of items (e.g., Aryadoust et al., Reference Aryadoust, Soo and Zhai2023; Sen, Reference Sen2022). The present study shows that in the context of reading assessment, the alpha coefficient is partially dependent on the number of test items.
Test piloting is another significant moderator of reliability coefficients (R2 = 5.92%, p < .001), accounting for a small amount of heterogeneity between studies. Notably, reading instruments with pilot testing exhibited a negative coefficient (slope) compared to instruments without test piloting reports (the reference group) (see Table 2), indicating that reading instruments that underwent a pilot test tend to have lower reliability estimates compared to those that did not undergo a pilot test. This finding, though surprising, aligns with the findings of previous studies (Plonsky & Derrick, Reference Plonsky and Derrick2016; Sudina, Reference Sudina2021, Reference Sudina2023). Plonsky and Derrick (Reference Plonsky and Derrick2016) explained that researchers who did not report piloting their instruments tended to choose reliability coefficients with higher median estimates. Sudina (Reference Sudina2023), focusing on Cronbach’s alpha, attributed the lower reliability of piloted instruments to limited transparent reporting, where researchers possibly failed to report their pilot testing. To investigate this further, we reexamined our data and found that only 2 out of 25 studies documented alpha values for both the pilot and final tests. Most of the studies reported piloting to confirm the appropriateness of reading materials, test procedures, and test time. These, admittedly, are valid purposes for test piloting (Mackey & Gass, Reference Mackey and Gass2021), but most researchers seem to overlook important statistical analyses in piloting, particularly internal consistency. Internal consistency analysis could help researchers ensure the validity of instrument items and identify ways to improve items if necessary (Green, Reference Green, Winke and Brunfaut2020). To avoid the potential pitfalls of low reliability, it is imperative to pilot the instrument in advance of the actual administration (Grabowski & Oh, Reference Grabowski, Oh, Phakiti, De Costa, Plonsky and Starfield2018), but equally important to report the reliability statistics of both pilot and main study test scores.
Test takers’ educational institution also demonstrated a significant moderating effect on Cronbach’s alphas (R2 = 4.91%, p = .02), where the intercept "language institute" and the category "primary school" were statistically significant. The moderating effect of "language institute" arguably stems from the heterogeneity of test takers who come from diverse backgrounds and are at different proficiency levels. This finding is supported by the literature indicating that reliability can differ when the same instrument is administered to participants of a heterogeneous or homogeneous nature (Thompson & Vacha-Haase, Reference Thompson and Vacha-Haase2000; Yin & Fan, Reference Yin and Fan2000). Similarly, the reliability of test scores for primary school students may be affected by various extraneous factors, including test takers’ physiological and psychological variations, test methods, and test design (Papp, Reference Papp, Garton and Copland2019). This was evidenced by the finding that the "primary school" category showed a decrease in coefficient alpha compared to the baseline category (language institute). Further examination of the data in our study revealed that 6 out of 22 reports documented reliability estimates below 0.70.
Study design was also found to moderate coefficient alpha values (R2 = 2.58%, p = .02). The reliability estimates of studies with a nonexperimental design, on average, were significantly higher than those of the experimental design group (see Table 2). Further examination of the data revealed that most of the studies in the experimental group (72%) adopted a quasi-experimental approach without a pretest, control group, or random assignment. A quasi-experimental approach, though useful, might introduce confounding noise, potentially lowering reliability estimates (Gliner, Reference Gliner, Morgan and Leech2017). It is noteworthy that an even larger proportion of the included studies utilized nonexperimental designs, predominantly with intact groups, which would potentially compromise the methodological rigor of the studies (Plonsky & Gass, Reference Plonsky and Gass2011). This practice has been identified as normative in L2 research, possibly due to the prevalence of classroom-based research, where logistical challenges or ethical problems may exist in abundance (Plonsky, Reference Plonsky2013).
Testing mode also appeared to be a statistically significant moderator of Cronbach’s coefficients (R2 = 1.36%, p = .05), where computer-based tests—when compared with the intercept—appeared to be more likely to moderate coefficient alphas than paper-based tests. Previous studies have shown that digitally delivered information can result in subtle changes in test takers’ reading behavior, including the time taken to complete tasks, patterns of eye movement, and self-evaluation of performance (Pengelley, Whipp & Rovis-Hermann, Reference Pengelley, Whipp and Rovis-Hermann2023). These changes could possibly lead to inconsistency of test scores under on-screen testing conditions. This finding further underscores the importance of test piloting, especially when introducing a new testing mode.
Two other variables, study context and text length, explained a small proportion of between-study variability of coefficient alphas, although their moderating effect on coefficient alphas was not statistically significant. Regarding the study context (R2 = 1.27%, p = .17), although not generally significant, the EFL condition had a greater likelihood of influencing coefficient alphas compared to ESL condition across the studies. This distinction might be attributed to variations in language performance due to differing learning contexts. L2 learning contexts, whether in an EFL or ESL context, would influence the quality and quantity of learners’ input, output, and interactions, thereby generating distinctive L2 developmental patterns and skills (Yu, Janse & Schoonen, Reference Yu, Janse and Schoonen2021). In addition, L2 reading performance may also be contingent on the miscellaneous effect of social influences and individual differences (Prater, Reference Prater, Israel and Duffy2009). Thus, the reading performance of L2 learners in an EFL context might exhibit greater variability when compared with their counterparts in an ESL context, but the reliability of test outcomes may not be affected by study context, as suggested by the evidence in this study.
Text length was found to explain a minimal and statistically nonsignificant amount of between-study variability (R2 = .71%, p = .17), where instruments with more than one text seemed to be more inclined to moderate coefficient alphas. Evidence from prior studies suggested that longer texts can affect test takers’ cognitive processing (Green et al., Reference Green, Ünaldi and Weir2010) and increase test takers’ unintentional disengagement (Forrin et al., Reference Forrin, Mills, D’Mello, Risko, Smilek and Seli2021), potentially resulting in variances in test scores. However, the results of this study suggest that the number of texts included in the L2 reading tests does not directly impact the reliability of test scores. Indeed, other text-related factors, such as text readability, text imageability, text genre and intertext relationship, merit attention regarding their relationship with the reliability of test outcomes. We intended to explore these factors initially but were hindered by inconsistent evaluation tools and reporting practices or insufficient reporting of text features in the primary studies. Therefore, we advocate for future research to include these text features to enhance replicability of investigations and enable exploration of their interactions.
The remaining five variables—test takers’ L1, test format, test type, test purpose, and sample size—had no discernible moderating effect on coefficient alphas and did not account for the variance in coefficient alphas across studies. Test takers’ L1 (R2 = 0, p = .91) was not found to explain the variability in coefficient alpha values. Upon finding this result, we further examined the L1s and associated alpha values reported in the primary studies, particularly L1–L2 language and script distance, to identify any possible patterns not captured by the omnibus test. However, no noticeable pattern emerged, suggesting that test takers’ L1 is not likely to contribute to the reliability of test scores.
Test format was hypothesized to potentially affect reliability estimates, considering that reading questions presented in different formats may assess distinct componential reading skills (Lim, Reference Lim2019). However, the results of the study (R2 = 0, p = .83) did not demonstrate a moderating effect. This might be explained by the absence of identified test format effects in L2 reading research (In’nami & Koizumi, Reference In’nami and Koizumi2009). While varied test formats can lead to different item-response processing, as long as the test is "valid," the extra cognitive processes elicited by the test methods are unrelated to the construct being measured and thus would not significantly affect the variance in test scores (Lim, Reference Lim2019). The findings of this study indicate that the test format of reading instruments is unlikely to have a strong effect on the reliability of test outcomes.
Test type (R2 = 0, p = .44) and test purpose (R2 = 0, p = .50), two interconnected variables, did not appear to account for variation in coefficient alphas. It was hypothesized that standardized tests might moderate reliability coefficients differently compared to classroom-based tests, as the former encapsulate a range of component reading abilities and include a variety of tasks tailored to participants at varying proficiency levels (Grabe, Reference Grabe2009). Similarly, high-stakes tests and some achievement assessments, which influence examinees’ future opportunities, were posited to be more constrained by reliability concerns than low-stakes tests like diagnostic assessments (Grabe & Yamashita, Reference Grabe and Yamashita2022). However, the results did not substantiate these postulations. This might be because tests, whether standardized or classroom based, high stakes or low stakes, exhibit comparable levels of consistency in assessing reading abilities. This finding suggests that the diversity of reading tasks in standardized tests and high-stakes tests may not necessarily result in higher reliability compared to classroom-based tests and low-stakes tests, as reflected in the data analyzed for this study.
Finally, sample size, albeit a crucial concern in study design, did not appear to affect coefficient alphas (F [1] = 1.36, β = 0, p = .24). This outcome may be explained by the formula for coefficient alpha, $\alpha = \frac{N\bar{c}}{\bar{v} + (N-1)\bar{c}}$, where N is the number of items, $\bar{c}$ is the average inter-item covariance, and $\bar{v}$ is the average item variance; sample size does not appear in the formula and therefore does not directly affect the magnitude of coefficient alpha (Peterson, Reference Peterson1994). This observed lack of substantive correlation between sample size and coefficient alpha resonates with the findings from prior studies (e.g., Aryadoust et al., Reference Aryadoust, Soo and Zhai2023; Sen, Reference Sen2022).
Limitations and future studies
Although the present study contributes to theoretical and methodological advances in language assessment, it is subject to several limitations. An arguable limitation is the exclusion of unpublished studies and other studies from the unselected journals in this RG study. This involves a classic trade-off in meta-analysis: prioritizing inclusiveness or study quality. The present study adopts a quality-first approach, given that the reliability coefficient is correlated with the quality of the primary study (for detailed rationales, see Norris & Ortega, Reference Norris, Ortega, Norris and Ortega2006; Oswald & Plonsky, Reference Oswald and Plonsky2010). Future research could consider incorporating these unpublished or unselected studies into the RG study and compare the outcomes with those obtained in the present study.
Another potential limitation is the omission of some potential predictor variables due to insufficient reporting in the primary studies. These include test scores, test takers’ age, gender, proficiency level, time constraints, and some text-specific characteristics, such as cognitive levels, which had to be excluded as a result of missing data in the primary studies. Specifically, we tried to code for the variables of cognitive levels and time constraints but could not include them in the final computational analysis because of too many missing values. The exclusion of these predictor variables could lead to a less comprehensive understanding of the relationship between reliability and its potential influencing factors. Future authors are urged to provide demographics of test takers and full texts of the reading instruments to promote transparency and enhance the potential for prospective meta-analysis research. The scope of this study is also limited by the use of Cronbach’s coefficients as the sole measure of reliability. While Cronbach’s alpha is the most commonly used reliability index in our data pool, it is worth noting that coefficient alpha has faced criticism regarding the frequent violation of its underlying assumptions (e.g., Kline, Reference Kline2016; Sijtsma & Pfadt, Reference Sijtsma and Pfadt2021; Teo & Fan, Reference Teo and Fan2013). Alternative reliability coefficients are recommended in lieu of coefficient alpha, such as coefficient omega, coefficient theta, coefficient H, and the greatest lower bound (for details, see McNeish, Reference McNeish2018; Teo & Fan, Reference Teo and Fan2013). We suggest that future studies in L2 reading assessment select alternative reliability coefficients based on their research objectives and data characteristics.
Conclusion
The present study determined the average reliability of L2 reading assessments’ Cronbach’s coefficients and recognized the number of test items, test piloting, test takers’ educational institution, study design, and testing mode as potential moderators explaining 31.53% of variance in the reliability coefficients of L2 reading comprehension tests across the studies.
The present study has important implications for researchers and practitioners in L2 reading assessment in terms of theoretical understanding of reliability and validity as well as empirical research design and test development. First, applied researchers are encouraged to assess and report reliability estimates for each application of a given test and tailor their research design to maximize score reliability of L2 reading assessments. Relatedly, as reliability can be affected by the quality of the research, instrument dimensionality, item characteristics, alignment with the intended construct, and coverage of the intended construct, it is advisable for future researchers to provide this information to improve transparency in L2 reading assessment research. Incorporating these variables into moderator analyses in future meta-analytic studies will further lead to a more thorough understanding of reliability in reading assessment. Second, given the limitations of Cronbach’s alpha, researchers could consider using alternative reliability coefficients to achieve more precise measurement outcomes in reading assessment. Third, in test development, it is important to consider the number of test items, test piloting, test takers’ educational institution, study design, and testing mode when devising L2 reading assessments. A well-balanced number of items ensures that the test covers the reading skills being assessed, without causing fatigue or disengagement or without hampering the precision of the test. Additionally, piloting the test with a representative sample can help identify potential issues in the test items to ensure validity and reliability of the final version of the test. Careful attention to the testing mode, whether digital or print, is also crucial, as it can influence test takers’ reading assessment outcomes. Finally, and importantly, increased training, along with rigorous standards, is essential for L2 researchers, teachers, and test developers regarding the understanding and application of reliability. Our study provides further evidence for insufficient reporting and knowledge concerning psychometric features of instruments among L2 researchers and practitioners, as discussed by other researchers (e.g., Plonsky and Derrick, Reference Plonsky and Derrick2016). Therefore, we also advocate for comprehensive and systematic training programs by researcher trainers for L2 researchers, teachers, and test developers with respect to instrumentation knowledge. Additionally, more stringent requirements, particularly concerning reliability estimates, by journal reviewers and editors may also possibly foster enhanced practices among L2 researchers. It is hoped that this study will contribute to the existing knowledge in the field, inspire further research, and provide insights for future applications.
Supplementary material
The supplementary material for this article can be found at http://doi.org/10.1017/S0272263124000627.
Data availability statement
The experiment in this article earned Open Materials badges for transparent practices. The data used in this study and the supplemental materials are available from the following link: https://osf.io/2p6vd/.
Acknowledgments
We would like to express our gratitude to the Singapore Ministry of Education (MOE) and the Chinese Scholarship Council for providing the first author with an MOE Scholarship (August 2022 intake) to study for a master of arts degree in applied linguistics at the National Institute of Education, Nanyang Technological University. We also extend our thanks to the editor and anonymous reviewers for their insightful comments and valuable suggestions. Additionally, we are grateful to the Sichuan International Studies University Academic Research Fund (grant ID: SISU202414) for partially supporting this research. Some parts of the text have been revised using artificial intelligence to improve clarity.
Appendix
Journal list (* indicates newly added reading journals)
1. AILA Review
2. Applied Linguistics
3. Applied Linguistics Review
4. Asian ESP Journal
5. Assessing Writing
6. Bilingual Research Journal
7. Bilingualism: Language and Cognition
8. CALICO Journal
9. Canadian Modern Language Review
10. Chinese Journal of Applied Linguistics
11. Computer Assisted Language Learning
12. ELT Journal
13. English for Specific Purposes
14. English Teaching
15. Foreign Language Annals
16. Innovation in Language Learning and Teaching
17. International Journal of Applied Linguistics
18. International Journal of Bilingual Education and Bilingualism
19. International Journal of Bilingualism
20. International Journal of Multilingualism
21. International Multilingual Research Journal
22. IRAL - International Review of Applied Linguistics in Language Teaching
23. Iranian Journal of Language Teaching Research
24. ITL - International Journal of Applied Linguistics (Belgium)#
25. JALT CALL Journal
26. Journal of Asia TEFL
27. Journal of English for Academic Purposes
28. Journal of Multilingual and Multicultural Development
29. Journal of Reading Behavior*
30. Journal of Research in Reading*
31. Journal of Second Language Writing
32. Language Acquisition
33. Language Assessment Quarterly
34. Language Awareness
35. Language Learning
36. Language Learning and Technology
37. Language Learning Journal
38. Language Teaching
39. Language Teaching Research
40. Language Testing
41. Language Testing in Asia
42. Linguistic Approaches to Bilingualism
43. Modern Language Journal
44. Reading and Writing
45. Reading and Writing Quarterly
46. Reading Psychology*
47. Reading Research Quarterly*
48. ReCALL
49. RELC Journal
50. Research in the Teaching of English
51. Scientific Studies of Reading*
52. Second Language Research
53. Studies in Second Language Acquisition
54. Studies in Second Language Learning and Teaching
55. Study Abroad Research in Second Language Acquisition and International Education
56. System
57. Teaching English with Technology
58. TESOL International Journal
59. TESOL Journal
60. TESOL Quarterly
61. The Reading Teacher*