1. Introduction
Data-driven learning (DDL) involves researcher-like inductive explorations of language use, and was described by Johns (1991) as “the attempt to cut out the middleman as far as possible and to give the learner direct access to the data” (p. 30). It has been a long-standing focus of language learning research and has been attested to be useful in guiding language learners to “explore language corpora and come to their own conclusions” (Boulton, 2011: 575). By providing second language (L2) learners with a large amount of “naturally-occurring language” (Boulton, 2009a: 37), the DDL approach entails a range of activities where learners are not taught by traditional, often teacher-centred, deductive approaches, but are encouraged to explore corpus data independently and identify patterns of language use. This can enable learners to discover language patterns through authentic language data (Boulton & Cobb, 2017), which can effectively enhance students’ learning motivation, engagement and autonomy (Gilquin & Granger, 2010). A number of review studies on DDL research have been conducted, which have contributed valuable insights into the development of DDL studies. Nevertheless, the high-impact publications, main research venues and developmental stages in the DDL field remain to be explored.
With a view to addressing these omissions, this study employs bibliometric analysis to map the studies on DDL over the period 1994–2021 in terms of the common research themes, high-impact publications, main research venues, the developmental stages, and the latest transformative publications in this field. Bibliometric analysis is a statistical analysis of datasets comprising literature published within a specific time period (Pritchard & Wittig, 1981). The following research questions (RQs) guide our analysis:
1. What are the common research themes, high-impact publications and main research venues in the DDL field?
2. Using Shneider’s (2009) model of evolutionary stages, what developmental stages can be identified in the DDL field over time?
3. What publications can be identified by structural variation analysis (SVA) as having potentially high impact?
2. Literature review
2.1 Reviews in DDL
Numerous studies have endeavoured to synthesize DDL research and have drawn attention to specific topics in the field. For instance, Chambers (2007) investigated 12 empirical studies from the 1990s onwards to explore learners’ corpus consultation and stressed the importance of evidence for assessing the effectiveness of DDL. Boulton (2008) analysed 39 DDL papers and identified a primary focus on learners’ interactions with and attitudes toward DDL. Boulton (2010b) further surveyed 27 empirical research studies on students’ learning outcomes and found a scarcity of research investigating variables such as learners’ motivation and attitudes. Also, Boulton and Tyne (2013), in their critical review, pointed out the need for classroom practice and collaboration between researchers and practical instructors. From a chronological view, Boulton (2017) offered a research timeline throughout the existence of DDL.
Another notable focus has been on a particular aspect of DDL. Yoon (2011) examined the use of DDL in writing classes and concluded that concordancing exercises are useful for L2 writers. Boulton (2012) reviewed 20 empirical studies of corpus use in English for specific purposes (ESP) and found that corpora can be used as effective learning tools and reference resources. In a review of 18 empirical studies of DDL in L2 writing, Luo and Zhou (2017) identified the great potential of DDL activities in L2 writing classes, but they also found that the use of corpora was not superior to traditional tools when used as a reference tool. Chen and Flowerdew (2018) synthesized 37 empirical studies in academic writing in terms of their main application and called for more studies to expand this field. The third strand of systematic reviews involves the construction of a corpus comprising DDL studies and the identification of common themes by using software or corpus-based analysis. Pérez-Paredes (2022) examined the utilisation of DDL by compiling journal articles from 2011 to 2015 into a corpus, and found that the topics of syllabus integration and teacher training are rarely discussed in DDL. In the latest study, Boulton and Vyatkina (2021) conducted a large-scale and systematic corpus-based analysis of DDL studies and identified the publication scope, research themes, and future directions of DDL research.
Meta-analyses of DDL research have also been undertaken. For instance, Mizumoto and Chujo’s (2015) examination of the effectiveness of DDL for learning lexico-grammatical items provided support for the use of DDL for vocabulary acquisition. The meta-analyses by Boulton and Cobb (2017) and Lee, Warschauer and Lee (2019) were identified as prominent publications in our bibliometric analysis and are discussed in detail in Section 4.4.
2.2 Bibliometrics
Bibliometrics is conceptualised as “the application of mathematical and statistical methods to books and other media of communication” (Pritchard, 1969: 348). This quantitative approach employs bibliometric data from scientific databases such as Web of Science (WoS) and Scopus and has been used to identify research networks, research themes and research trends (Lei & Liu, 2019). The use of scientific databases enables a comprehensive, structured and balanced coverage of literature (Birkle, Pendlebury, Schnell & Adams, 2020) by employing inclusive bibliometric data (e.g. number of publications and citations, occurrences of keywords, and references). This makes it possible to evaluate the impact of journals, authors, and publications and the productivity of institutions (Lei & Liu, 2019). More recent bibliometric analyses have used customised software such as CiteSpace and VOSviewer to construct and visualise bibliometric maps. CiteSpace is an information visualisation tool that can analyse and visualise trends and patterns in a field, and can facilitate various analyses, such as co-citation analysis, SVA, and collaboration networks (Chen, 2012).
Previous enquiries in this line have applied bibliometric analysis to map out the development of research fields, such as applied linguistics or computer-assisted language learning (Chen, Zou, Xie & Su, 2021; Jung, 2005; Liu & Zhang, 2021), L2 vocabulary acquisition (Meara, 2012), corpus linguistics (Park & Nam, 2017), multilingualism (Lin & Lei, 2020), English for academic purposes (EAP) (Hyland & Jiang, 2021a), and ESP (Hyland & Jiang, 2021b; Liu & Hu, 2021). A recent application of this approach in DDL is He and Wei (2019), who investigated the role of corpora in EAP research from 2009 to 2018.
2.3 Evolutionary model
Shneider’s (2009) four-stage evolutionary model was employed in this study to trace the development of DDL research. According to Shneider (2009), the evolution of a scientific discipline can be mapped into four stages. The first stage is primarily concerned with introducing language (i.e. terms and concepts) to a field. The second stage tends to display a primary focus on the principal techniques and tools. The third stage focuses on broadening the existing focus of interest to new areas. Research at stage four typically involves codifying knowledge through reflective reviews, meta-analyses or textbook publications. Shneider’s (2009) four-stage model has been employed in bibliometric analyses of various disciplines, including information science, engineering, and ESP (e.g. Chen, 2017; Liu & Hu, 2021). The review by Liu and Hu (2021) revealed three evolutionary stages of ESP, namely the “initial conceptualising stage” (1970s–1990s), “the maturing stage” (1990s–2000s), and “the flourishing stage” (2000s–).
3. Methodology
3.1 Data collection
The dataset was retrieved from the WoS core collection database. The search provided 412 articles (1994–2021), which were narrowed down to 126 by excluding papers irrelevant to DDL. The earliest publication in the dataset appeared in 1994, which was therefore taken as the starting point. A flowchart of detailed procedures and relevant descriptions is provided in Appendix A (available in supplementary material). The inclusion of the most recent studies (up to November 2021) in the dataset enabled an up-to-date analysis of DDL research that captures the latest citations.
Although our initial search in WoS focused primarily on research articles (citing papers), CiteSpace captures the cited papers in the references, which enables the inclusion of a wider range of publication types (e.g. dissertations, theses, book chapters, and meta-analyses). This thereby broadens the scope of the study and enables us to identify prominent or frequently co-cited publications from a wide range of document types in this field.
3.2 Co-citation analysis and SVA
Co-citation analysis is a common bibliometric approach used to measure the topic similarity between two or more documents. Co-citation is measured by “the frequency with which two or more publications are referenced in another publication” (Aryadoust, Zakaria, Lim & Chen, 2020: 2). If two documents are cited in one article, they are regarded as co-cited documents; the more co-citations two documents have, the greater their semantic relatedness. Highly co-cited pairs of publications grouped into the same cluster can display commonalities in research themes (Chen, Ibekwe-SanJuan & Hou, 2010). Co-citation counts can be used to generate a scientific map of knowledge in a field, which consists of clusters of co-cited publications. The identification of key research themes using co-citation contributes to understanding the evolution of common research themes in a field (Chen, 2017).
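To make the counting step concrete, the following is a minimal sketch of how co-citation frequencies can be derived from the reference lists of citing papers. It is an illustration only and not the procedure implemented in CiteSpace; the function name `cocitation_counts` and the toy reference lists are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(reference_lists):
    """Count how often each pair of references appears in the same citing paper."""
    counts = Counter()
    for refs in reference_lists:
        # De-duplicate and sort so that each pair has one canonical ordering.
        for pair in combinations(sorted(set(refs)), 2):
            counts[pair] += 1
    return counts


# Hypothetical citing papers, each represented by the references it cites.
citing_papers = [
    ["A", "B", "C"],
    ["A", "B"],
    ["B", "C", "D"],
]

counts = cocitation_counts(citing_papers)
for pair, n in counts.most_common():
    print(pair, n)
```

In this toy dataset the pairs (A, B) and (B, C) are each co-cited twice, so they would be linked more strongly in the resulting network than pairs that are co-cited only once.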
Given that co-citation analysis relies heavily on citation counts, it may be intuitively presumed that citation counts would be affected by factors such as early online publication and open access. We checked the dataset in this study and identified eight preprints out of 126 articles. According to Craig, Plume, McVeigh, Pringle and Amin (2007), the effect caused by the differing duration “diminishes with larger counting intervals” (p. 9); thus, the influence of preprints on citation is marginal. Also, the co-citation analysis used in this study is measured by calculating the co-citation of two references. On the one hand, early access publications in citing articles do not influence the results, as WoS combines early access and published papers into one record. On the other hand, if one preprint or open-access article is highly cited, it does not influence the prominence of the theme identified in this study unless the text is repeatedly co-cited with another text. Even if a particular preprint or open-access article is cited highly in conjunction with other articles on the same topic, the validity of this co-citation analysis remains unaffected as long as the cited text is related to the theme of the cluster scrutinised.
SVA is a predictive model operationalised as a function in CiteSpace (Chen, 2012), which aims to determine the transformative potential of a new publication in a field (Sebastian & Chen, 2021). The variation can be quantified based on information in the publication, mainly cited references. The higher the degree of variation, the more transformative a publication will be. Unlike co-citation analysis, which requires the information of accumulated co-citation counts, SVA is advantageous in assessing the transformative potential of ideas conveyed even in a very recent publication. This contributes to mitigating the chronological bias inherent in co-citation. By identifying studies with transformative potential, SVA can serve as a good indicator of potentially high-impact publications (regardless of their publication dates), and can thereby signal the direction of future research in a field. This methodological approach has been attested to be effective in identifying studies of high transformative potential, such as Nobel Prize-winning publications (Sebastian & Chen, 2021).
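One way to picture the “degree of variation” is as the divergence between a network property measured before and after the links introduced by a new publication are added. The sketch below assumes, for illustration, that the property is a vector of node centrality scores and that the divergence is the Kullback–Leibler divergence; the helper names, the smoothing constant `eps`, the direction of the comparison and the toy scores are all hypothetical, and the code is not CiteSpace’s implementation.

```python
import math


def normalise(scores, eps=1e-9):
    """Turn non-negative scores into a probability distribution (eps avoids zeros)."""
    shifted = [s + eps for s in scores]
    total = sum(shifted)
    return [s / total for s in shifted]


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(P || Q) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical centrality scores of four nodes in the baseline network.
before = normalise([0.50, 0.30, 0.20, 0.00])

# Scores after adding a publication whose references leave the structure unchanged.
after_unchanged = normalise([0.50, 0.30, 0.20, 0.00])

# Scores after adding a publication whose references rewire the network.
after_rewired = normalise([0.20, 0.20, 0.20, 0.40])

print(kl_divergence(after_unchanged, before))  # no structural change
print(kl_divergence(after_rewired, before))    # larger value, more variation
```

Under these assumptions, a publication that merely cites within existing structures yields a divergence of zero, whereas one that connects previously peripheral nodes yields a positive value.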
3.3 Network generation and analysis
The following parameters were used to address the research questions raised in this study. First, the modularity (Q) index and average silhouette score were adopted to measure the quality of the network, following Chen et al. (2010). The modularity score determines the clearness of boundaries between each pair of clusters, and high modularity scores signal that the network decomposes into recognisable clusters. The average silhouette is used to determine the quality of a clustering structure, and high average silhouette scores indicate the high reliability of the clusters in this study. When addressing RQ1, which concerns the common research themes and high-impact publications, three metrics were used: sigma (∑), betweenness centrality, and burst. Sigma, a measure of a publication’s novelty, was mainly used to identify prominent publications. The distribution of co-citations from the dataset and the number of prominent publications in different journals were used to identify the main research venues. To address RQ2, we then mapped these clusters onto the evolutionary stages of a discipline identified in Shneider’s (2009) model based on the defined qualities for each stage, including time frames, interrelations, and the embodiment of the characteristics specified in this framework. In order to find transformative publications in co-citation networks (RQ3), this study used centrality divergence (CKL) and the harmonic mean (H) scores. The detailed descriptions of the metrics in addressing the RQs are available in Appendix A, and the properties set in the analysis as well as screenshots of CiteSpace are elaborated in Appendix B (refer to supplementary material). Apart from the automatic analysis using CiteSpace, a manual analysis of the labels based on a close reading of the data source and the automatic labels in CiteSpace was conducted by two of the authors. To ensure the consistency of the coding, Cohen’s kappa was employed and the coefficient was found to be 0.93, indicating a high agreement in the labelling. Inconsistencies were resolved in a follow-up discussion.
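For readers unfamiliar with the agreement statistic, Cohen’s kappa compares the observed proportion of agreement between two coders with the agreement expected by chance. The following minimal sketch uses the standard formula; the function name `cohens_kappa` and the two label sequences are hypothetical and are not the authors’ coding data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both coders must label the same non-empty set of items.")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both coders.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the coders' marginal proportions per label.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1:
        return 1.0  # both coders used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels assigned by two coders to ten clusters.
coder_1 = ["tool"] * 5 + ["learner"] * 5
coder_2 = ["tool"] * 4 + ["learner"] * 6

kappa = cohens_kappa(coder_1, coder_2)
print(round(kappa, 2))  # 0.8
```

Here the coders agree on nine of ten items (observed agreement 0.9) against a chance agreement of 0.5, giving a kappa of 0.8.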
4. Results and discussion
This section first presents the findings of RQ1 and RQ2. As the common research themes and high-impact publications are embedded in the evolutionary stages, this study integrates the first two research questions, namely common research themes, high-impact publications and research venues (RQ1) and the developmental stages (RQ2) in Sections 4.1–4.4. This is then followed by a detailed account of the latest transformative research in DDL (RQ3).
4.1 Baseline network interpretation
A network of 469 co-cited references and 1,793 co-citation links was created by CiteSpace from our bibliometric dataset. The modularity score of the network was 0.79, indicating clear boundaries between each pair of clusters; the high quality of clustering configuration was attested by the high silhouette score of 0.92 (see Figure A3 in Appendix A). Forty-five clusters were identified automatically, 11 of which contained more than 10 studies, and were thus worthy of further investigation.
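As a rough illustration of what a modularity score expresses, the sketch below computes Q for a toy unweighted network, assuming the standard Newman–Girvan definition (the share of links falling within clusters minus the share expected at random). The function name `modularity`, the toy network and the cluster assignment are hypothetical, and CiteSpace’s own computation may differ, for instance in its handling of link weights.

```python
from collections import Counter


def modularity(edges, community):
    """Newman-Girvan modularity Q of an undirected, unweighted network."""
    m = len(edges)
    internal = Counter()  # number of links inside each cluster
    degree = Counter()    # sum of node degrees in each cluster
    for u, v in edges:
        degree[community[u]] += 1
        degree[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Toy network: two triangles joined by a single link.
edges = [
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("d", "e"), ("e", "f"), ("d", "f"),
    ("c", "d"),
]
two_clusters = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_cluster = {node: 0 for node in two_clusters}

print(modularity(edges, two_clusters))  # 5/14, roughly 0.36
print(modularity(edges, one_cluster))   # 0.0
```

Splitting the toy network into its two triangles yields a clearly positive Q, whereas treating it as a single cluster yields zero; values closer to 1, such as the 0.79 reported above, indicate sharply separated clusters.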
Figure 1 presents the timeline view of clusters from a diachronic perspective of DDL. The time span of each main cluster is presented by separate horizontal lines. Each cluster is arranged horizontally with the direction of time from left to right. Clusters are sequenced in vertical order by size. When starting from the top of Figure 1 and moving down line by line, we can see the co-cited references in the main clusters. The larger the tree ring is, the more highly co-cited the publication is. Coloured lines represent the co-citation links between each pair of publications. A detailed illustration of publications in Clusters #0–#2 is presented in Figure C1 in Appendix C (presented in supplementary material).
Table 1 displays these main clusters, sequenced by the number of co-cited publications in each cluster, from the largest, Cluster #0, to the smallest, Cluster #10, as well as the oldest, Cluster #17. As can be seen, the mean year of publication varied from 2003 (Cluster #3) to 2020 (Clusters #5 and #8). The time period of each cluster’s activeness was determined by considering the publishing years of all studies in each cluster. The largest cluster, Cluster #0, labelled “Effectiveness of DDL”, was active for 10 years (2011–2021), and involved 81 co-cited publications. One prominent publication identified in this cluster is Boulton and Cobb (2017), a meta-analysis examining the effectiveness of DDL (elaborated in Section 4.4.1). The two smallest clusters, #9 and #10, “Corpus-based materials in pedagogy” and “Discipline-specific corpora”, each comprised 12 studies. Examples of these studies include Frankenberg-Garcia (2012) in Cluster #9, which re-tested the benefits of corpus-based examples for learners’ comprehension and the capacity of error correction; and Borja (2007) in Cluster #10, which provided an overview of translation-specific corpora in Spain for translators and translating researchers. The oldest major cluster is Cluster #3, “Teacher education in DDL”, which comprised 38 publications. The most recent clusters, #5 (“Pedagogical implications of DDL”) and #8 (“Language teachers’ lesson planning”), have a mean publication year of 2020 and are still evolving in 2021 (as indicated by the co-cited publications in 2021 in these clusters). Cluster #5 included a meta-analysis by Cobb and Boulton (2015) focusing on the application of DDL in classrooms, and the review by Boulton (2017) on the explicit use of corpora in L2 learning and teaching. A representative publication in Cluster #8 is Zareva (2017), which surveyed L2 teachers’ attitudes towards the application of DDL in teaching grammar.
It is interesting to note that the oldest cluster, Cluster #17, contained the earliest co-cited publication, the Brown Corpus of American English (Francis & Kučera, 1989), although this cluster contained only four studies. Also of note, more than one cluster may be active at a time. For instance, Clusters #0, #4 and #10 were all active in 2011, which indicates that a variety of themes are valued during the same time period. This overlap in the time frames of clusters can be explained by the non-linear development of scientific fields, in which more than one prominent research topic emerges simultaneously. Also, many studies combine several research themes; for example, Boulton (2010a) in Cluster #4 examined both the effectiveness of DDL and low-level learners’ attitudes.
Table 2 displays four publications with the most recent bursts (until 2021). Typically, there is a gap between the publishing year and the burst year, which is called the post-publication lag. Taking Daskalovska (2015) as an example, the starting year of the citation burst is 2016, one year after the publishing date. In some cases, however, the two years may coincide, such as Boulton and Cobb (2017) and Lee et al. (2019).
Based on the research focus of each cluster and Shneider’s (2009) four-stage model, we identified the following three major stages in the development of the DDL field (RQ2): the conceptualising stage (1980s–1998), marked by the establishment of a new research object; the maturing stage (1998–2011), characterised by the development of research techniques and methods; and the expansion stage (2011–now), which features the application of instruments in new research domains, the addressing of new research questions, and the recent emergence of some features of Stage 4. However, the dividing point between evolutionary stages is not clear-cut, and temporal overlap between adjacent stages may occur. For instance, the transition year of 2011 featured a theme focusing on both techniques and the applications of those techniques. This can be explained by the inherently non-linear development of disciplines.
4.2 Stage 1: The conceptualising stage (1980s–1998)
As in other scientific disciplines, the first stage in DDL research is characterised by the introduction of “new objects and phenomena” that signal the emergence of a discipline (Shneider, 2009: 217). Typical for the first stage is the coinage of new terminology to describe the subjects in a field. The term “data-driven learning” dates from Johns (1991), while prior to this various other terms had been used, including classroom concordancing, the microcomputer-based approach to foreign language learning (Johns, 1988), concordancing (Bloch, 2009) and corpus-based learning (Cobb & Boulton, 2015). Johns (1991) proposed that language learners use concordancers to explore authentic language data. Other publications from this first stage include Murphy (1996), which reported the use of DDL to assist in vocabulary learning, and Kita and Ogata (1997), which reported the use of DDL for the acquisition of collocation knowledge. However, the literature in Stage 1 was not identified in the co-citation network, which can be explained by the low level of co-citations of articles from this stage. According to Shneider (2009), first-stage research, while usually creative and inventive, often possesses methodological weaknesses or inaccuracies, and is thus usually less cited than studies in the subsequent stages.
4.3 Stage 2: The maturing stage (1998–2011)
The most significant feature of Stage 2 concerns the creation of “a toolbox of methods and techniques” (Shneider, 2009: 217). The most representative clusters in this stage are #1, #2, #6, #9 and #10 (see Table 1), which are characterised by tool-centred publications. An additional significant feature entails a deeper analysis of the field. Parts of Clusters #1, #2, #4, #6 and #7 are dominated by learner-centred publications, which reflect primary investigations on the effectiveness of DDL and its implementation. The primary focus on the implementation of DDL in this stage complies with the characteristics of Stage 2 defined by Shneider (2009).
4.3.1 Tool-centred research
The most salient theme in Stage 2 is concerned with tool-centred publications, which focus on developing specific types of tools and techniques, including software and corpora, for different groups of learners. Clusters #1, #2, #6, #9 and #10 (see Table 1) are representative of such themes. Of these, Cluster #1 is the largest, with 46 publications. One prominent publication in Cluster #1 is Bloch (2009) (∑: 1.07), which designed the interface for a web-based concordancing program for academic writing. Although no high-impact publications were identified in the remaining clusters, the manual analysis revealed the dominance of tool-centred approaches to language teaching and learning. Anthony (2004) in Cluster #2 focused on the update of the corpus toolkit, AntConc, to assist in corpus building and analysis. Davies (2008), with the Corpus of Contemporary American English (COCA) in Cluster #9, and Burnard (2004), with the BNC Baby Corpus in Cluster #6, were identified as two important corpora. This is in line with previous findings of Park and Nam (2017), who found that COCA is the most cited in DDL. However, this work was not identified as a prominent publication here, possibly due to its relatively low co-citation with other studies on a similar theme, since researchers may draw on a single tool or source in their empirical analyses. Other publications involve the indexing system by Köhler, Philippi, Specht and Rüegg (2006) and the accuracy of part-of-speech tagging in corpora by Coden, Pakhomov, Ando, Duffy and Chute (2005) in Cluster #10, and the use of the English Interview Corpus in language teaching by Braun (2005) in Cluster #2.
4.3.2 Learner-centred research
Another notable theme in Stage 2 is learner-centred research, as shown in Cluster #1. The principal focus of this cluster is the introduction of DDL in language learning (e.g. Kennedy & Miceli, 2010) and the role of corpus consultation in language learning (e.g. O’Sullivan, 2007). Kennedy and Miceli (2010), a prominent publication (∑: 1.16), evaluated corpora as an aid to creative writing among intermediate-level language learners. O’Sullivan (2007), the second prominent publication (∑: 1.15), investigated the role of corpus consultation in process-oriented learning. The third prominent publication, Vannestål and Lindquist (2007) (∑: 1.08), integrated the use of corpora into university English grammar courses.
Three prominent publications were identified in Cluster #4, all authored by Boulton (2010a, 2009a, 2009b). Boulton (2010a), with the highest sigma value of 2.59 in this stage, tested the assumption that DDL is unsuitable for low-level learners. The study demonstrated the effectiveness of paper-based concordance materials (teacher prepared) for low-level learners, thereby eliminating the cognitive burden presented by the use of software and computers. Boulton (2009a) (∑: 1.27) provided evidence for the suitability of DDL among low-level learners, and Boulton (2009b) (∑: 1.14) attempted to popularise DDL in language learning classrooms.
The theme of learner-centred studies is also evident in Clusters #2 and #6. Chambers (2007), a prominent publication (∑: 2.08), was an early attempt to synthesize DDL studies, while Cresswell (2007) (∑: 1.15) found that learning styles affect learning outcomes. Both are book chapters. Similarly, Chambers (2005), a prominent publication in Cluster #6 (∑: 1.22), reported that individual differences (like learning styles and motivation) influence the success of DDL activities.
Cluster #7 also displays a primary focus on the learner-centred theme. Yoon (2011), which has a strong citation burst (∑: 1.45), reviewed 12 empirical studies that focus on the effectiveness and evaluation of DDL for L2 writing. Yoon noted that learners’ acquisition of linguistic knowledge in writing and their autonomy can be facilitated by DDL, and pointed out the importance of studies focusing on teacher training and classroom implementations, as well as variables that affect learners’ behaviours and learning outcomes.
4.4 Stage 3: The expansion stage (2011–now)
Conforming to Shneider’s (2009) four-stage model, studies in the third stage tend to apply the methods and techniques developed in the second stage to address new problems in different domains, such as speaking competence in Cluster #0 (Geluso & Yamaguchi, 2014). Thus, Stage 3 represents the theme of “expansion” in this field. Studies in this stage primarily focused on the application of DDL to a broader range of domains. The expansion stage included parts of Clusters #1, #4, #7 and #9, plus the whole of Clusters #0, #5 and #8. The emergence of themes such as “variables affecting DDL”, in Cluster #7, and “language teachers’ lesson planning”, in Cluster #8, is illustrative of the focus on new subjects and phenomena. Two main focal points of third-stage publications on DDL were identified, namely, effectiveness-centred and pedagogy-centred research.
4.4.1 Effectiveness-centred publications
Effectiveness-centred publications concentrated on the impact of DDL on learning outcomes, which are primarily determined by quantitative methods such as tests. Table 3 presents the studies with a strong focus on the effectiveness of DDL for learning collocations. As can be seen, two main clusters contained publications related to the effectiveness of DDL. Publications addressing the effectiveness of DDL were the most prominent in Cluster #0, among which Boulton and Cobb (2017) is the publication with the highest sigma across this stage. In this study, the authors undertook a meta-analysis to measure the effectiveness of DDL for language acquisition, and concluded with a call for longitudinal research and the incorporation of delayed post-testing. Another prominent publication, Smart (2014), examined the effectiveness of paper-based DDL for English as a Second Language (ESL) grammar and found more effective learning outcomes of inductive learning with printed corpus-based materials than deductive corpus-based and traditional approaches.
Huang (2014) focused on patterns of abstract nouns in L2 writing; Daskalovska (2015) concentrated on verb–adverb collocations, and Vyatkina (2016) analysed verb–preposition collocations. These studies reached a consensus on the positive role that DDL plays in facilitating learners’ acquisition of collocations. Although Frankenberg-Garcia (2014) was not identified as a prominent publication, a close examination revealed that it enjoys high impact, as indicated by its high co-citation frequency (8). This paper examined the impact of corpus-based examples on language comprehension and production, and found improvements in learners’ awareness of grammatical properties. In Cluster #8, Lee et al. (2019), a meta-analysis of the effectiveness of DDL for L2 vocabulary acquisition and the variables affecting learning outcomes, possessed the second highest sigma score.
4.4.2 Pedagogy-centred publications
Pedagogy-centred publications feature primarily in Clusters #0, #4 and #7. Unlike effectiveness-centred research, which typically measures the effectiveness of DDL through tests, pedagogy-centred research has a strong focus on the implementation of DDL in classroom environments (e.g. Charles, 2015; Flowerdew, 2012), the learners’ perceptions of DDL (e.g. Charles, 2014; Geluso & Yamaguchi, 2014), and factors influencing its implementation (e.g. Cotos, 2014).
Geluso and Yamaguchi (Reference Geluso and Yamaguchi2014) in Cluster #0 presented a curriculum design focusing on spoken fluency and surveyed students’ attitudes towards DDL (co-citation frequency: 8). Cotos (Reference Cotos2014), a frequently co-cited publication in Cluster #0 (with a co-citation frequency of 7), focused on the role of corpora in students’ language learning by comparing their interactions with a local learner corpus and a native-speaker corpus. Charles (Reference Charles2014), also co-cited seven times in Cluster #0, conducted qualitative research on the use of self-built corpora from a longitudinal perspective. Pérez-Paredes, Sánchez-Tornel and Calero (Reference Pérez-Paredes, Sánchez-Tornel and Calero2012), a frequently co-cited publication in Cluster #4 (co-cited five times), examined learners’ search strategies in DDL activities. Flowerdew (Reference Flowerdew2012), the most representative and most co-cited publication in Cluster #7, focused on applications of DDL in classrooms, and discussed the impediments to DDL in pedagogy and the pedagogical application of corpora.
Additionally, the analysis of the labels shows that DDL studies have displayed some features of the fourth stage. For instance, Cluster #5 (“Pedagogical implications of DDL”) and #8 (“Language teachers’ lesson planning”) reflect the emerging Stage 4 in the DDL field. Vyatkina (2020) and Chambers (2019) in Cluster #5, as well as O’Keeffe (2021) in Cluster #8, agreed on the positive impact of DDL on learning outcomes and the advantages of DDL practices in various contexts. The need for theoretical underpinnings from the area of second language acquisition was also emphasised by O’Keeffe (2021) and Lee et al. (2019). These calls conform to the features of Stage 4 that involve broader applications of knowledge generated in the first three stages for various practical purposes (Shneider, 2009). Another notable feature of the fourth stage is the publication of meta-analyses and reviews (Shneider, 2009), and several such publications provide evidence of the emergence of the fourth stage (e.g. Boulton & Vyatkina, 2021; Lee et al., 2019). However, current research has still not fully addressed the role played by variables in DDL such as the relative explicitness of instruction and cognitive learning processes (Chambers, 2019), which indicates that Stage 3 is still ongoing. Thus, the current research status characterises the end of Stage 3 and the beginning of Stage 4.
Regarding the main research venues in RQ1, this study carried out a journal co-citation analysis to identify the most frequently co-cited journals in DDL. Table 4 lists the top 10 journals, sequenced by co-citation frequency. Among them, Computer Assisted Language Learning (96), ReCALL (87) and Language Learning & Technology (76) are identified as the most frequently co-cited journals in the field, and are thus the main venues for DDL research. This corresponds to the common aim of all three journals, which is to encourage technology-mediated language learning and teaching, especially research involving innovative practices. Other prominent journals include System and Applied Linguistics, with co-citation counts of 74 and 64, respectively. Figure C2 in Appendix C displays a list of highly cited journals.
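To illustrate the idea behind journal co-citation counting, the following minimal sketch tallies how often two journals appear together in the reference list of the same citing article. It follows the textbook definition of co-citation and is not a reproduction of CiteSpace's internal computation; the function names and the toy reference lists are hypothetical and do not come from the dataset analysed here.

```python
from collections import Counter
from itertools import combinations


def journal_cocitation(reference_lists):
    """Count, for each unordered journal pair, the citing articles that cite both.

    `reference_lists` holds one list of cited-journal names per citing article
    (a hypothetical input format chosen for this illustration).
    """
    pair_counts = Counter()
    for journals in reference_lists:
        # A journal cited several times by one article counts once for that article.
        unique = sorted(set(journals))
        for a, b in combinations(unique, 2):
            pair_counts[(a, b)] += 1
    return pair_counts


def journal_totals(pair_counts):
    """Sum each journal's co-citation links over all of its partner journals."""
    totals = Counter()
    for (a, b), n in pair_counts.items():
        totals[a] += n
        totals[b] += n
    return totals


# Toy reference lists (invented for illustration only).
toy = [
    ["ReCALL", "System", "Language Learning & Technology"],
    ["ReCALL", "System"],
    ["ReCALL", "Applied Linguistics", "ReCALL"],
]
pairs = journal_cocitation(toy)
totals = journal_totals(pairs)
for journal, n in totals.most_common():
    print(journal, n)
```

In this toy example, ReCALL accumulates the most co-citation links because it shares reference lists with every other journal. Sorting the totals in descending order gives a ranking analogous in form to the one reported in Table 4.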
Finally, we calculated the impact of the journals by considering the number of prominent publications appearing there. As the co-citation count of a journal is closely associated with the number of publications it contains, the distribution of burst publications offers a more objective view of influential journals. The analysis shows that the 19 burst publications across three stages are distributed unevenly across nine journals. These publications appeared predominantly in ReCALL (seven papers), which indicates that ReCALL is the main source of prominent DDL studies. Other important publication venues are Language Learning & Technology (three papers) and Language Learning (two papers). Several other journals each contain a single burst publication: Applied Linguistics, Computer Assisted Language Learning, English for Specific Purposes, Indian Journal of Applied Linguistics, and Journal of English for Academic Purposes. This indicates wide interest in DDL from other journals in applied linguistics.
4.5 Latest transformative research
To answer RQ3, an SVA was conducted to identify transformative research in the last three years (2019–2021) and predict future directions of DDL research (the specific results are shown in Table 5). The analysis identified seven transformative publications (sequenced by the CKL score, as introduced in Appendix A). Among these studies, five are empirical studies and the remaining two are reviews.
More specifically, two transformative studies focused on teacher education in DDL. The first one, Chen, Flowerdew and Anthony (2019), reported the success of a teacher training workshop that introduced corpus-based academic writing pedagogy to English teachers in Hong Kong. In the second study, Crosthwaite, Luciana and Wijaya (2021) examined the effectiveness of a DDL training program for teachers. This suggests that teacher training is an urgent need and a prerequisite for large-scale classroom implementations of DDL, which is in line with the results in Chen and Flowerdew (2018).
Three transformative studies examined different factors in DDL activities. Sun and Hu (2020) investigated the difference between paper- and computer-based corpus-informed exercises to support Chinese undergraduates’ acquisition of hedging in writing. Crosthwaite, Storch and Schweinberger (2020) examined the effectiveness of DDL for learners’ resolution of errors, with consideration of different degrees of directness in the written corrective feedback provided by teachers. Crosthwaite, Wong and Cheung (2019) identified corpus query and usage patterns based on actual data collected from an online corpus platform. This shows that current DDL studies display a predominant interest in implementing DDL in classroom practices. The investigation of variables affecting the effectiveness of DDL, such as the type of activities, the role of written corrective feedback, and learners’ query strategy in using corpora, is at the centre of current work.
Of interest is that the two transformative review studies, Boulton and Vyatkina (2021) and Vyatkina (2020), report similar directions for the future development of DDL. Both publications point out that future studies may need to focus on the integration of DDL for teaching LOTEs (languages other than English), DDL practices among learners of different proficiency and age levels, and open-access resources of DDL integrated with user guides and exercise collections for specific corpora. Boulton and Vyatkina (2021) also emphasised the necessity of advancing theories in DDL and considering different forms of learner interaction with corpora (e.g. multimedia corpora with video and sound).
5. Conclusion
This study provided a diachronic and systematic review of the development of the DDL field by implementing a co-citation analysis, SVA and close manual analysis of a corpus of 126 publications collected from the WoS core collection. In addressing RQ1 (common themes, publications and venues) and RQ2 (developmental stages), this study identified 11 main clusters and 19 prominent publications, as well as three major evolutionary stages of DDL research (namely, the conceptualising stage, the maturing stage and the expansion stage). These stages represent a shift in academic interest from the establishment of techniques and testing the effectiveness of DDL for language acquisition to the implementation of DDL in classroom practice with consideration of a range of variables, with new features of Stage 4 emerging. Current interest involves more nuanced research results that incorporate different variables, the review of knowledge generated in the first three stages and the practical implementation of knowledge in this field. The results for research venues in RQ1 indicated that Computer Assisted Language Learning, ReCALL and Language Learning & Technology are the main venues for DDL research, while ReCALL is the most influential venue for DDL research in terms of prominent publications in this field. The findings from RQ3 (publications with potentially high impact) show that the main areas of future research are the implementation of DDL in classroom teaching and teacher training.
The analysis shows that researchers have reached a consensus that DDL plays a positive role in promoting learning outcomes; however, little is yet known about different variables inherent in various pedagogical approaches to DDL, and individual learner differences have only begun to be addressed. Therefore, future studies may consider expanding inquiries in this line by including a more specific analysis, such as introducing DDL in classrooms, organising teacher training workshops, and examining the effect of variables (both activity and learner related) in DDL. More nuanced study designs are needed to assess DDL in different pedagogical contexts with different levels of learners.
Despite the advantages inherent in a large-scale bibliometric analysis, limitations in this approach need to be recognised. First, co-citation analysis presents a bias that favours older publications, as recent but potentially high-impact publications have had less opportunity to be cited. In this study, although SVA was employed to mitigate this bias, the analysis was used to identify the studies of transformative potential. It thus could not fully address the inherent problem of co-citation analysis in failing to compensate for recent publications. Future approaches may need to consider solving this issue by assigning greater weighting (e.g. through a weighting algorithm or normalisation) to recent papers to mitigate the chronological bias. Similarly, regarding the influence of open access on citations, 3 out of 17 impactful research articles in our dataset were found to be open access. Although it is beyond the scope of this study to examine the relationship between citation counts and relevant factors such as open access, it is certainly of value and interest to explore this issue in the future. Second, we acknowledge the Matthew effect, according to which well-known authors are more likely to be cited than less well-known authors. While the approach uses co-citations to measure the importance of individual studies, future studies may take into account other factors related to citation practice. Third, it is necessary to point out that the analysis in this study is based on articles in the WoS core collection and their reference lists. This might underestimate the impact of some highly cited or influential publications that are not indexed in the WoS core collection or not co-cited in their reference lists.
For example, influential publications by Johns (1991) and Davies (2008) were not identified as highly co-cited studies in the co-citation analysis, possibly due to their relatively low co-occurrence with studies on similar themes despite their high impact as single studies. Another potential explanation for why tools like COCA and AntConc were not identified as prominent publications is inconsistent citation practice. Some authors might use AntConc as a tool without citing it, or might variously cite one of several papers or different versions of the same software. Thus, caution may be needed when using co-citation results alone to gauge the influence of a study. Future bibliometric analyses should thus include literature from various academic platforms and a wider range of document types (e.g. dissertations, theses, book chapters, and meta-analyses). Additionally, although the bibliometric analysis based on the labels automatically generated from the citing articles and their references using CiteSpace can produce stable results for DDL studies between 1994 and 2021, further bibliometric analyses may be needed to explore new research themes, future developmental stages, and prominent publications to keep abreast of the evolving landscape of DDL.
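As one possible illustration of the normalisation suggested above for mitigating chronological bias, the sketch below divides each publication's citation count by the number of years it has been available to be cited. This is only one of many conceivable weighting schemes and was not applied in the present study; the function names, the record format and the toy figures are all hypothetical.

```python
def citations_per_year(count, pub_year, census_year):
    """Return the citation count divided by the years available for citation.

    The publication year itself counts as one year, so a paper published in
    the census year is divided by 1 rather than 0.
    """
    if pub_year > census_year:
        raise ValueError("publication year is later than the census year")
    return count / (census_year - pub_year + 1)


def rank_by_age_normalised(records, census_year):
    """Sort records from highest to lowest citations per year."""
    return sorted(
        records,
        key=lambda r: citations_per_year(r["count"], r["year"], census_year),
        reverse=True,
    )


# Toy records (invented for illustration; not taken from the dataset).
toy_records = [
    {"id": "A", "year": 2010, "count": 24},
    {"id": "B", "year": 2019, "count": 9},
    {"id": "C", "year": 2015, "count": 7},
]
raw_order = [r["id"] for r in sorted(toy_records, key=lambda r: r["count"], reverse=True)]
normalised_order = [r["id"] for r in rank_by_age_normalised(toy_records, census_year=2021)]
print(raw_order, normalised_order)
```

In this toy case, the older publication A leads on raw counts, whereas the more recent publication B moves to first place once the counts are expressed per year. A simple per-year rate can overcorrect for very recent papers, so alternative schemes (e.g. field- or cohort-based normalisation) may be preferable in practice.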
Employing a bibliometric approach, this study provided a comprehensive picture of the development of DDL with respect to its developmental stages, the state-of-the-art, common research themes, high-impact publications, research venues and potential research directions. There is a clear need for bibliometric studies to analyse further or more detailed aspects of DDL, such as author co-citation and collaborations across regions. Future bibliometric studies (particularly those that employ CiteSpace) may compare their results with this study to identify potential changes in research direction. Researchers can also use CiteSpace to familiarise themselves with existing knowledge or identify the latest trends in a new field.
Supplementary material
To view supplementary material referred to in this article, please visit https://doi.org/10.1017/S0958344022000222
Acknowledgements
We would like to acknowledge our appreciation for the support received from the National Social Science Foundation of China (No. 21FYYB052) and Social Science Foundation of Shaanxi Province (No. 2020K025).
Ethical statement and competing interests
The research was conducted in accordance with Shandong University’s code of conduct for responsible research. No ethical issues exist in this study. The authors declare no competing interests.
About the authors
Jihua Dong is Professor, Qilu Young Scholar, and Taishan Young Scholar at Shandong University, China. Her research interests include corpus linguistics, data-driven teaching, and academic writing. She has published in English for Specific Purposes, International Journal of Corpus Linguistics, Journal of English for Academic Purposes, and System, among others.
Yanan Zhao is a PhD student in the School of Foreign Languages and Literature at Shandong University. She has obtained a master’s degree in languages and linguistics from the University of Melbourne, Australia. Her research interests include data-driven learning, second language learning and teaching, and corpus linguistics.
Louisa Buckingham lectures in applied linguistics at the University of Auckland. She has published on corpus-informed discourse analysis, language learning and sociolinguistics. She has published in various journals, including TESOL Quarterly, System, Journal of English for Academic Purposes, English for Specific Purposes, and Journal of Multilingual and Multicultural Development.
Author ORCIDs
Jihua Dong, https://orcid.org/0000-0001-7864-2319
Yanan Zhao, https://orcid.org/0000-0001-6840-9580
Louisa Buckingham, https://orcid.org/0000-0001-9423-0664