Definite and indefinite article accuracy in learner English: A multifactorial analysis

Kateryna Derkach; Theodora Alexopoulou

doi:10.1017/S0272263123000463

Published online by Cambridge University Press: 13 October 2023

Kateryna Derkach*: Affiliation:
University of Cambridge, Cambridge, England
Theodora Alexopoulou: Affiliation:
University of Cambridge, Cambridge, England
*: Corresponding author: Kateryna Derkach; Email: [email protected]

Abstract

We present a learner corpus-based study of English article use (“a”/“the”/Ø) by L2 learners with four typologically distinct first languages (L1s): German and Brazilian Portuguese (both have articles), Chinese and Russian (no articles). We investigate several semantic and morphosyntactic factors (for example, specificity and prenominal modification) that can affect article use. Our analysis of 660 written scripts from the Education First Cambridge Open Database confirms the lower overall accuracy of learners with no-article L1s. Our main finding is the differential effect of specificity on definite and indefinite articles: learners tend to associate specificity with “a,” which implies article omission with nonspecific indefinite singulars and overuse of “a” with specific indefinite mass nouns. Prenominal modifiers further contribute to perceived specificity, leading to article overuse with modified indefinite mass nouns. However, in definite contexts, prenominal modifiers are associated with increased article omission.

Type: Research Article
Information: Studies in Second Language Acquisition, Volume 46, Issue 3, July 2024, pp. 710-740

DOI: https://doi.org/10.1017/S0272263123000463
Creative Commons: This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Open Practices: Open data
Copyright: © The Author(s), 2023. Published by Cambridge University Press

Introduction

Articles present persistent challenges for second language (L2) English learners (henceforth, learners), even at near-native level (DeKeyser, 2005; Murakami & Alexopoulou, 2016a). Articles are among the highest frequency English words, with “the” always topping frequency lists (Leech et al., 2001). However, bare nominals (i.e., those not preceded by an article, henceforth labeled Ø contexts¹) can account for approximately 50% of all nominals in native English (Master, 1997). Therefore, learners may notice the frequently occurring articles in English but struggle to understand why they often appear to be omitted.

Article use is influenced by learners’ first language (L1) and proficiency, as well as by the linguistic characteristics of nominals: morphosyntactic (number, countability) and semantic (definiteness, specificity) features and article discourse functions.² Many studies have demonstrated the relevance of such features (Ionin et al., 2004; Liu & Gleason, 2002; Robertson, 2000; Snape, 2008; Trenkic, 2007), but none have considered these features together, so their relative importance and interactions are unknown. We aim to fill this gap and build a more comprehensive picture of learner article use.

Background

Articles in English and other languages

Languages vary considerably with respect to articles and their use (Lyons, 1999). Many major world languages lack articles (e.g., most Slavic languages, Hindi, Japanese), and some only have the definite (Arabic, Hebrew) or the indefinite (Turkish) article (Dryer, 2013).

In English, the distribution of the three options (“the,” “a,” and Ø) is determined by semantic and morphosyntactic factors. Whereas the definite/indefinite distinction is semantic, the use of “a” (vs. Ø) is based on countability and number—namely, “a” is required for count singulars. Nevertheless, count singulars appear bare in English in such frequently used expressions as “go to school/hospital,” “travel on foot/by car/plane,” “stay in bed/at home.” Thus, extracting patterns from article distribution in target English creates challenges for learners.

The article systems of other languages ([+art], e.g., Germanic, Romance languages) may deviate from the English pattern—for example, some Romance languages do not allow bare mass nouns in argument position (“I eat meat”—“Je mange de la viande” in French).

Though definiteness is not grammaticalized in languages that have no articles, [-art], they often have notions akin to definiteness (e.g., familiarity, specificity), which can be expressed via lexical means (demonstratives “this/that”) or syntactically (word order). Additionally, definiteness may be linked to other discourse and nominal features in distinct ways; for example, in Chinese only definite and human referents can be marked as plural (Lardiere, 2009).

Given the variation in article presence and the meanings articles grammaticalize across languages, articles are expected to present challenges for [-art] and potentially for [+art] learners.

Key semantic distinctions

Definiteness

The literature on definiteness is vast and outside our scope (Frege, 1960; Hawkins, 1978; Lyons, 1999; Russell, 1905, inter alia). It suffices to state here that the definite article makes presuppositions of uniqueness and existence, whereas the indefinite article implicates nonuniqueness (Hawkins, 1991; Heim, 2019). In (1) the speaker asserts that the book exists and is uniquely identifiable by the speaker and the hearer. When the condition of uniqueness is not met, the indefinite article is used (2).

Whether uniqueness can be presupposed depends on the knowledge shared by the speaker and the hearer, available from the discourse and/or immediate or larger, pragmatic/cultural context. In (1) there may be only one book on definiteness in existence or known within the speaker and hearer’s context (presupposition based on context). Alternatively, the speaker and the hearer may have discussed a certain book on definiteness (presupposition based on discourse), although they may be aware of others. In contrast, in (2) the speaker assumes that neither common knowledge nor discourse can help the hearer to uniquely identify the book.

For mass/plural referents, uniqueness applies to the entire mass or set, respectively (Heim, 2019, p. 36). Example (3) refers to all the applicants who failed their exams, who should be known (e.g., all the applications already received). However, in (4) any applicant satisfying the description has to be rejected, but the entire set is unknown (e.g., applications are still coming in). Similarly, in (5) the speaker implies that all the expired meat was discarded, whereas in (6) there may be some that was not.

Specificity

A property cutting across the definiteness/indefiniteness distinction is specificity (not directly encoded in English). As with definiteness, there is extensive literature on the topic. Despite the variation in terminology and criteria for defining specificity, one common feature across the different accounts is the “referential intention,” or the speaker’s communicative intention to refer to something they have in mind (von Heusinger, 2019).

In this article, we define a specific referent as one that refers to a certain entity that exists in the world and that the speaker has in mind. For example, “a book” in (2) is specific, as it is referring to a certain existing book. This definition is similar to Bickerton’s “specific reference” (1981). Note the main difference from the definition adopted in some influential literature on L2 articles (most notably Ionin et al., 2004, and replications), which is that noteworthiness (i.e., whether the speaker deems the referent noteworthy for the discourse) is not an essential feature. Therefore, (2) is specific when introducing a new referent into the discourse (first-mention indefinite) even if the identity of the book is irrelevant for further discourse (not noteworthy).

Nonspecific readings become possible in semantically opaque contexts—that is, those involving opacity-creating operators, such as verbs of propositional attitude (e.g., “want,” “believe”), negation, questions, conditionals, modals, future, and intensional verbs (e.g., “look for”). In (7), the speaker may have a certain book in mind (specific reading) or may be satisfied with any book about definiteness (nonspecific reading).

The two possible readings are said to have wide scope (specific) or narrow scope (nonspecific). Such ambiguity can also occur in nonopaque contexts, as in (8) and (9) taken from Lyons (2009, p. 172).

In (8), “a student” refers to a certain student and is, thus, referential. In (9), “a student” has no fixed reference and is, thus, nonreferential. This referentiality distinction also applies to definites: in (10) the winner is known, unique, so the nominal is referential, whereas in (11) there is usually one winner, but their identity is unknown because the competition has not finished yet, so “the winner” has no fixed reference.

Our choice of the term “specificity” rather than “referentiality” is based on Lyons’s suggestion to use “specificity” as an “informal cover term” (1999, p. 173) to include both the wide/narrow scope and the referential/nonreferential distinctions.

To summarize, our definition of specificity is looser than that in some literature on L2 articles. The only essential criteria for specificity here are that the nominal refers to an existing (in a general sense) entity and that the reference is fixed.

L2 article accuracy: Previous research

L1

Studies comparing L2 English article accuracy by [+art] and [-art] learners unequivocally suggest that the latter have significantly more difficulties. This has been observed in naturalistic spoken data (Master, 1987; Thomas, 1989), gap-fill tasks (Hawkins et al., 2006; Ionin et al., 2008; Reid et al., 2006; Snape, 2008), and in a large-scale corpus-based study with various [+/-art] learners (Murakami & Alexopoulou, 2016a).

Semantic features: Specificity

Early studies documented increased use of “the” in [+specific;-definite] contexts (following Bickerton’s definition), such as first-mention indefinites (Huebner, 1985; Thomas, 1989). Butler (2002) found evidence of using “the” with [+specific;-definite] referents (following Bickerton) in L1-Japanese learners (12). Additionally, lower-level learners often used “the” in the presence of prenominal modifiers, which they thought indicated specificity (in an informal sense).

Hua & Lee (2005) found L1-Chinese learners tended to accept ungrammatical bare count singulars which were nonspecific (for example, “Computer is an electronic device,” as opposed to specific “Computer stands on the top of the office desk”) in subject position, regardless of nouns’ concreteness/abstractness.

Ionin et al. (2004) hypothesized that [-art] learners interpret articles as markers of both definiteness and specificity³ or of specificity alone, demonstrating a fluctuation pattern. In forced-choice elicitation (gap-fill), L1-Russian and L1-Korean participants used “the” more with [+specific;-definite] (13) vs. [-specific;-definite] (14) referents and used “a” more with [-specific;+definite] (15) vs. [+specific;+definite] (16) referents.

Ionin et al.’s (2004) replications showed similar patterns in [-art] learners (Japanese: Hawkins et al., 2006; Reid et al., 2006; Mandarin-Chinese: Snape, 2009; Trenkic, 2008) and no fluctuation in [+art] learners (Spanish: García Mayo, 2009; Ionin et al., 2008; Reid et al., 2006; Ting, 2005; Greek: Hawkins et al., 2006). However, Ionin et al. (2008) observed the effect in L1-Russians only in indefinite contexts, that is, (13) but not (15). Meanwhile, Ting (2005) did not find evidence of fluctuation in L1-Mandarin-Chinese learners. Snape et al. (2006) suggest (based on Li & Thompson, 1981) that this may be because Chinese is developing definite (demonstrative “nei,” meaning “that”) and indefinite (numeral “yi,” meaning “one”) articles.

To establish specificity (speaker intent to refer to a noteworthy item), Ionin et al. included explicit statements of speaker knowledge (ESK), for example, “He is meeting with the director of his company. I don’t know who that person is” (Trenkic, 2008, p. 13, emphasis added). By adding another type of test item (specific but denying speaker familiarity), Trenkic’s replication showed that learners tended to overuse “a” in definite [-ESK] contexts and overuse “the” in indefinite [+ESK] contexts regardless of specificity.

Trenkic (2007) alternatively suggests that [-art] learners misanalyze articles as adjectives (therefore, optional elements), taking “the” and “a” to mean “definite/identifiable” and “indefinite/unidentifiable,” respectively. In production data from L1-Serbians, she observed increased article omission with adjectives. Trenkic concluded that articles in [-art] learners are not syntactically motivated but produced as lexical items motivated by the pragmatic need to express the meaning learners assign to them. Based on Trenkic’s suggestions, we hypothesize that learners may erroneously assign the meaning “specific”/“nonspecific” to “the”/“a” in addition to or instead of the meaning “definite”/“indefinite”. Trenkic argues that prenominal modification correlates with article omission because producing lexical items drains cognitive resources, leading to omission of less (communicatively) important items.

To summarize, naturalistic and elicited data from participants with various L1s have shown that [-art] learners might be uncertain about whether English articles signify specificity or definiteness, although the effect is not always observable (Hua & Lee, 2005; Ting, 2005; White, 2003).

Nominal features

Number and countability play an important role in learners’ article use. Lardiere (2004) suggests, based on a feature-assembly approach, that the inclusion of number and countability features is what makes the indefinite article more complex and difficult to acquire (also Hawkins et al., 2006). This is confirmed by Zhao and MacWhinney (2018) working within the competition model. Their L1-Chinese intermediate-advanced learners were more accurate on a cloze test with postmodified nominals, which are a cue for “the” (e.g., “the book which you recommended”), than with nominals without postmodifiers, which may be a cue for “a” (count singulars) or Ø (mass/plurals).

Trenkic’s (2002) study considered multiple nominal features, including number, countability, and abstractness. Lower-intermediate to advanced L1-Serbians were more accurate in obligatory “the” vs. obligatory “a” contexts (text-translation task) and showed evidence of “the-flooding” (Huebner, 1983), specifically using “the” instead of “a” with count-singular indefinites.

Additionally, Trenkic (2002) revealed that learners were more accurate in supplying “a” with abstract (e.g., “environment”) than with concrete (e.g., “book”) nouns. She explains the initially surprising pattern by hypothesizing that learners use “a” to individuate concepts with no clear boundaries (“fuzzy” concepts), such as abstract referents. This accounts for the higher “a” omission with concrete nouns, which do not require such individuation. Trenkic also suggests that “the” may be perceived as a marker of “definite” or concrete form, explaining its higher incidence with indefinite concrete count singulars. This is despite Trenkic’s initial expectation that learners would struggle more with “fuzzy” abstract referents. In comparison, Hua & Lee’s (2005) L1-Chinese participants would often misdetect countability of abstract referents: in a grammaticality judgment task, they more readily accepted “much sentence” (abstract) than “much computer” (concrete). This is, however, indirect evidence, as learners accepting “much sentence” would not necessarily omit “a” before “sentence.” In Butler (2002), up to 20% of L1-Japanese learners’ errors were due to misdetection of countability, especially of abstract referents, although the paper does not detail error types, so it is not easily comparable with Trenkic (2002).

Snape (2008) argues that learners with L1s without a count/mass distinction (e.g., Japanese) may associate definiteness with number features, that is, use articles only with count singulars, as bare mass/plural nouns are allowed in English. In a forced-choice elicitation task (definites only), advanced L1-Japanese participants tended to omit “the” with plural/mass nouns, for example, “I’ve just finished our new patio. […] Mixing [the] cement was difficult” (Snape, 2008, p. 70), whereas advanced L1-Spanish learners omitted “the” to a lesser extent and only with plurals in larger situation/cultural contexts, for example, “Have you seen [the] bridesmaids?” at a wedding (Snape, 2008, p. 66).

In summary, number and countability affect article accuracy of [-art] learners, who may generalize the bare mass/plural indefinite pattern in English to all mass/plural nouns. The count/mass distinction, especially in abstract nouns, further influences article accuracy if learners’ L1 lacks grammatical number. Learners may also use “a” to individuate concepts without clear boundaries (abstract referents, mass nouns), whereas “the” may mark clearly bounded entities (concrete referents, count singulars).

Research objectives

Based on previous research and cross-linguistic differences, multiple factors may influence learners’ article interpretation and use: [+/-art] L1 and proficiency level, semantic features not encoded in English (specificity, familiarity, abstractness), and morphosyntactic and syntactic features (number, countability, syntactic position, premodification).

We seek to investigate how these factors interact to predict article accuracy and error types (omission, substitution, overuse) in learners. The benefit of including multiple factors is the ability to analyze their interactions, which may reveal differential effects. For example, an interaction between specificity and definiteness might show that learners tend to supply “the” with specific definites but omit “a” with specific indefinites. This is a step forward from previous research, where findings were either limited to a certain category of nominals (Ionin et al., 2004, and replications often include only concrete count singulars) or aggregated across different categories (Murakami & Alexopoulou, 2016a, a large-scale study, aggregated results across number, countability, abstractness, etc.).

Including multiple factors can also help avoid overestimating the effect of any one predictor, as multiple regression modeling allows for the estimation of the effect of each variable while keeping the rest constant.

Methodology

Learner data

Corpus

We used a large open-access learner corpus, EFCAMDAT, the EF Education First Cambridge Open Language Database (Alexopoulou et al., 2015, 2017; Geertzen et al., 2013; Michel et al., 2019). EFCAMDAT contains 1,180,310 writings (scripts) responding to communicative tasks (holiday postcards, film reviews, describing personal experiences), submitted by registered learners of a large number of nationalities (anonymized) to EF’s online language school.⁵ EFCAMDAT contains 16 proficiency levels aligned with the Common European Framework of Reference (CEFR) and is pseudolongitudinal (most learners do not complete all the levels). New learners are placed in levels 1/4/7/10/13 based on placement tests. Each level comprises eight modules, each ending with a writing task. Because EFCAMDAT does not contain direct L1 data, National Language (NL), crossing nationality with country of access to the online school, is used as a proxy for L1. This has been shown to be quite reliable (Alexopoulou et al., 2015; Murakami, 2013) despite the inevitable noise in the data (e.g., multilingualism is not captured).

Subcorpus

We sampled writings from two [+art] subgroups (German and Brazilian Portuguese, henceforth Brazilian), and two [-art] subgroups (Chinese, Russian). We included two languages from each language type to tease apart the typological effect of article presence/absence from potentially L1-independent effects (Murakami & Alexopoulou, 2016a).

We sampled 660 scripts (165 per NL), which according to our power analysis based on a simulated dataset in R (R Core Team, 2021) would provide 80% statistical power (details in the Online Supplementary Materials).

The scripts were randomly selected and equally distributed across A2–B2 CEFR levels (inclusive), corresponding to EF levels 4–12,⁶ and across topics, that is, the specific writing prompts at the end of each module (e.g., “Write a short autobiography”). No more than one script was contributed by the same learner (examples in Figure 1).

Figure 1. Example scripts with article errors marked.

Coding

We manually retrieved all nominals from the scripts. We treat as a nominal any phrase consisting of a noun with an optional article and an optional prenominal modifier (for example, “book,” “an interesting book,” “the books”), excluding demonstratives and quantifier items, such as “this book” or “many interesting books” (for the full list of exclusions, see the Online Supplementary Materials).

We excluded formulaic sequences (e.g., “for example,” “all over/around the world,” “in the morning,” “twice a week,” “make a long story short”), which are expected to have higher accuracy rates (Myles, 2012; details in the Online Supplementary Materials).

Additionally, we excluded sequences provided in writing prompts and model answers, which learners often copied in their writing. All the coded variables are listed in Table 1 (see coded example in the Online Supplementary Materials).

Table 1. Variable coding

To determine coding reliability, 100 randomly selected items were coded by another doctoral student of applied linguistics (English native speaker). The level of agreement was strong for all variables, κ > 0.85 (McHugh, 2012).
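For readers unfamiliar with the κ statistic, the following is a minimal Python sketch of Cohen's kappa for two coders. It is not the authors' procedure (the analysis in this article was done in R), and the function name `cohen_kappa`, the label values, and the ten example codes are hypothetical illustrations rather than study data.

```python
from collections import Counter

def cohen_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders labelling the same items."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must label the same non-empty item list")
    n = len(coder_a)
    # Observed agreement: share of items given identical labels.
    p_observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement: product of the coders' marginal label proportions.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    p_chance = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both coders used a single identical label throughout
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical specificity codes for ten nominals (not the study's data).
first_coder = ["spec", "spec", "nonspec", "spec", "nonspec",
               "nonspec", "spec", "spec", "nonspec", "spec"]
second_coder = ["spec", "spec", "nonspec", "spec", "nonspec",
                "spec", "spec", "spec", "nonspec", "spec"]
kappa = cohen_kappa(first_coder, second_coder)
```

In this toy example the coders agree on nine of ten items, but kappa is lower than the raw 90% agreement because part of that agreement is expected by chance.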

The resulting subcorpus contained 5,772 nominals (Table 2).

Table 2. Distribution of nominals retrieved

Statistical modeling

Binomial and multinomial mixed-effects logistic regression

We investigated the effect of our independent variables on article accuracy using a generalized linear mixed-effects logistic regression model (henceforth, accuracy model), where the dependent variable is binary (correct/incorrect article [non]use), using the lme4 package in R (Bates, Mächler, et al., 2015).
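As a reading aid, the generic shape of such a model can be written as follows. This is a sketch, not the fitted specification: the symbols are notation introduced here for illustration, and the single random intercept per writing is a simplifying assumption. The actual structure is the one given in the accuracy model formula (Table 4).

```latex
\operatorname{logit} P(y_{ij} = 1)
  = \log \frac{P(y_{ij} = 1)}{1 - P(y_{ij} = 1)}
  = \beta_0 + \mathbf{x}_{ij}^{\top} \boldsymbol{\beta} + u_i,
\qquad u_i \sim \mathcal{N}(0, \sigma_u^2)
```

Here $y_{ij} = 1$ when article (non)use on nominal $j$ in writing $i$ is correct, $\mathbf{x}_{ij}$ collects the predictor values for that nominal, $\boldsymbol{\beta}$ holds the fixed-effect coefficients on the log-odds scale, and $u_i$ is a writing-level random intercept that absorbs the dependence among observations from the same script.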

For further analysis of error types, we used multinomial logistic regression models, which allow for more than two levels in the categorical outcome, using the mclogit R package (Elff, 2021). This type of model estimates predictor variable effects on the change in the odds of the different outcomes (omission, substitution, overuse) compared with a chosen baseline outcome (no error). Thus, we could explore, for example, whether a mass noun as opposed to a count singular increases the odds of article omission versus no error.

For error type analysis, our data was split into two subsets by target article,⁸ each with different possible outcomes.⁹
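To make the baseline-category interpretation concrete, the Python sketch below converts log-odds against a "no error" baseline into outcome probabilities and shows how a coefficient acts on one odds ratio. It is illustrative only: the function name `outcome_probabilities`, the linear-predictor values, and the coefficient 0.7 are invented for the example and are not estimates from this study.

```python
import math

def outcome_probabilities(log_odds_vs_baseline):
    """Convert log-odds of each error type versus the baseline ("no error")
    into probabilities for all outcomes, baseline included."""
    # The baseline has log-odds 0 against itself by construction.
    scores = {"no error": 0.0, **log_odds_vs_baseline}
    total = sum(math.exp(v) for v in scores.values())
    return {outcome: math.exp(v) / total for outcome, v in scores.items()}

# Hypothetical linear predictors for an obligatory "a" context.
reference_case = {"omission": -2.0, "substitution": -3.0}
# Hypothetical coefficient: a predictor raises the log-odds of omission by 0.7.
shifted_case = {"omission": -2.0 + 0.7, "substitution": -3.0}

p_before = outcome_probabilities(reference_case)
p_after = outcome_probabilities(shifted_case)

# The odds of omission versus no error are multiplied by exp(0.7).
odds_ratio = (p_after["omission"] / p_after["no error"]) / (
    p_before["omission"] / p_before["no error"])
```

The coefficient changes only the omission-versus-baseline odds directly; the probabilities of the other outcomes shift as a consequence of renormalization.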

• Indefinite count singular (obligatory “a,” 1,679 observations): no error, omission of “a,” substitution (“the” instead of “a”)
• Indefinite plural and mass (obligatory Ø, 2,060 observations): no error, overuse of “the,” overuse of “a”

Model selection

When choosing the fixed-effects structure for each model, we started with a comprehensive model including all the independent variables of interest (without interactions). We then attempted interactions that were either theoretically motivated or that appeared likely from the visual examination of the data. We compared models using log-likelihood ratio tests (LRT) and excluded any interactions that did not significantly improve model fit. We did not exclude any independent variables unless they caused convergence issues or led to inadequate coefficient or standard error estimates.
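The comparison step can be sketched in Python as below. This is a generic likelihood ratio test for nested models, not the authors' R workflow; the function names, the log-likelihood values, and the 0.05 threshold in the example are assumptions made for illustration. The chi-square tail uses the standard closed-form series for integer degrees of freedom so that only the standard library is needed.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable, integer df >= 1."""
    if df < 1 or x < 0:
        raise ValueError("need df >= 1 and x >= 0")
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{r < df/2} (x/2)^r / r!
        term, total = 1.0, 1.0
        for r in range(1, df // 2):
            term *= (x / 2) / r
            total += term
        return math.exp(-x / 2) * total
    # Odd df: two-sided normal tail plus a finite correction series.
    root = math.sqrt(x)
    tail = math.erfc(root / math.sqrt(2))
    term, total = 0.0, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        term = root if r == 1 else term * x / (2 * r - 1)
        total += term
    density = math.exp(-x / 2) / math.sqrt(2 * math.pi)
    return tail + 2 * density * total

def likelihood_ratio_test(loglik_simple, loglik_complex, extra_params):
    """Return the LRT statistic and p-value for two nested models."""
    statistic = 2 * (loglik_complex - loglik_simple)
    return statistic, chi2_sf(max(statistic, 0.0), extra_params)

# Hypothetical log-likelihoods (not from the study): the larger model adds
# an interaction that costs 2 extra parameters.
stat, p_value = likelihood_ratio_test(-2510.4, -2504.1, extra_params=2)
keep_interaction = p_value < 0.05
```

Under this sketch, an interaction is retained only when the richer model improves the log-likelihood by more than chance would predict for the number of parameters it adds.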

Random effects were included in all models because each writing (wr_ID) contained multiple observations. We also included the topic (prompt) ID as a random effect to capture potential prompt effects where possible. In choosing the random effects structure, we adhered to the parsimonious approach by Bates, Kliegl, et al. (2015). They argue for removing any random-effects components that explain little (close to 0%) variation (as estimated using their random-effects principal components analysis function from the RePsychLing package in R) and that do not contribute significantly to improving model fit (as estimated by the LRT). We also excluded any random-effects components that caused convergence issues. The model selection process for each model is described in the Online Supplementary Materials.

Finally, we used the emmeans package in R (Lenth, 2022) for post hoc pairwise comparisons¹⁰ and for generating predicted values of the outcome variable.

Analysis and results

Data distribution and observed accuracy rates

Our data distribution across the main variables is presented in mosaic plots in Figure 2.

Figure 2. Observed distribution of target contexts across target, specificity, modifier, abstractness, syntactic position.

The plots reveal the following patterns:

• Top right: the three target contexts are approximately equally represented (this shows the number of contexts in which each target is expected, or would be correct, not actual article use by learners).
• Top left: definites tend to be specific—that is, refer to a certain existing (in a general sense) individual or entity, and indefinites nonspecific (especially mass/plural).
• Middle left: definites are less often premodified than indefinites, and mass nouns are not as frequently premodified as count singulars irrespective of definiteness.
• Bottom left: mass indefinites tend to be abstract, whereas in other categories abstract and concrete nouns are equally distributed.
• Bottom right: definites are more often used in subject position than indefinites.

Figure 2 demonstrates that although there are tendencies and potential correlations between semantic features and individual articles, there is no one-to-one mapping, thus confirming the learning challenge.

Figure 3 shows accuracy rate measured as the number of correct uses (including correct Ø) divided by the total number of obligatory contexts (including target Ø).¹¹ Note that L1-Chinese and L1-Russians (both [-art]) do not pattern together (see Murakami & Alexopoulou, 2016b, for similar findings). L1-Chinese learners’ scores may appear rather high. Nevertheless, they are consistently less accurate than L1-Germans. Because L1-Chinese behave very similarly to L1-Brazilians, we further analyze each NL separately without combining them into [+/-art] types.¹²
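The accuracy measure described above can be sketched in a few lines of Python. The data layout (pairs of target and produced article) and the function name `accuracy_rate` are assumptions for illustration, and the five coded items are invented rather than taken from the corpus.

```python
def accuracy_rate(observations):
    """observations: iterable of (target, produced) pairs, where each value
    is "the", "a", or "Ø". Target Ø contexts count as obligatory contexts."""
    observations = list(observations)
    if not observations:
        raise ValueError("no obligatory contexts to score")
    correct = sum(target == produced for target, produced in observations)
    return correct / len(observations)

# Hypothetical coded nominals (not the study's data).
coded = [
    ("the", "the"),  # correct "the"
    ("a", "Ø"),      # omission of "a"
    ("a", "the"),    # substitution of "the" for "a"
    ("Ø", "Ø"),      # correct Ø counts as a correct use
    ("Ø", "a"),      # overuse of "a"
]
rate = accuracy_rate(coded)
```

Counting target Ø contexts in both numerator and denominator means that overuse errors lower the score just as omissions and substitutions do.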

Figure 3. Development across EF levels.

The distribution of error types is detailed in Table 3 and Figure 4.

Table 3. Error-type distribution

* 16% of all observations.

Figure 4. Error-type distribution across NL, target article, and noun type. Numbers represent instances.

In obligatory article contexts, 81% of errors are omissions (see Figure 1 for examples). In target Ø contexts, overuse of “a” and “the” is mostly equal for mass nouns, whereas “the” overuse is more common in plurals. Generally, the patterns are similar across NLs, with two exceptions: (a) the use of “a” (a_sing), where Germans have a higher proportion of substitutions than others, and (b) the use of Ø with mass nouns (zero_mass), where Germans overuse “the” more often than “a,” whereas other NLs overuse both equally, but the overall error count for L1-Germans is small.

Predictors of accuracy

The accuracy model (Table 4) reveals significant effects of NL, proficiency level (interacting with NL), specificity, and modifier, which vary by target article and noun type, and also effects of syntactic position and abstractness¹³ (full results in Table 5).

Table 4. Accuracy model formula

Table 5. Accuracy model results

Note. Reference levels: “Russian” for NL, “plural” for Ntype, “object” for syntactic position (synt).

* p < .05; ** p < .01; *** p < .001.

NL in interaction with proficiency level

Figure 5 illustrates the effect of NL across definiteness and noun type (averaged across proficiency levels). The interaction with definiteness and noun type stems from the fact that the target is different across combinations of variable levels (“the” for all definites, “a” for count singular indefinites, Ø for mass/plural indefinites). Thus, the top three facets and the bottom-left facet of Figure 5 reflect the rate of suppliance of “the” and “a,” whereas the bottom-middle and right facets show accuracy rates in Ø contexts, where any errors would be overuse.

Figure 5. The effect of NL across definiteness and noun type.¹⁴

All but L1-German learners are significantly less accurate with “a” than with “the” in singular contexts (L1-Brazilian and L1-Chinese: p < .001; L1-Russian: p = .035).

The main NL effect concerns the significantly lower accuracy of L1-Russians in obligatory “the” and “a” contexts, which drops further in definite mass (17) and plural (18) contexts, showing a sensitivity to noun type not observed in the other NLs. The difference between L1-Brazilians and L1-Germans in plural definites also reaches statistical significance at p = .044; however, L1-Brazilians’ accuracy is still quite high. Note the comparatively lower number of mass/plural definite observations.

Another NL effect is observed within singular indefinites (target “a”), where L1-Brazilian, L1-Chinese, and L1-Russian learners are all predicted to be less accurate than L1-Germans, and L1-Russians are also significantly less accurate than L1-Brazilians (19–21).

In target Ø contexts, there are two possibilities: (a) learners may tend to omit articles across the board and happen to be correct with mass or plural indefinites (coincidentally correct use) or (b) learners may be aware that mass/plural indefinites require Ø (genuinely correct use). Because the corpus provides performance data only, we cannot confidently distinguish between the two. We hypothesize (a) is more likely for L1-Russians, considering their overwhelming tendency to omit articles elsewhere.

Proficiency level interacts with NL, definiteness, and noun type (Figure 6) in that it has a significant effect only in indefinite singulars (target “a”), where L1-Russians’ accuracy increases significantly (p < .001) and L1-Chinese learners’ accuracy declines (p = .023), and the two slopes differ significantly from each other (p < .001). Note that the estimates for mass/plural definites are rather unreliable, with large confidence intervals.

Figure 6. Proficiency level by NL across definiteness and noun type.

Specificity

Specificity, as defined in this study, affects only two contexts (Figure 7, left): (a) indefinite singulars (target “a”), where accuracy is significantly lower for nonspecific (22) than for specific (23) referents; (b) indefinite mass (target Ø), where the effect is the opposite, with significantly higher accuracy for nonspecific (24) than for specific (25) referents.

Figure 7. The effect of specificity (left) and modifier presence (right) across definiteness and noun type.

Modifier presence

A prenominal modifier (Figure 7, right) negatively affects accuracy in singular definites, where it increases “the” omission (26), and in mass indefinites (target Ø), where it increases article overuse (27).

Syntactic position

Errors are significantly more likely in subject (28) and object (29) positions (both at 86% predicted accuracy) than in predicate (30) position (91% predicted accuracy) at p = .02 and p = .002 (Tukey adjusted), respectively (Figure 8, note the scale starts at 60%). The 95% predicted accuracy in existential position (31) is not significantly higher than that in subject or object positions due to a larger confidence interval.

Figure 8. The effect of syntactic position.

Predictors of error type

Count singular indefinites

The error-type model for count singular indefinites (target “a”; Table 6) confirms the significant interaction between NL and proficiency level (Figure 9) and the significant effect of syntactic position (similar pattern to that in the accuracy model) while revealing that the differences in accuracy rates are driven by omission errors, with low substitution error rates across NLs. Additionally, we find significant interactions between specificity and NL, specificity and modifier, and specificity and abstractness¹⁸ (Table 7). The model predicts an 82.5% probability of correct “a” suppliance, 13.5% omission, and 4% substitution (averaged across other variables).

Table 6. Error-type model formula for count singular indefinites

Figure 9. The effect of NL alone (top) and in interaction with level (bottom) on predicted probabilities of error types in count singular indefinites.

Table 7. Error-type model for count singular indefinites results

Note. omit~ estimates for omission errors vs. correct; sub~ substitution errors vs. correct. Reference levels: “Russian” for NL, “object” for syntactic position (synt).

* p < .05; ** p < .01; *** p < .001.

Interaction between NL and specificity. The model reveals that only in nonspecific contexts (i.e., those not referring to certain existing entities) do all NLs show significantly lower accuracy than L1-Germans (Figure 10). Meanwhile, in specific contexts only L1-Russians appear to be behind, although L1-Chinese also demonstrate significantly higher “a” omission than L1-Germans. In other words, in indefinite count singulars specificity has no effect on L1-Germans, who are at ceiling, or on L1-Russians, who tend to omit “a” regardless of specificity. However, L1-Brazilians (p = .006) and L1-Chinese (p = .023) omit “a” significantly more often with nonspecific referents.

Figure 10. The effect of specificity in interaction with NL on predicted probabilities of error types in count singular indefinites.

Interaction between specificity and modifier. There is a significant effect of specificity, as defined in this study, in non-premodified nouns, with the predicted probability of omitting “a” dropping to 8% for specific nouns as opposed to 17% for nonspecific ones (22; Figure 11, left). The trend in premodified nouns (Figure 11, right) is the same, but the difference, at only 5 percentage points, becomes statistically nonsignificant.
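The 8% and 17% figures are read here as values on the probability scale, as in Figure 11. The sketch below shows how such probabilities relate to odds. It uses only the two percentages quoted above; the helper name is hypothetical, and the resulting odds ratio is a back-of-the-envelope illustration, not a coefficient reported in Table 7.

```python
def odds(p: float) -> float:
    """Convert a probability into odds, p / (1 - p)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("probability must be in [0, 1)")
    return p / (1.0 - p)


p_specific = 0.08      # predicted omission, specific, non-premodified
p_nonspecific = 0.17   # predicted omission, nonspecific, non-premodified

gap_points = (p_nonspecific - p_specific) * 100      # percentage points
odds_ratio = odds(p_specific) / odds(p_nonspecific)  # specific vs. nonspecific

print(f"difference: {gap_points:.0f} percentage points")
print(f"odds: {odds(p_specific):.3f} vs. {odds(p_nonspecific):.3f}")
print(f"odds ratio: {odds_ratio:.2f}")
```

Under these figures, the odds of omission with specific referents are somewhat less than half of those with nonspecific referents, although the gap on the probability scale is 9 percentage points.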

Figure 11. The effect of specificity in interaction with modifier on predicted probabilities of error types in count singular indefinites.

Mass indefinites¹⁹

The model (Table 8) predicts 82% probability for correct Ø use, 11% “a” overuse, 7% “the” overuse (averaged across other variables). This is slightly lower than the predicted accuracy rate from the accuracy model, which was above 85% (full results in Table 9).
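Because the error-type model has three outcome categories, its predicted probabilities must sum to 100%, as the 82%, 11%, and 7% above do. The sketch below illustrates the general baseline-category logit relationship between log-odds against “correct” and category probabilities. The log-odds used here are back-calculated from the three averaged percentages purely for illustration; they are not the coefficients in Table 9, and the function name is an assumption.

```python
import math
from typing import Dict


def category_probabilities(log_odds_vs_correct: Dict[str, float]) -> Dict[str, float]:
    """Turn log-odds of each error type vs. "correct" into probabilities.

    "correct" is the baseline category, with a linear predictor of 0.
    """
    exp_terms = {name: math.exp(value) for name, value in log_odds_vs_correct.items()}
    denom = 1.0 + sum(exp_terms.values())
    probs = {"correct": 1.0 / denom}
    probs.update({name: term / denom for name, term in exp_terms.items()})
    return probs


# Averaged predicted probabilities quoted in the text.
reported = {"correct": 0.82, "over_a": 0.11, "over_the": 0.07}

# Illustrative log-odds of each error type against "correct".
log_odds = {
    name: math.log(p / reported["correct"])
    for name, p in reported.items()
    if name != "correct"
}

recovered = category_probabilities(log_odds)
print({name: round(p, 2) for name, p in recovered.items()})
print(round(sum(recovered.values()), 10))  # probabilities sum to 1
```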

Table 8. Final model formula for error-type model for indefinite mass nouns

Table 9. Error-type model for indefinite mass nouns results

Note. over_a~ estimates for overuse of “a” errors vs. correct; over_the~ overuse of “the” errors vs. correct. Reference level: “Russian” for NL.

* p < .05; ** p < .01; *** p < .001.

The model confirms that the only two significant variables in this context are specificity and modifier presence (Figure 12). Both effects are driven by the rate of “a” overuse: learners tend to overuse “a” more often before specific (25) and before premodified nominals (32).

Figure 12. The effects of specificity (left) and modifier presence (right) on predicted probabilities of error types in mass indefinites.

Discussion

Summary of findings

Using manually coded learner corpus data and statistical modeling, we have revealed that the main factors affecting article accuracy are involved in several complex interactions—that is, their effects are not uniform across contexts. The most important findings are the following:

1. Clear L1 effect: article accuracy is generally higher in [+art] than in [-art] learners, although some NL-groups do not perform as expected (notably, L1-Chinese), and the effect varies across contexts (definite/indefinite, singular/mass/plural).
2. Specificity effect (defined as reference to a certain existing entity): only affects indefinite contexts (target “a”/Ø), where “a” is generally more likely to appear with specific referents.
3. Prenominal modifier effect: distinct in definites versus indefinites—namely, modifiers increase “the” omission with definite singulars but increase “a” overuse with indefinite mass nominals (target Ø). There is, however, no modifier effect in indefinite singulars (target “a”).
4. Syntactic position effect: higher accuracy in existential and predicate as opposed to subject and object positions.

We will discuss each finding in more detail, combining the intricately related Findings 2 and 3 into one subsection.

NL in interaction with proficiency, definiteness, number, and countability

The higher accuracy of [+art] learners in our study largely confirms previous findings (Ionin et al., 2008; Murakami & Alexopoulou, 2016a; Snape, 2008). As expected (Lardiere, 2004), all learners are more accurate in using “the” than “a,” except L1-Germans, who are at ceiling for both. However, a third of all nominals used by learners require Ø (these are mass and plural indefinites). Considering that omission is the most common error, this explains the ~90% accuracy in Ø contexts across all NLs, including even L1-Russians (in stark contrast with their 55% accuracy in mass definites).

The results of L1-Chinese learners require additional explanation. First, their accuracy seems rather high (~80%), although studies involving this population have demonstrated accuracy rates > 70% on article gap-fill tasks (Snape, 2009; Ting, 2005; Trenkic, 2008; Zhao & MacWhinney, 2018).²⁰ As for comparable production data, two studies using a picture-story task showed relatively high article suppliance rates in count singulars: 70% overall in 13 out of 15 lower to higher intermediate L1-Chinese learners (Goad & White, 2008), and 98% for definites and 89% for indefinites (only non-premodified) in 15 upper-intermediate learners (Snape, 2009). Considering that our data comprises writings produced offline as homework, the higher accuracy rates are not unexpected.

What is surprising is that [-art] L1-Chinese learners pattern with [+art] L1-Brazilians and not with [-art] L1-Russians. There are proposals that Mandarin Chinese is developing a definite article, which may even assume a functional projection (Cheng et al., 2017; Huang, 1999), as well as an indefinite article (Chen, 2003). Cultural differences may also be at play; for example, L1-Chinese learners may be more performance driven than L1-Russians. Finally, it is unclear why the accuracy of L1-Chinese learners decreases with proficiency (unlike the other NLs), particularly with count singular indefinites (target “a”). More data from higher level learners might clarify whether this is a true and significant decline or part of a fluctuating or U-shaped curve.

Specificity and modifier in interaction with definiteness, number, and countability

Definites

As noted in Findings 2 and 3, definites are not affected by specificity, but there is increased “the” omission with premodified count singulars; that is, “the” is more likely omitted in “the advertising company” than in “the company.” Trenkic (2007), based on similar findings from L1-Serbians, suggests the article is structurally an adjective for learners, making it optional. So, when a modifier has already sufficiently narrowed down the range of potential referents, an article may be redundant. As we can see in our own data, in many cases the modifier leaves only one plausible referent option (e.g., “the departure lounge of Oslo airport,” “the following recipe,” “the top score”). We could adopt Trenkic’s cognitive explanation, which suggests increased omission of redundant elements when cognitive resources are limited. However, we still need to explain why in our data this redundancy effect is found in definites but not in indefinites, which we will address in the following subsection.

Two findings remain unclear. First, definite mass and plural contexts are unaffected by modifier presence: “the” omission is not increased before premodified mass nouns or plurals (e.g., “the red wine”/“the new shoes”). Second, L1-Russians have considerably lower accuracy in mass and plural contexts (predicted 55% and 66%) than in count singulars (82%). Austin et al. (2015) also observe a higher “the” omission rate with plurals than with singulars in 20 intermediate L1-Thai learners (prompted story recall). They attribute this to L1–L2 structural competition, which predicts that cognitively more demanding contexts, such as those requiring the suppliance of multiple functional morphemes (e.g., “the” and plural “-s” in plural definites), impede the suppression of competing L1 forms (i.e., bare plural definites). However, this only explains the higher omission in definite plurals but not in mass nominals. We cannot fully explain these patterns, which might also be rather uncertain due to lower numbers in these contexts (256 mass, 335 plural) and larger standard errors.

Count singular indefinites

The first question is how and why count singulars (target “a”) are significantly affected by specificity. Essentially, “a” is more consistently supplied with specific referents but more often omitted with nonspecific ones. We claim learners may associate “a” with the function of introducing a certain existing referent (i.e., specific, by our definition) into the discourse. In contrast, in nonspecific contexts, where “a” is not introducing an existing referent (as there is none), the semantic contribution of “a” may be unclear to learners.

However, the effect is only significant for L1-Brazilians and L1-Chinese (Figure 10; L1-Germans are at ceiling, and L1-Russians’ predicted accuracy rate is at 73% regardless of specificity). We suggest L1-Brazilians may draw on the indefinite article in their L1, especially in specific contexts. This could be because the use of bare singulars in argument position (which is allowed in Brazilian Portuguese) is more restricted in specific contexts²¹ (Ferreira, 2021). We could also argue that L1-Chinese learners benefit from an emerging indefinite article in their L1 (numeral “yi” meaning “one”), which is also more common in specific than in nonspecific contexts (Chen, 2003, pp. 1159–1160). In this case, L1-Russians are the only ones with nothing to rely on in their L1.

The second question is why there is no modifier effect in count singular indefinites (Finding 3). The only slight influence of modifier presence is that the effect of specificity described above becomes nonsignificant in premodified contexts. To illustrate, consider the non-premodified example (33), where less omission is predicted because the speaker has a specific question in mind (as opposed to a context where “a question” does not refer to any specific question). When a nominal is premodified, however (34), this specificity effect becomes statistically nonsignificant. Nevertheless, the pattern is in the same direction (Figure 11, right), so the tendency is similar, albeit smaller.

Returning to the question, we need to explain why the modifier does not appear to make “a” redundant in the same fashion as it can make “the” redundant. If we assume, as suggested above, that learners associate “a” with the function of introducing a specific (existing) referent, we have to admit that a modifier cannot fulfil this function. There is also no evidence that learners use “a” to signal referent identifiability, which is the function of “the,” as there are few substitution errors. Therefore, although a modifier can narrow the range of possible referents, it may still only indicate a type (e.g., “We are seeking an experienced analyst” as opposed to “any analyst”), if we accept that learners do not consider “analyst” identifiable in the first place.²²

This is unlike the findings in Trenkic (2007), whose [-art] L1-Serbian participants tended to omit both “the” and “a” with premodified nominals. The discrepancy is partly explained by the different task types. Trenkic used an oral information gap task (map completion) and a written task asking participants to translate as many stories as they could within the time limit, ensuring less reliance on metalinguistic knowledge. These online tasks revealed higher omission rates than the tasks in our corpus, which were untimed and unsupervised. Nevertheless, in Trenkic’s written task, the modifier effect was overall more pronounced in definite than in indefinite contexts, which is in the same direction as our pattern of a significant (but smaller) effect in definites and no significant effect in indefinites.

Note also that our results clearly differ from those in Ionin et al. (2004) and replications, as we detect specificity effects in both [+art] and [-art] learners and we observe few substitution errors. This is partly due to the differences in defining specificity (see Footnote 3) and partly due to the different types of data.

Mass indefinites

In mass indefinites (target Ø), learners overuse “a” more often both with specific referents and with premodified nominals. We argue that this is consistent with our explanation for count singular indefinites above. If learners use “a” to introduce a certain existing referent, they would not use “a” with most mass nouns, which typically denote unbounded or vaguely defined entities. However, when a mass noun is used to refer to something specific, it will often refer to a portion or an instance of the entity, and learners might be using “a” to indicate this (35–38). A prenominal modifier can additionally specify a subclass or a type of entity, which is arguably more likely to occur when a specific portion or instance is referred to.²³

Syntactic position

The higher article accuracy in existentials and predicates as opposed to subject and object positions is broadly in line with the literature (Hua & Lee, 2005, only for nonspecific contexts). One possible explanation is that the discourse and semantic properties of existential and predicate constructions are almost fixed regardless of the noun inserted: stating existence and denoting properties, respectively. They are also explicitly taught early on and may first be learned as formulaic sequences (e.g., “There is a book on the table,” “I am a student”).

In contrast, the use of nominals in subject and object positions is much more varied, so it is difficult for learners to infer any patterns of distribution, as in the case of existentials and predicates.

Conclusion

Our study has revealed previously unnoticed and complex interactions between specificity, modifier presence, definiteness, number/countability, and L1 (NL). Overall, our data points to a semantic interpretation of articles by learners (except L1-Germans), broadly in line with Trenkic (2007). We conclude that learners associate “the” with definiteness (in the sense of an identifiable discourse referent) and “a” with introducing a specific (i.e., existing and certain) referent into the discourse.

The practical implications for learning and teaching are mainly around focusing learners’ attention on the structural features of the indefinite article (number and countability) rather than semantic features (specificity). Crucially, learners’ ability to use “a” correctly may depend on their understanding of countability in English.

An important limitation of this study is that the writings in the corpus were completed offline as homework, which implies preparation and the possibility to edit responses. As a result, the observed accuracy rates are probably overestimated and could be considerably lower in spontaneous, timed, or unprepared (written or oral) production.

Further research would benefit from extending this analysis to other [-art] L1s to ensure the observed patterns are not specific to L1-Russians (as L1-Chinese behaved similarly to [+art] groups). Larger sample sizes would improve the ability to detect effects of such cumbersome multilevel variables as syntactic position, especially in contexts where error rates are already low (e.g., plural definites or mass/plural indefinites).

Supplementary material

The supplementary material for this article can be found at http://doi.org/10.1017/S0272263123000463.

Data Availability Statement

The experiment in this article earned an Open Data badge for transparent practices. The data is available at https://www.iris-database.org/details/42UaQ-jWDMt.

Competing interest

The authors declare none.

Footnotes

¹ We remain agnostic regarding the representation of Ø in speakers’ minds (if any). We use “Ø” to mean “no article.”

² Discourse functions were initially investigated, but the results were inconclusive and are not discussed here; we nevertheless acknowledge that they play a role in learner article use, as shown in previous research (details in the Online Supplementary Materials).

³ Ionin et al. defined specificity differently, as speaker’s intent to refer to something noteworthy. For example, in “He got lots of gifts—books, toys. And best of all—he got a puppy!,” “a puppy” is nonspecific because its identity is “irrelevant for the discourse” (2004, p. 23). Previously discussed authors (Butler, 2002; Huebner, 1985; Thomas, 1989, and others) would consider this example specific, as it refers to a certain existing puppy.

⁴ In all examples the original spelling and grammar are kept. Nominals of interest are italicized. Article corrections are given in square brackets. Erroneous articles are struck through.

⁵ Because the writings were completed offline (as homework) with access to resources, learners’ accuracy tends to be higher on EFCAMDAT when compared with an exam learner corpus, such as Cambridge Learner Corpus (CLC). This means learners may use external help; however, these tasks are very low stakes, and learners would not benefit in any way from submitting perfect answers. Indeed, the accuracy patterns for different morphemes are similar in EFCAMDAT and CLC across proficiency levels (see comparison in the Online Supplementary Materials).

⁶ We excluded lower levels, where the writings are mostly formulaic, and higher levels, where there is generally less data in the corpus.

⁷ Other syntactic positions were excluded because they constituted less than 10% of the data (temporal modifiers, e.g., “last week”; appositives, e.g., “Tom, the leading man”; genitives, e.g., “people’s attitudes”).

⁸ We intended to analyze three subsets, but the model on definites did not converge, most probably due to the uneven error distribution across noun types (only five substitution errors in mass, no substitution errors in plural contexts) and the fact that most errors were made by L1-Russians and L1-Chinese. The model converged without L1-Germans; however, adding any interactions led to more convergence issues, whereas the pseudo-R² for the no-interaction model was only .017. Thus, we are not reporting this model here.

⁹ The split created difficulties with random effects structures: we could only include random intercepts by wr_ID, as adding any random slopes resulted in singular fits.

¹⁰ emmeans automatically applies the Tukey adjustment method when comparing families of > 2 estimates.

¹¹ This is unlike target language use (TLU) often used in the literature, which does not include correct Ø contexts. See comparison in the Online Supplementary Materials.

¹² A reviewer suggested it might still be worth comparing [+art] and [-art] groups and including NL as a random effect in the models. However, the minimum number of levels required to obtain a reasonable random effect estimation is 5–6 (Bolker, 2022), whereas NL has only 4. Indeed, an attempted model with NL as a random effect did not converge. Had L1-Chinese performed similarly to L1-Russians, we might have grouped learners according to [+/-art] and ignored the NL subdivision. However, as L1-Chinese appeared to pattern with L1-Brazilians, grouping them with L1-Russians would make little sense.

¹³ The effect of abstractness, although statistically significant (p = .024), is very small (89% predicted accuracy for concrete nouns, 95% CI [85%, 92%], and 91% predicted accuracy for abstract nouns, 95% CI [88%, 93%]) and, thus, not discussed further.

¹⁴ We have labeled y-axes “Predicted accuracy rate” throughout for ease of interpretation, although the numbers in fact represent predicted probabilities of correct article (non)use in a single instance, which is conceptually the same. In all figures, error bars represent 95% confidence intervals.

¹⁵ VIFs (variance inflation factors) are indicators of multicollinearity: <5 low, 5–10 moderate, >10 strong collinearity to be avoided (James et al., 2013).

¹⁶ Overdispersion was checked using the performance package in R (Lüdecke et al., 2021).

¹⁷ For all corpus examples, we provide learner’s L1, CEFR level, and wr_ID.

¹⁸ The interaction between specificity and abstractness is not discussed further because of its weakness: the effect of specificity is only significant in concrete nouns, but the trend is the same for abstract nouns and is, in fact, approaching significance (p = .06).

¹⁹ Fitting the model on both mass and plural indefinites produced negative pseudo-R² values (Nagelkerke, 1991). The same problem occurred when fitting a separate model on plurals. Therefore, we fitted the model on mass nouns only (n = 878).

²⁰ The only exception is the 63% accuracy rate in specific definite contexts with explicit denial of speaker familiarity with the referent in Trenkic (2008).

²¹ For example, bare singulars are ungrammatical in subject position of episodic predicates, where the referent is often specific, e.g., “*Cachorro está latindo” (“A dog is barking”). Moreover, bare singulars in episodic sentences can be interpreted as number neutral rather than necessarily singular, e.g., “Maria comeu maçã” (“Maria ate (an/some) apple”; examples from Ferreira, 2021).

²² We could also speculate that definite contexts are cognitively more demanding than indefinite ones due to the need to keep track of the discourse, but this would need to be confirmed in an online experiment.

²³ In fact, > 40% of specific mass indefinites have prenominal modifiers in our data, whereas of nonspecific mass indefinites, only ~ 25% are premodified.

References

Alexopoulou, T., Geertzen, J., Korhonen, A., & Meurers, D. (2015). Exploring big educational learner corpora for SLA research: Perspectives on relative clauses. International Journal of Learner Corpus Research, 1, 96–129.

Alexopoulou, T., Michel, M., Murakami, A., & Meurers, D. (2017). Task effects on linguistic complexity and accuracy: A large-scale learner corpus analysis employing natural language processing techniques. Language Learning, 67, 180–208.

Austin, G., Pongpairoj, N., & Trenkic, D. (2015). Structural competition in second language production: Towards a constraint-satisfaction model. Language Learning, 65, 689–722.

Bates, D., Kliegl, R., Vasishth, S., & Baayen, H. (2015). Parsimonious mixed models. arXiv. https://arxiv.org/pdf/1506.04967v1.pdf

Bates, D., Mächler, M., Bolker, B. M., & Walker, S. C. (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67, 1–48.

Bickerton, D. (1981). Roots of language. Karoma.

Bolker, B. (2022). GLMM FAQ. http://bbolker.github.io/mixedmodels-misc/glmmFAQ.html

Butler, Y. G. (2002). Second language learners’ theories on the use of English articles. Studies in Second Language Acquisition, 24, 451–480.

Chen, P. (2003). Indefinite determiner introducing definite referent: A special use of “yi ‘one’ + classifier” in Chinese. Lingua, 113, 1169–1184.

Cheng, L. L.-S., Heycock, C., & Zamparelli, R. (2017). Two levels for definiteness. In Erlewine, M. Y. (Ed.), Proceedings of GLOW in Asia XI (Vol. 1, pp. 79–93). MIT Working Papers in Linguistics.

DeKeyser, R. M. (2005). What makes learning second-language grammar difficult? A review of issues. Language Learning, 55, 1–25.

Dryer, M. S. (2013). Definite articles. In Dryer, M. S. & Haspelmath, M. (Eds.), The world atlas of language structures online (chap. 37). Max Planck Institute for Evolutionary Anthropology. http://wals.info/chapter/37

Elff, M. (2021). mclogit: Multinomial logit models, with or without random effects or overdispersion [Computer software] (0.8.7.3). R Foundation for Statistical Computing.

Ferreira, M. (2021). Bare nominals in Brazilian Portuguese. In Cabredo Hofherr, P. & Doetjes, J. (Eds.), The Oxford handbook of grammatical number (pp. 497–521). Oxford University Press.

Frege, G. (1960). On sense and reference. In Geach, P. & Black, M. (Eds.) & Black, M. (Trans.), Translations from the philosophical writings of Gottlob Frege (pp. 56–78). Blackwell.

García Mayo, M. del P. (2009). Article choice in L2 English by Spanish speakers: Evidence for full transfer. In García Mayo, M. del P. & Hawkins, R. (Eds.), Second language acquisition of articles: Empirical findings and theoretical implications (pp. 13–35). John Benjamins Publishing.

Geertzen, J., Alexopoulou, T., & Korhonen, A. (2013). Automatic linguistic annotation of large scale L2 databases: The EF-Cambridge Open Language Database (EFCamDat). In Miller, R. T., Martin, K. I., Eddington, C. M., Henery, A., Miguel, N. M., Tseng, A. M., Tuninetti, A., & Walter, D. (Eds.), Selected proceedings from the 31st Second Language Research Forum (pp. 240–254). Cascadilla Proceedings Project.

Goad, H., & White, L. (2008). Prosodic structure and the representation of L2 functional morphology: A nativist approach. Lingua, 118, 577–594.

Hawkins, J. A. (1978). Definiteness and indefiniteness. Croom Helm.

Hawkins, J. A. (1991). On (in)definite articles: Implicatures and (un)grammaticality prediction. Journal of Linguistics, 27, 405–442.

Hawkins, R., Al-Eid, S., Almahboob, I., Athanasopoulos, P., Chaengchenkit, R., Hu, J., Rezai, M., Jaensch, C., Jeon, Y., Jiang, A., Leung, Y. I., Matsunaga, K., Ortega, M., Sarko, G., Snape, N., & Velasco-Zárate, K. (2006). Accounting for English article interpretation by L2 speakers. EUROSLA Yearbook, 6, 7–25.CrossRef Google Scholar

Heim, I. (2019). Definiteness and indefiniteness. In Portner, P., Heusinger, K., & Maienborn, C. (Eds.), Semantics: Noun phrases and verb phrases (pp. 33–69). De Gruyter Mouton.CrossRef Google Scholar

Hua, D., & Lee, T. H. (2005). Chinese ESL learners’ understanding of the English count-mass distinction. In Dekydtspotter, L., Sprouse, R. A., & Liljestrand, A. (Eds.), Proceedings of the 7th Generative Approaches to Second Language Acquisition Conference (138–149). Cascadilla Proceedings Project.Google Scholar

Huang, S. (1999). The emergence of a grammatical category definite article in spoken Chinese. Journal of Pragmatics, 31, 77–94.CrossRef Google Scholar

Huebner, T. (1983). A longitudinal analysis of ‘the’ acquisition of English. Karoma.Google Scholar

Huebner, T. (1985). System and variability in interlanguage syntax. Language Learning, 35, 141–163.CrossRef Google Scholar

Ionin, T., Ko, H., & Wexler, K. (2004). Article semantics in L2 acquisition: The role of specificity. Language Acquisition, 12, 3–69.CrossRef Google Scholar

Ionin, T., Zubizarreta, M. L., & Maldonado, S. B. (2008). Sources of linguistic knowledge in the second language acquisition of English articles. Lingua, 118, 554–576.

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning. Springer.

Lardiere, D. (2009). Some thoughts on the contrastive analysis of features in second language acquisition. Second Language Research, 25, 173–227.

Lardiere, D. (2004). Knowledge of definiteness despite variable article omission. In Brugos, A., Micciulla, L., & Smith, C. E. (Eds.), BUCLD 28 proceedings (pp. 328–339). Cascadilla Press.

Leech, G. N., Rayson, P., & Wilson, A. (2001). Word frequencies in written and spoken English: Based on the British National Corpus. Longman.

Lenth, R. V. (2022). emmeans: Estimated marginal means, aka least-squares means [Computer software] (R package version 1.7.4-1). R Foundation for Statistical Computing.

Li, C. N., & Thompson, S. A. (1981). Mandarin Chinese: A functional reference grammar. University of California Press.

Liu, D., & Gleason, J. L. (2002). Acquisition of the article “the” by nonnative speakers of English: An analysis of four nongeneric uses. Studies in Second Language Acquisition, 24, 1–26.

Lüdecke, D., Ben-Shachar, M. S., Patil, I., Waggoner, P., & Makowski, D. (2021). performance: An R package for assessment, comparison and testing of statistical models. Journal of Open Source Software, 6, 31–39.

Lyons, C. (1999). Definiteness. Cambridge University Press.

Lyons, C. (2009). Defining definiteness. In Lyons, C., Definiteness (pp. 253–281). Cambridge University Press.

Master, P. (1987). A cross-linguistic interlanguage analysis of the acquisition of the English article system. UCLA.

Master, P. (1997). The English article system: Acquisition, function, and pedagogy. System, 25, 215–232.

McHugh, M. L. (2012). Interrater reliability: The kappa statistic. Biochemia Medica, 22, 276–282.

Michel, M., Murakami, A., Alexopoulou, T., & Meurers, D. (2019). Effects of task type on morphosyntactic complexity across proficiency: Evidence from a large learner corpus of A1 to C2 writings. Instructed Second Language Acquisition, 3, 124–152.

Murakami, A. (2013). Individual variation and the role of L1 in the L2 development of English grammatical morphemes: Insights from learner corpora. University of Cambridge.

Murakami, A., & Alexopoulou, T. (2016a). L1 influence on the acquisition order of English grammatical morphemes: A learner corpus study. Studies in Second Language Acquisition, 38, 365–401.

Murakami, A., & Alexopoulou, T. (2016b). Longitudinal L2 development of the English article in individual learners. In Denison, S., Mack, M., Xu, Y., & Armstrong, B. C. (Eds.), Proceedings of the 38th Annual Meeting of the Cognitive Science Society (pp. 1050–1055). Cognitive Science Society.

Myles, F. (2012). Complexity, accuracy and fluency: The role played by formulaic sequences in early interlanguage development. In Housen, A., Kuiken, F., & Vedder, I. (Eds.), Dimensions of L2 performance and proficiency: Complexity, accuracy and fluency in SLA (pp. 71–94). John Benjamins Publishing.

Nagelkerke, N. J. (1991). A note on a general definition of the coefficient of determination. Biometrika, 78, 691–692.

R Core Team. (2021). R: A language and environment for statistical computing. R Foundation for Statistical Computing.

Reid, J., Battaglia, P., Schuldt, M., Narita, E., Mochizuki, M., & Snape, N. (2006). The article choice of learners of English as a second language. University of Essex.

Robertson, D. (2000). Variability in the use of the English article system by Chinese learners of English. Second Language Research, 16, 135–172.

Russell, B. (1905). On denoting. Mind, 14, 479–493.

Scott, G. G., Keitel, A., Becirspahic, M., Yao, B., & Sereno, S. C. (2019). The Glasgow Norms: Ratings of 5,500 words on nine scales. Behavior Research Methods, 51, 1258–1270.

Shatz, I. (2020). Lessons from creating a new corpus from an existing large-scale English learner language database. International Journal of Learner Corpus Research, 6, 220–236.

Snape, N. (2008). Resetting the nominal mapping parameter in L2 English: Definite article use and the count–mass distinction. Bilingualism: Language and Cognition, 11, 63–79.

Snape, N. (2009). Exploring Mandarin Chinese speakers’ L2 article use. In Snape, N., Leung, Y. I., & Sharwood Smith, M. (Eds.), Representational deficits in SLA: Studies in honor of Roger Hawkins (pp. 27–52). John Benjamins Publishing.

Snape, N., Leung, Y. I., & Ting, H.-C. (2006). Comparing Chinese, Japanese and Spanish speakers in L2 English article acquisition: Evidence against the fluctuation hypothesis? In O’Brien, M. G., Shea, C., & Archibald, J. (Eds.), Proceedings of the 8th Generative Approaches to Second Language Acquisition Conference (GASLA 2006) (pp. 132–139). Cascadilla Proceedings Project.

Thomas, M. (1989). The acquisition of English articles by first- and second-language learners. Applied Psycholinguistics, 10, 335–355.

Ting, H.-C. (2005). The acquisition of articles in L2 English by L1 Chinese and L1 Spanish speakers. University of Essex.

Trenkic, D. (2002). Form–meaning connections in the acquisition of English articles. EUROSLA Yearbook, 2, 115–133.

Trenkic, D. (2007). Variability in second language article production: Beyond the representational deficit vs. processing constraints debate. Second Language Research, 23, 289–327.

Trenkic, D. (2008). The representation of English articles in second language grammars: Determiners or adjectives? Bilingualism: Language and Cognition, 11, 1–18. https://doi.org/10.1017/S1366728907003185

von Heusinger, K. (2019). Specificity. In Portner, P., Heusinger, K., & Maienborn, C. (Eds.), Semantics: Noun phrases and verb phrases (pp. 70–111). De Gruyter Mouton.

White, L. (2003). Fossilization in steady state L2 grammars: Persistent problems with inflectional morphology. Bilingualism: Language and Cognition, 6, 129–141.

Zhao, H., & MacWhinney, B. (2018). The instructed learning of form–function mappings in the English article system. Modern Language Journal, 102, 99–119.

Figure 1. Example scripts with article errors marked.

Table 1. Variable coding

Table 2. Distribution of nominals retrieved

Figure 2. Observed distribution of target contexts across target, specificity, modifier, abstractness, syntactic position.

Figure 3. Development across EF levels.

Table 3. Error-type distribution

Figure 4. Error-type distribution across NL, target article, and noun type. Numbers represent instances.

Table 4. Accuracy model formula

Table 5. Accuracy model results

Figure 5. The effect of NL across definiteness and noun type (see footnote 14).

Figure 6. Proficiency level by NL across definiteness and noun type.

Figure 7. The effect of specificity (left) and modifier presence (right) across definiteness and noun type.

Figure 8. The effect of syntactic position.

Table 6. Error-type model formula for count singular indefinites

Figure 9. The effect of NL alone (top) and in interaction with level (bottom) on predicted probabilities of error types in count singular indefinites.

Table 7. Error-type model for count singular indefinites results

Figure 10. The effect of specificity in interaction with NL on predicted probabilities of error types in count singular indefinites.

Figure 11. The effect of specificity in interaction with modifier on predicted probabilities of error types in count singular indefinites.

Table 8. Final model formula for error-type model for indefinite mass nouns

Table 9. Error-type model for indefinite mass nouns results

Figure 12. The effects of specificity (left) and modifier presence (right) on predicted probabilities of error types in mass indefinites.

Derkach and Alexopoulou supplementary material

File 696.8 KB
