Introduction
Recent writing and assessment research has increasingly attended to the role of cohesion in the quality of second language (L2) written production and to how cohesion can be measured validly and reliably (Crossley et al., 2016b, 2019; Zhang et al., 2022). Cohesion is generally understood as an objective property of the explicit text that encompasses the linguistic features and devices used to connect different ideas in and parts of a text; it differs from but is closely related to coherence, which refers to the overall level of connectedness, including logic, unity, and comprehensibility, of a text that is evident to its readers (Graesser et al., 2004; Halliday & Hasan, 1976). As detailed in the next section, there is good research consensus that cohesion can be assessed at three levels, namely, local cohesion, global cohesion, and text cohesion, each of which can be measured using one or more different types of indices, such as connectives, lexical overlap, semantic overlap/similarity, givenness, type-token ratio (TTR), and lexical density (Crossley et al., 2016b, 2019). A sizable body of L2 writing and assessment studies has shown that these different levels of cohesion and/or different types of cohesion indices can significantly predict cohesion, coherence, or quality ratings of L2 written production, although their predictive power may be affected by different learner-related (e.g., proficiency level) and task-related variables (e.g., genre) (Guo et al., 2013; Zhang et al., 2022).
Despite the progress made in conceptualizing and assessing cohesion, a notable gap in existing cohesion research lies in the lack of systematic attention to lexical ambiguity in word-based indices, such as those based on connectives and lexical/semantic overlap. In the case of connective-based indices, the concern of the current study, two issues related to ambiguity exist. The first has to do with the ambiguity between discourse and non-discourse use of polysemous word forms. For example, the word once may be used as a discourse connective expressing a temporal relation (e.g., I will leave once I am done) or as an adverb meaning “one time” (e.g., The bell will ring once) or “previously” (e.g., I once really liked it). Recent research has started to attend to but has not yet fully addressed this issue (Crossley et al., 2019) (see discussion in the next section). The second has to do with the specific discourse relation senses that polysemous discourse connectives may be used to express in context. For example, as a discourse connective, the word since may be used to express either a temporal relation (e.g., I haven’t seen him since we met in May) or contingency (e.g., You don’t have to go since you are so busy). This issue has not yet been explicitly or systematically considered in existing connective-based cohesion indices. It would appear reasonable to argue that systematic resolution of these two issues of ambiguity will yield a more accurate characterization of the use of connectives as cohesive devices in written texts and greater reliability of connective-based indices of local cohesion. Considering this research gap, the current study proposes a comprehensive set of 34 sense-aware connective-based cohesion indices that distinguish discourse and non-discourse connective forms as well as four discourse relation senses of discourse connectives (i.e., expansion, contingency, comparison, and temporal).
The study further evaluates the extent to which the proposed sense-aware indices correlate with and predict human cohesion ratings of written texts produced by young English Language Learners (ELLs) in comparison to 25 existing connective-based cohesion indices.
Measuring cohesion of written texts
Text connectedness (i.e., the degree to which different components of a text, such as clauses, sentences, paragraphs, and sections, and the information contained therein are linked) plays a critical role in the comprehensibility and processing of a text (Halliday & Hasan, 1976). Central to text connectedness are the concepts of cohesion and coherence. Cohesion refers to the use of explicit linguistic and textual features to connect the ideas in different parts of a text, whereas coherence is understood not as an explicit or objective textual property but as the extent to which the text allows its readers to construct a coherent, connected mental representation of its content (Halliday & Hasan, 1976). Different from cohesion, coherence cannot be achieved by linguistic or textual features alone but may interact with variables beyond the text itself, such as the reader’s background knowledge, reading skills, and language proficiency (e.g., McNamara et al., 2014).
Scholars have proposed several frameworks or taxonomies of textual cohesion. Halliday and Hasan’s (1976) highly influential framework of textual cohesion presents five types of cohesive ties, “the means whereby elements that are structurally unrelated to one another are linked together, through the dependence of one upon another for its interpretation” (p. 27). The first four types realize what they call grammatical cohesion. These include reference (i.e., the use of personal pronouns, demonstratives, and comparatives to refer back to previously mentioned entities), substitution (i.e., the use of words such as do to replace a previous expression), ellipsis (i.e., the omission of expressions implied by the context), and conjunction (i.e., the additive, adversative, causal, and temporal conjunctive relations between sentences signaled by conjunctive elements). The last type realizes lexical cohesion, manifested in reiteration, the use of words with certain lexical relations (e.g., synonyms, hyponyms, and hypernyms), and collocational items (i.e., words that frequently co-occur).
Although echoing the distinctions made between grammatically and lexically driven cohesion and among different types of conjunctive relations by Halliday and Hasan (1976), Louwerse (2002) proposed the additional view that cohesion can be achieved at local, global, and text levels. Local cohesive devices are used to connect clauses or sentences, such as explicit connectives (e.g., while, therefore), lexical overlap between sentences, and semantic overlap/similarity between sentences. Global cohesive devices are used to connect paragraphs in a text, such as lexical overlap between paragraphs and semantic overlap/similarity between paragraphs. Text cohesive devices are used to build connections throughout the text, such as givenness features (e.g., the use of a pronoun to refer to a noun referent after its initial mention). Because this view is highly operationalizable, numerous indices that tap into the use of different types of cohesive devices have been proposed to measure cohesion at these levels, such as those integrated in two widely used computational tools for cohesion analysis, namely, Coh-Metrix (Graesser et al., 2004) and the Tool for the Automatic Analysis of Cohesion (TAACO) (Crossley et al., 2016b, 2019).
For example, TAACO 2.0 includes 25 connective-based indices (e.g., number of causal connectives), 108 indices of lexical overlap between sentences (e.g., number of lemma types that occur at least once in the next sentence) or paragraphs (e.g., number of lemma types that occur at least once in the next paragraph), 16 indices of semantic overlap/similarity between sentences or paragraphs (e.g., average sentence to sentence overlap of noun synonyms), four givenness indices (e.g., number of third-person pronouns divided by number of nouns), and 15 TTR (e.g., lemma TTR) and lexical density indices (e.g., ratio of content words). It also includes 26 source text similarity indices (e.g., percentage of unigrams in the text that are keywords) for evaluating the similarity between a source text and a target text in source-based writing tasks.
A substantial body of second language acquisition (SLA) research has argued for and offered empirical evidence of the role of textual cohesion in L2 comprehension and production, two critical aspects of SLA. Indeed, cohesive devices can help L2 learners establish connections between ideas and follow the information flow more easily in understanding an L2 text. They also serve as a necessary tool for L2 learners to express their thoughts logically and coherently in L2 production. SLA research on the role of textual cohesion in L2 comprehension has reported that L2 readers rely on referential and lexical cohesion to a far larger extent than first language (L1) speakers in comprehension processes (Jonz, 1987), that L2 readers benefit from causal markers (Degand & Sanders, 2002) and awareness of lexical cohesive links in reading comprehension (Bayraktar, 2011), and that content word overlap affects L2 reading not only in localized processing but also in overall comprehension (Biler, 2018). In addition, L2 learners have been found to exhibit more homogeneous processes of meaning representation when reading high-cohesive texts and more heterogeneous ones when reading low-cohesive texts (Bilki, 2014). These findings confirm that different types and density levels of cohesive devices can affect L2 learners’ comprehension and meaning representation processes in multifaceted ways.
Just as L2 learners’ comprehension processes are affected by different types and density levels of cohesive devices, the types and density levels of cohesive devices used by L2 learners could be expected to affect the coherence and comprehensibility of their L2 production, which could in turn affect ratings of the cohesion, coherence, or overall quality of their production in the context of L2 assessment. A number of L2 writing studies have reported findings precisely on the relationship of different types of local, global, and text cohesion indices to human ratings of cohesion, coherence, or writing quality. These findings have shown that different levels of cohesion may be related to cohesion, coherence, or quality ratings in different ways. For example, several studies have reported that local cohesion indices (e.g., conditional connectives and lexical overlap) tend to be negatively correlated with quality ratings of L2 writing, whereas global cohesion indices (e.g., lexical overlap between paragraphs) tend to show positive correlations (e.g., Crossley & McNamara, 2012; Guo et al., 2013; Kim & Crossley, 2018). These findings point to the need to disentangle the effect of different levels of cohesion on writing quality. However, some studies have found variation in how different types of indices at the same level of cohesion may be related to cohesion, coherence, or quality ratings. For example, Crossley et al. (2016a) reported that while some connective-based local cohesion indices (e.g., positive intentional connectives) showed negative correlations with quality ratings of timed descriptive essays written by college-level L2 English learners, several others (e.g., positive causal connectives) showed positive correlations, highlighting the importance of differentiating among different types of connectives in using connective-based cohesion indices. Genre appears to be another important factor that needs to be considered in examining the relationship between indices of cohesion and cohesion, coherence, or quality ratings. For example, Zhang et al. (2022) examined the relationship of six cohesion indices from TAACO to the quality ratings of two genres of writing by L1 Chinese college-level learners of English, namely, application letters and argumentative essays. They reported positive correlations for two text cohesion indices (moving-average TTR and the lexical density of word types) for both genres, negative correlations for lemma overlap between adjacent sentences and paragraphs for argumentative essays, and no significant correlation for either positive or negative logical connectives for either genre. Taken together, these studies have showcased the increasing attention to the validation of cohesion indices and the relationship of such indices to human ratings of cohesion, coherence, or quality in recent L2 writing research. They also informed our focus on a specific type of indices at one level of cohesion (i.e., connective-based indices of local cohesion) in the current study, our attention to different types of connectives, and the use of essays of a single genre (i.e., argumentative essays) in our analysis.
Notably, existing cohesion research has not yet systematically attended to issues of lexical ambiguity in developing and validating indices based on word forms. For example, indices based on lexical or semantic overlap rely on matches of either the same lemma forms (in lexical overlap indices) or lemma forms of synonyms (in semantic overlap indices). However, even within the same text, the same word may be used with different meanings, and the degree of lexical ambiguity within the text may affect the reliability of such lexical and semantic overlap indices. As a step toward addressing this research gap, the current study focuses on resolving two issues of ambiguity for connective-based indices, a type of index that has been used extensively in cohesion research. As mentioned previously, the first issue involves the ambiguity between discourse and non-discourse use of polysemous word forms (e.g., so as a connective and an intensifier). Although prior cohesion research largely ignored this issue, Crossley et al. (2019, p. 17) used the Stanford Neural Network Dependency Parser (Chen & Manning, 2014) “to disambiguate word forms that can be used as both cohesive devices and for other purposes” in developing TAACO 2.0. However, as specified in the TAACO 2.0 manual found at the TAACO website (see Footnote 1), in each of the various lists of different types of connectives, only a small number of connectives that receive the “mark” tag in the Stanford dependency representation are disambiguated. For example, the list of “positive causal connectives” contains 41 items, among which only two items (since, so) are disambiguated this way, whereas many other potentially ambiguous word forms (e.g., condition, due, even, follow, make, only, following) are not. As such, the effort to resolve the first ambiguity issue remains partial.
The second issue has to do with the specific discourse relation senses that polysemous discourse connectives may be used to express in context. This issue has not yet been systematically addressed in existing cohesion research. In TAACO 2.0, for example, the list of temporal connectives includes all occurrences of while, even though it does not always mark a temporal relation. A more systematic approach to addressing these two ambiguity issues will help improve and validate the reliability of connective-based cohesion indices.
Current study
Considering the research gaps discussed previously, this study proposes a comprehensive set of 34 sense-aware connective-based cohesion indices and evaluates their correlations with and predictive power for cohesion ratings of ELLs’ written production in comparison to 25 existing connective-based indices. Within the context of the current study, we use the term “discourse relation sense” to refer to the specific type of discourse relation signaled by a discourse connective (see discussion in the Method section), and our sense-aware indices account for discourse versus non-discourse uses of connectives as well as the specific discourse relations expressed by explicit discourse connectives used in written texts. The two specific research questions addressed are:
1. How do existing and sense-aware connective-based cohesion indices correlate with cohesion ratings of young ELLs’ written production?
2. How do existing and sense-aware connective-based cohesion indices predict cohesion ratings of young ELLs’ written production?
Method
ELL writing data
The writing data used in the current study consisted of the full training dataset of the Kaggle Feedback Prize English Language Learning Competition (Vanderbilt University & The Learning Agency Lab, 2022). The goal of the competition was to develop effective automated essay scoring and feedback tools for ELLs. The dataset was part of the larger English Language Learner Insight, Proficiency and Skills Evaluation (ELLIPSE) Corpus (see Footnote 2), released in 2023, which contains about 6,500 independent essays written on 44 different prompts by ELLs in the United States with diverse backgrounds in terms of gender, race/ethnicity, grade level, and economic disadvantage. The dataset used for the 2022 Kaggle competition and in the current study included 3,911 argumentative essays written by 8th to 12th grade ELLs in the United States as part of state standardized writing assessments from the 2018 to 2019 and 2019 to 2020 school years. This dataset contained a total of 1.683 million words, with an average of 430.372 words per essay (standard deviation [SD] = 191.974). Each essay was scored on a scale of 1.0 to 5.0 (with 0.5 increments) for each of the following six analytic measures: cohesion, syntax, vocabulary, phraseology, grammar, and conventions. The essays were rated by a pool of 26 raters recruited and trained by the corpus compilers. Most raters were senior undergraduate students or graduate students in an applied linguistics department, and all raters had experience teaching English as a second language. The corpus compilers adopted a double-blind rating process with 100% adjudication to ensure rating reliability, with each essay independently reviewed by two raters and adjudicated by a third one when necessary; a Many-Facet Rasch Measurement analysis was also conducted for the raters and texts to confirm the reliability of the ratings (see Footnote 3). Only the cohesion scores were used in the current study.
The rating rubric for all measures is available with the ELLIPSE Corpus dataset (see Footnote 2); it specifies that a cohesion score of 5 corresponds to the following: “Text organization consistently well controlled using a variety of effective linguistic features such as reference and transitional words and phrases to connect ideas across sentences and paragraphs; appropriate overlap of ideas.” The average cohesion score of the essays in the dataset was 3.127 (SD = .663).
Connective identification and disambiguation
Each text in the dataset was processed in three steps. First, we used LanguageTool (see Footnote 4), an open-source grammar checker, to automatically correct spelling, grammar, and punctuation errors in each text. This step served to help minimize issues such errors might pose to syntactic parsing. Second, we used Stanford CoreNLP 3.6.0 (Manning et al., 2014; see Footnote 5) to perform constituency parsing on each text. This step outputs a constituency parse (i.e., a phrase-structure tree) for each sentence in the text that captures the hierarchical relations among the sentence’s constituents. Finally, we used the Explicit Discourse Connectives Tagger (EDCT; Pitler & Nenkova, 2009; see Footnote 6) to automatically identify and disambiguate explicit connectives in each parsed text. The EDCT takes a constituency-parsed text as input and augments the constituency parses by annotating each connective with a tag indicating that it is either a non-discourse connective (Non-DC) or a discourse connective (DC) with a specific discourse relation sense.
The EDCT was trained and evaluated using data from the Penn Discourse Treebank (PDTB) Version 2.0 (Prasad et al., 2008), a version of the one-million-word Wall Street Journal Corpus annotated for discourse relations and their arguments. Prasad et al. (2008) understood discourse relations as holding “between two and only two arguments” and characterized arguments as “abstract objects” that are commonly expressed in single clauses or sentences but can also be associated with multiple clauses or sentences or be denoted by non-clausal units such as discourse deictics (i.e., this, that) that refer to abstract objects or nominalizations with an event interpretation (e.g., their failure to pass the exam) (p. 2962). Discourse connectives are thus explicit linguistic expressions (both single words and multiple-word expressions) that signal discourse relations between two arguments. Each discourse relation was annotated by marking the discourse connective signaling it (e.g., because), labeling the discourse connective with its discourse relation sense, and annotating the attributes of its arguments. Altogether, the corpus contains annotations of 18,459 instances of 100 different explicit discourse connectives. The hierarchical taxonomy of discourse relation senses includes four top-level semantic classes, namely, Expansion (information in one clause elaborates that in the other), Contingency (information in one clause expresses the cause of that in the other), Comparison (information in one clause is compared or contrasted with that in the other), and Temporal (information in one clause is temporally related to that in the other) (see examples in Table 1). The PDTB also provides two lower-level (i.e., type and subtype) annotations for each top-level class.
For example, Temporal has two types, namely, Asynchronous and Synchronous, and the former has two subtypes, namely, Precedence and Succession. The full hierarchical taxonomy can be found in Prasad et al. (2008, p. 2965). The EDCT annotates each instance of a discourse connective with one of the four top-level discourse relation senses only. These categories are theoretically meaningful as they align with the four conjunctive relations (i.e., additive, causal, adversative, and temporal) distinguished in Halliday and Hasan’s (1976) framework and endorsed by Louwerse (2002).
Pitler and Nenkova (2009) reported an accuracy of .9626 for distinguishing DCs versus Non-DCs and an accuracy of .9415 for classifying DCs into the four discourse relation senses. We evaluated the performance of the EDCT for tagging the essays produced by ELLs on 30 texts randomly sampled from our dataset, with 10 from each of the following three score bands: low (1–2 points, N = 352), mid (2.5–3.5 points, N = 2,874), and high (4–5 points, N = 685). Altogether, these 30 texts contained 1,005 connective tokens. One researcher and a trained graduate student assistant collaboratively annotated all 1,005 connective tokens manually. We first labeled each connective as either clear/proper (N = 899 or 89.5%) or unclear/improper (N = 106 or 10.5%). An instance was considered unclear/improper if the two annotators agreed that the discourse relation of the connective was either difficult to determine from the context or improper for the context. For example, the annotators labeled the word but in “We had this sorta thing to but easier” as unclear/improper, which the EDCT labeled as Comparison. Notably, most unclear/improper instances were found in the 10 low-scoring samples, which represented 9.0% (i.e., 352 of 3,911) of the full dataset. In the 20 mid- and high-scoring samples, which represented 91.0% (i.e., 3,559 of 3,911) of the full dataset, fewer than 5% of the connectives were labeled as unclear/improper. Although in most cases the tags assigned to such unclear/improper instances by the EDCT appeared to align with the most likely senses intended by the learners, we decided to exclude them in reporting the accuracy of the EDCT to ensure the reliability of our evaluation. Next, we labeled each clear/proper instance with one of the five tags in the EDCT tagset (i.e., Non-discourse, Expansion, Contingency, Comparison, and Temporal).
A comparison of the tags assigned by the EDCT and the annotators to the 899 clear/proper instances revealed an overall accuracy of 90.4% (i.e., 813 of 899) for the EDCT. More details about the accuracy of the EDCT on samples from each score band and its precision, recall, and F-score for each connective type can be found in Appendix S1.
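The evaluation measures named above can be illustrated with a minimal sketch. It is not the authors' evaluation script: the function name `evaluate_tagger` and the toy labels are hypothetical, and only the five tag names come from the text.

```python
from collections import Counter

# The five tags in the EDCT tagset, as listed in the text.
TAGS = ["Non-discourse", "Expansion", "Contingency", "Comparison", "Temporal"]


def evaluate_tagger(gold, predicted):
    """Return overall accuracy and per-tag precision, recall, and F-score."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    # Overall accuracy: share of tokens whose predicted tag matches the gold tag.
    correct = sum(g == p for g, p in zip(gold, predicted))
    accuracy = correct / len(gold) if gold else 0.0

    gold_counts = Counter(gold)
    pred_counts = Counter(predicted)
    true_pos = Counter(g for g, p in zip(gold, predicted) if g == p)

    per_tag = {}
    for tag in TAGS:
        precision = true_pos[tag] / pred_counts[tag] if pred_counts[tag] else 0.0
        recall = true_pos[tag] / gold_counts[tag] if gold_counts[tag] else 0.0
        denom = precision + recall
        f_score = 2 * precision * recall / denom if denom else 0.0
        per_tag[tag] = {"precision": precision, "recall": recall, "f_score": f_score}
    return accuracy, per_tag


# Toy, made-up annotations (not data from the study).
gold = ["Expansion", "Contingency", "Non-discourse", "Temporal", "Comparison", "Expansion"]
pred = ["Expansion", "Contingency", "Expansion", "Temporal", "Comparison", "Expansion"]
accuracy, per_tag = evaluate_tagger(gold, pred)
```

In this toy case one non-discourse token is mistagged as Expansion, which lowers the precision for Expansion and the recall for Non-discourse while leaving the other tags unaffected.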
Sense-aware connective-based cohesion indices
Based on the annotations provided by the EDCT, we proposed 34 sense-aware connective-based indices of local cohesion. Different from the approaches taken in previous research, we distinguished DCs from Non-DCs and further differentiated four subcategories of DCs based on the discourse relation senses they were used to express in written texts. For Non-DCs and each of the four subcategories of DCs, we computed the following six indices for each text: (i) token density, the number of connective tokens of this category divided by the number of word tokens; (ii) TTR, a measure of connective diversity calculated by dividing the number of connective types of this category by the number of connective tokens of this category; (iii) type density, the number of connective types of this category divided by the number of word types; (iv) type number, the number of connective types of this category; (v) token ratio, the ratio of connective tokens of this category among all connective tokens; and (vi) type ratio, the ratio of connective types of this category among all connective types. For DCs, we computed the first four indices only, as the token and type ratios would be redundant with those for Non-DCs. Table 1 summarizes the 34 proposed indices. The 34 indices were computed for each text using a script written in Python 3. The Python scripts as well as all the extracted indices are openly available on GitHub (see Footnote 7).
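The index definitions above can be sketched as follows. This is an illustration under stated assumptions, not the authors' released script: the input format (a list of word tokens plus a list of `(connective_form, tag)` pairs), the tag label `NonDC`, and most index names are hypothetical (only the `_type_num` naming pattern appears in the text), and a connective type is assumed to be a distinct lowercased connective form.

```python
SENSES = ("Expansion", "Contingency", "Comparison", "Temporal")
NON_DC = "NonDC"  # hypothetical label for non-discourse connectives


def safe_div(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def sense_aware_indices(word_tokens, tagged_connectives):
    """Compute 34 indices: six per Non-DC/sense category, four for all DCs."""
    n_word_tokens = len(word_tokens)
    n_word_types = len(set(word_tokens))
    n_conn_tokens = len(tagged_connectives)
    # Assumption: a connective type is a distinct connective form.
    n_conn_types = len({form for form, _ in tagged_connectives})

    groups = {
        tag: [form for form, t in tagged_connectives if t == tag]
        for tag in (NON_DC, *SENSES)
    }
    # All discourse connectives pooled across the four senses.
    groups["DC"] = [form for form, t in tagged_connectives if t != NON_DC]

    indices = {}
    for name, forms in groups.items():
        tokens, types = len(forms), len(set(forms))
        indices[f"{name}_token_density"] = safe_div(tokens, n_word_tokens)
        indices[f"{name}_ttr"] = safe_div(types, tokens)
        indices[f"{name}_type_density"] = safe_div(types, n_word_types)
        indices[f"{name}_type_num"] = types
        if name != "DC":  # the two ratios are skipped for the pooled DC category
            indices[f"{name}_token_ratio"] = safe_div(tokens, n_conn_tokens)
            indices[f"{name}_type_ratio"] = safe_div(types, n_conn_types)
    return indices


# Toy input with made-up tags (not tagger output).
words = "i will leave once i am done because the bell rang once and i was tired".split()
tagged = [("once", "Temporal"), ("because", "Contingency"),
          ("once", NON_DC), ("and", "Expansion")]
indices = sense_aware_indices(words, tagged)
```

Five categories with six indices each plus four indices for the pooled DC category give the 34 indices. How types are counted when one form occurs with several tags (here, once) is a design choice that the text does not spell out.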
Connective-based cohesion indices from TAACO
To evaluate how the 34 sense-aware connective-based cohesion indices proposed in the current study correlate with and predict cohesion scores, both in comparison to and in combination with existing connective-based cohesion indices, we used all 25 connective-based cohesion indices from TAACO 2.0. According to Crossley et al. (2016b, pp. 1231–1232):
“Many of the connective indices are similar to those found in Coh-Metrix (McNamara et al., 2014) and are theoretically based on two dimensions. The first dimension contrasts positive versus negative connectives, and the second dimension is associated with the particular classes of cohesion identified by Halliday and Hasan (1976) and Louwerse (2001), such as temporal, additive, and causative connectives.”
Table 2 summarizes the 25 connective-based indices from TAACO along with their descriptions and examples of the corresponding DCs involved. These indices are computed with a list-based approach. For a particular list, the frequency of occurrence of each list item in the text is counted, and the sum of the frequencies of all list items is then tallied and divided by the total number of words in the text, with the exception that some words are disambiguated using the Stanford dependency parser (Chen & Manning, 2014). More detailed descriptions of these indices can be found in Crossley et al. (2016b, 2019) and in the TAACO 2.0 manual.
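The list-based computation described above can be sketched in a few lines. This is not TAACO's code: the function name `list_based_index`, the simple regular-expression tokenizer, and the three-item list are assumptions made for illustration, and the dependency-parser disambiguation step is deliberately left out.

```python
import re


def list_based_index(text, connective_list):
    """Sum the occurrences of all list items and divide by the word count."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    total = 0
    for item in connective_list:
        item_words = item.lower().split()
        n = len(item_words)
        # Count every position where the (possibly multiword) item matches.
        total += sum(
            1 for i in range(len(words) - n + 1) if words[i:i + n] == item_words
        )
    return total / len(words)


# Hypothetical mini-list; the actual TAACO lists are much longer.
causal = ["because", "so", "as a result"]
text = "I was so tired because I stayed up late, so I left."
score = list_based_index(text, causal)
```

Without disambiguation, the intensifier so in "so tired" is counted alongside the connective so, which is the kind of discourse versus non-discourse ambiguity the sense-aware indices are meant to resolve.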
Statistical analysis
To address research question 1, we performed Pearson correlation analyses between the 25 connective-based indices from TAACO and cohesion score as well as between the 34 sense-aware connective-based indices proposed in the current study and cohesion score. In addition to statistical significance (p < .05), we interpret correlation coefficients with at least a small effect size (|r| ≥ .1) as meaningful (Cohen, 1988). To address research question 2, we performed three sets of regression analyses. In the first set of analysis, all connective-based indices from TAACO were used to predict cohesion score. In the second set of analysis, all sense-aware connective-based indices were used to predict cohesion score. In the third set of analysis, all connective-based indices from TAACO and all sense-aware connective-based indices were used to predict cohesion score.
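The decision rule for a significant and meaningful correlation can be sketched as below. All function names are hypothetical, and the p value uses a normal approximation to the t distribution, which is an assumption that is reasonable only for large samples such as the 3,911 essays here; an exact test (e.g., `scipy.stats.pearsonr`) would normally be preferred. The .5 cutoff for a large effect follows Cohen's conventional benchmarks and is not stated in the text.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("x and y must have the same length (at least 3)")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return sxy / math.sqrt(sxx * syy)


def approx_p_value(r, n):
    """Two-sided p value from a normal approximation (large-sample assumption)."""
    if abs(r) >= 1:
        return 0.0
    t = r * math.sqrt((n - 2) / (1 - r * r))
    return math.erfc(abs(t) / math.sqrt(2))


def effect_size_label(r):
    """Label |r| with Cohen's conventional benchmarks (.1, .3, .5)."""
    magnitude = abs(r)
    if magnitude >= 0.5:
        return "large"
    if magnitude >= 0.3:
        return "medium"
    if magnitude >= 0.1:
        return "small"
    return "trivial"


def is_meaningful(r, p, alpha=0.05, min_abs_r=0.1):
    """Significant (p < alpha) and at least a small effect (|r| >= min_abs_r)."""
    return p < alpha and abs(r) >= min_abs_r
```

For instance, with n = 3,911 a coefficient of .05 is statistically significant under this approximation but falls below the .1 threshold, so `is_meaningful(0.05, approx_p_value(0.05, 3911))` is false; this is the "significant but trivial" case.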
As noted by one reviewer, some scholars have pointed out that the commonly used procedures of predictor preselection based on bivariate correlations and of stepwise variable selection in regression modeling could result in the exclusion of useful predictors that may affect the outcome variable together with other predictors (e.g., Smith, 2018; Sun et al., 1996). Ferenci (2017) recommended that all potential confounders be included, or their selection be blinded to the outcome when constructing models. In light of these concerns and recommendations, in each of the three sets of analysis, we report the results from four regression models that do not require feature preselection and that collectively capture both linear and nonlinear relationships between the predictors and the outcome variable. These include the linear regression model without feature preselection, two Bayesian regression models that address multicollinearity and overfitting through regularization (i.e., Bayesian automatic relevance determination regression and Bayesian ridge regression), and a nonlinear ensemble-learning model (i.e., random forest regression). These models were built using the scikit-learn library in Python 3 (https://scikit-learn.org/stable/user_guide.html). In our experiments, we randomly divided the dataset into a training set (2,620 essays, or two thirds) and a test set (1,291 essays, or one third). Each model was trained on the training set and used to predict the cohesion scores of the essays in the test set. The performance of these models on the test set was evaluated using the Pearson correlation coefficient and root mean squared error (RMSE) between the predicted and actual cohesion scores as well as R² and adjusted R².
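The data split and the test-set evaluation measures can be sketched in plain Python as follows. This is an illustration, not the authors' pipeline: the function names, the random seed, and the toy scores are hypothetical, the models themselves (fitted with scikit-learn in the study) are not reproduced, and the adjusted R² uses the standard formula 1 − (1 − R²)(n − 1)/(n − p − 1), which the text does not state explicitly.

```python
import math
import random


def split_indices(n_items, n_train, seed=0):
    """Randomly split item indices into a training part and a test part."""
    order = list(range(n_items))
    random.Random(seed).shuffle(order)
    return order[:n_train], order[n_train:]


def rmse(y_true, y_pred):
    """Root mean squared error between actual and predicted scores."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n_samples, n_predictors):
    """Penalize R^2 for the number of predictors (standard formula, assumed)."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)


# Split sizes reported in the text: 2,620 training and 1,291 test essays.
train_idx, test_idx = split_indices(3911, 2620)

# Toy, made-up scores to exercise the metrics (not results from the study).
actual = [3.0, 2.5, 4.0, 3.5]
predicted = [3.0, 3.0, 3.5, 3.5]
rmse_value = rmse(actual, predicted)
r2_value = r_squared(actual, predicted)
adj_r2_value = adjusted_r_squared(r2_value, n_samples=4, n_predictors=1)
```

Because adjusted R² penalizes additional predictors, it allows a fairer comparison between the model sets with 25, 34, and 59 indices than R² alone.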
Results
Correlation analysis
Table 3 presents the descriptive statistics of the 25 connective-based indices from TAACO as well as their correlations with cohesion scores, ranked by the absolute values of the correlation coefficients. The distribution plots of these indices are provided in Appendix S2. Among the 25 indices, only three exhibited significant and meaningful correlations (|r| ≥ .1, p < .05) with cohesion scores, namely, number of sentence linking words (r = −.123, p < .001), number of positive causal connectives (r = −.110, p < .001), and number of basic connectives (r = −.101, p < .001). These correlations were all negative and small. Another 10 indices exhibited significant but trivial correlations (|r| < .1, p < .05).
Note: Num = number. SD = standard deviation. Bolded r values indicate significant and meaningful correlations (|r| ≥ .1, p < .05).
Table 4 presents the descriptive statistics of the 34 sense-aware connective-based indices proposed in the current study as well as their correlations with cohesion scores, ranked by the absolute values of the correlation coefficients. The distribution plots of these indices are provided in Appendix S2. Among these indices, 23 exhibited significant and meaningful correlations (|r| ≥. 1, p <. 05) with cohesion scores. Notably, two indices achieved medium effect sizes (|r| ≥. 3) (Cohen, Reference Cohen1988), namely, DC_type_num (r =. 381, p <.001) and Expansion_type_num (r =. 330, p <.001), and 16 indices exhibited stronger correlations than all 25 connective-based indices from TAACO (i.e., |r| >. 123). Another six indices exhibited significant but trivial correlations (|r| <. 1, p <. 05).
Note: Num = number. SD = standard deviation. Bolded r values indicate significant and meaningful correlations (|r| ≥. 1, p <. 05).
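The thresholds used in reporting Tables 3 and 4 can be expressed as a simple decision rule, sketched below. The function name and labels are hypothetical. The .5 cutoff for a large effect follows Cohen's conventional benchmarks and is not used in the analyses above, and the p value of .0005 merely stands in for a reported p < .001.

```python
def classify_correlation(r, p, alpha=0.05):
    """Label a correlation using the significance and effect size cutoffs above."""
    if p >= alpha:
        return "nonsignificant"
    size = abs(r)
    if size < 0.1:
        return "trivial"  # significant but not meaningful
    if size < 0.3:
        return "small"
    if size < 0.5:  # assumed upper bound for a medium effect (Cohen's benchmark)
        return "medium"
    return "large"


# Two coefficients reported in the text, with a stand-in p value for p < .001.
sentence_linking = classify_correlation(-0.123, 0.0005)
dc_type_num = classify_correlation(0.381, 0.0005)
```

Under this rule, the strongest TAACO index (sentence linking words) is labeled small and DC_type_num is labeled medium, matching the descriptions given for the two tables.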
Regression analysis
Table 5 summarizes the performance on the test set of the four regression models trained on the training set with different sets of indices. With the 25 connective-based indices from TAACO, the cohesion scores predicted by the four regression models showed correlation coefficients ranging from .275 (Bayesian ridge regression) to .332 (random forest regression) with human-rated cohesion scores. These were outperformed by all four regression models trained on the 34 sense-aware connective-based indices, whose predicted cohesion scores achieved correlation coefficients between .402 (random forest regression) and .414 (Bayesian ridge regression) with human-rated cohesion scores. When all 59 connective-based indices were included, the performance of the four regression models further improved, with their predicted cohesion scores showing correlation coefficients ranging from .436 (random forest regression) to .447 (linear regression) with human-rated cohesion scores. The same patterns of performance changes were observed for RMSE, R², and adjusted R². Overall, the models trained on the sense-aware connective-based indices performed better in predicting the cohesion scores on the test set than those trained on the connective-based indices from TAACO, and the models trained on all 59 indices yielded the most accurate predictions.
Note: ARD = automatic relevance determination. RMSE = root mean squared error.
Discussion
In light of the observation that existing cohesion indices have not yet systematically addressed issues related to lexical ambiguity, this study proposed a set of 34 sense-aware connective-based indices of cohesion based on the annotations provided by the Explicit Discourse Connective Tagger (Pitler & Nenkova, 2009), which allowed us to differentiate discourse versus non-discourse uses of explicit connectives as well as the specific discourse relation senses expressed by discourse connectives in context. We further examined their correlations with and predictive power for cohesion ratings of argumentative essays written by 8th to 12th grade ELLs in the United States both in comparison to and in combination with 25 connective-based indices from TAACO 2.0. Our analyses yielded a number of substantive findings with useful implications for cohesion research.
The results pertaining to our two research questions indicate that the sense-aware connective-based indices of cohesion are more strongly correlated with and better predictors for cohesion scores than existing connective-based indices. Three of the 25 connective-based indices from TAACO exhibited significant and meaningful correlations with cohesion scores, all with small effect sizes. In contrast, 23 of the 34 sense-aware connective-based indices proposed in the current study showed significant and meaningful correlations with cohesion scores, two with medium effect sizes and 21 with small effect sizes. Among these, 16 sense-aware indices showed stronger correlations with cohesion scores than all 25 TAACO indices. The four regression models trained with the 34 sense-aware connective-based indices on the training set all performed better in predicting the cohesion scores on the test set than all four regression models trained with the 25 connective-based indices from TAACO.
The stronger correlational relationship with and greater predictive power for cohesion score achieved by the sense-aware connective-based indices may be attributed to two main factors. First, the TAACO indices were all based on normalized frequency counts (similar to our density indices), whereas the sense-aware indices captured several additional aspects of connective use, including total type frequency counts, connective diversity (i.e., the TTR indices), and connective complexity (i.e., the ratio indices). Our findings showed that these different aspects of connective use all contributed useful information for assessing cohesion. As noted by one reviewer, the indices based on total type frequency counts are likely affected by text length and may therefore be measuring both text length and cohesion. These indices were examined in the current study along with the density indices because they could more directly reflect the full range of discourse connectives produced by learners than the density indices. For example, a 100-word text with 10 connective types and a 150-word text with 15 connective types would have the same DC_type_density (i.e., 10 per 100 words). Although normalization accounts for text length, it may also conceal some differences in the learners’ productive ability, and it is an empirical question whether raters may be sensitive to the actual range of connective types produced by the learners writing for the same tasks. The weak correlation between text length (i.e., number of words per text) and cohesion scores (r = .222, p < .001) suggests that the relatively high correlations exhibited by some of the type_num indices could not be entirely the result of increased length but also reflected the raters’ sensitivity to the absolute range of connective types used.
Lu (2012) also found that when evaluating timed speech samples produced by L2 speakers on the same tasks, the number of different words, a measure of the full range of word types produced, usefully complemented lexical diversity indices based on normalized frequencies in predicting quality ratings. Critically, however, the use of such type_num indices should only be considered along with density indices when evaluating samples produced for the same writing task(s) using the same rubric, as Lu (2012) also recommended.
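The contrast between type counts and density can be illustrated with a short sketch based on the 100-word and 150-word example above. The function name and the per-100-words scaling argument are assumptions introduced for illustration and do not reproduce the study's scripts.

```python
def dc_type_density(n_dc_types, n_words, per=100):
    """Number of discourse connective types per `per` words of text."""
    return n_dc_types * per / n_words


# The two hypothetical texts from the example: (DC types, words).
short_text = (10, 100)
long_text = (15, 150)

density_short = dc_type_density(*short_text)
density_long = dc_type_density(*long_text)

# The raw type counts differ even though the normalized densities coincide.
type_count_gap = long_text[0] - short_text[0]
```

Both texts receive a density of 10 types per 100 words, whereas the raw counts differ by five types, which is the information that the type_num indices retain and the density indices discard.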
Second, and more importantly, the discourse relation sense disambiguation of the connectives likely helped improve the reliability of the connective-based indices. A review of the lists of different types of connectives used to compute the connective-based indices in TAACO shows that the lexical ambiguity of connective word forms could introduce noise into those indices. With only partial ambiguity resolution of some polysemous connective word forms, a connective word form with a non-discourse use may be counted as one with a discourse use, and a discourse connective used with one discourse relation sense may be counted as one used with a different discourse relation sense. Such noise could affect the reliability of the indices computed and subsequently weaken their ability to assess what they were designed to measure, namely, local cohesion achieved through the use of connectives. Examples 1 and 2 illustrate how the EDCT helped us address these issues by tagging Non-DCs with the #Non-DC tag and differentiating among different discourse relation senses of DCs. For instance, in Example 1, the EDCT tagged the first instance of and as a Non-DC and the second instance as an Expansion connective. In Example 2, it tagged the first instance of while as a Temporal connective (and, importantly, not a Comparison connective) and the second instance as a Non-DC.
Example 1. The school board shouldn’t extend the school day by adding one and#Non-DC a half hours because#Contingency the students won’t have time for#Non-DC themselves at home, the classes would take longer, and#Expansion the school will end at evening time.
Example 2. For example#Expansion, sometimes even I get distracted while#Temporal doing my work at home… If#Contingency you are at home for#Non-DC 8 hours doing work, you wouldn’t want to spend time relaxing at your home because#Contingency eventually it would get boring there after#Non-DC a while#Non-DC .
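As a rough illustration of how such tags separate discourse from non-discourse uses, the sketch below counts connective tokens and types in Example 1 with and without regard to the Non-DC tag. It assumes the inline word#Tag display format shown in the examples, which may differ from the raw output of the EDCT. The regular expression, function names, and dictionary keys are hypothetical, and the pattern captures only the word immediately preceding a tag, so a multiword connective such as "For example" would be reduced to its last word.

```python
import re
from collections import Counter

# Assumed display format: a word immediately followed by '#' and one of five tags.
TAG_PATTERN = re.compile(
    r"([A-Za-z]+)#(Non-DC|Comparison|Contingency|Expansion|Temporal)\b"
)


def tagged_connectives(text):
    """Return (lowercased word form, tag) pairs for every tagged connective form."""
    return [(word.lower(), tag) for word, tag in TAG_PATTERN.findall(text)]


def connective_counts(text):
    """Count connective tokens and types, with and without excluding Non-DCs."""
    pairs = tagged_connectives(text)
    dcs = [word for word, tag in pairs if tag != "Non-DC"]
    all_forms = [word for word, _ in pairs]
    return {
        "tokens_by_tag": Counter(tag for _, tag in pairs),
        "dc_tokens": len(dcs),                    # sense-aware: Non-DCs excluded
        "dc_types": len(set(dcs)),
        "all_form_tokens": len(all_forms),        # every tagged form counted
        "all_form_types": len(set(all_forms)),
    }


example_1 = (
    "The school board shouldn’t extend the school day by adding one and#Non-DC "
    "a half hours because#Contingency the students won’t have time for#Non-DC "
    "themselves at home, the classes would take longer, and#Expansion the school "
    "will end at evening time."
)

counts = connective_counts(example_1)
```

For Example 1, excluding Non-DCs leaves two DC tokens of two types (because, and), whereas counting every tagged form yields four tokens of three types (and, because, for), showing how non-discourse uses can inflate counts based on word forms alone.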
A comparison of the TAACO indices and the sense-aware density indices, both based on normalized frequency counts, can shed some light on the positive effect of discourse relation sense disambiguation on the connective-based indices. As indicated in Tables 3 and 4, nine of the 12 sense-aware density indices but only three of the 25 TAACO indices achieved significant and meaningful correlations (|r| ≥ .1, p < .05) with cohesion scores. Furthermore, five sense-aware density indices showed stronger correlations than all 25 TAACO indices. These differences suggest a positive effect of discourse relation sense disambiguation on the connective-based indices. To further isolate the effect of discourse relation sense disambiguation, we calculated the non–sense-aware counterparts of the following four sense-aware indices reflecting overall discourse connective use: number of DC types, density of DC types and tokens, and TTR. To this end, we generated a list of all connectives based on the tags assigned by the EDCT and used this list to obtain frequency counts of connective types and tokens in each sample, similar to how most connectives were counted in TAACO. This means Non-DCs were counted as discourse connectives as well. Table 6 presents the descriptive statistics of the non–sense-aware and sense-aware indices of overall connective use and their correlations with cohesion scores. Without differentiating Non-DCs from DCs, the three non–sense-aware frequency and density indices all exhibited higher means and SDs than their sense-aware counterparts. The lower non–sense-aware TTR value could be attributed to the larger effect of overcounting on connective tokens than on connective types. Only one non–sense-aware index (vs. three sense-aware indices) achieved significant and meaningful correlations with cohesion scores (|r| ≥ .1, p < .05).
The three sense-aware frequency and density indices all showed stronger correlations with the cohesion scores than their non–sense-aware counterparts. These findings further confirm the positive effect of discourse relation sense disambiguation on connective-based indices. They also echo those from Lu and Hu (2022), who reported superior performance of sense-aware frequency-based indices of lexical sophistication that accounted for the specific senses with which polysemous words are used for predicting holistic ratings of L2 English writing quality over existing frequency-based lexical sophistication indices that did not account for lexical ambiguity.
Note: DC = discourse connective. SD = standard deviation. * p < .05, *** p < .001.
The three TAACO indices that were significantly and meaningfully correlated with cohesion scores, namely, sentence linking connectives (e.g., although, therefore, for, then, while, so, since, as, after), positive causal connectives (e.g., arise, cause, condition, consequence, make, result), and basic connectives (i.e., for, and, nor, but, or, yet, so), all exhibited negative correlations. This result suggests that argumentative essays with higher cohesion ratings would contain fewer such connectives. Multiple previous studies have also reported negative correlations between connective-based indices of local cohesion and cohesion, coherence, or quality ratings of L1 and L2 writing and have often explained the negative correlations with the increased use of other types of local cohesive devices (e.g., lexical/semantic overlap and semantic similarity) in higher-rated writing or by more advanced learners in place of connectives (Crossley & McNamara, 2010, 2011, 2012; Guo et al., 2013; Kim & Crossley, 2018). Meanwhile, among the 23 sense-aware connective-based indices that showed significant and meaningful correlations with cohesion scores, only four Non-DC indices (i.e., the ratio and density of Non-DC types and tokens) exhibited negative correlations, whereas the other 19 DC-based indices all exhibited positive correlations. These results suggest that the negative correlations reported between connective-based indices and cohesion or coherence ratings in previous studies could have arisen from the noise in those indices with the confusion of discourse and non-discourse uses of connective word forms and the different discourse relation senses of discourse connectives.
When these confusions were removed, higher-rated argumentative essays produced by young ELLs tended to use more instead of fewer DCs, as indicated by the positive correlations between cohesion scores and the three indices gauging the number of DC types and the density of DC types and tokens. The TTR of DCs, however, showed no significant correlation with cohesion scores. In terms of the four subtypes of DCs, indices related to Comparison and Expansion DCs exhibited stronger correlations with cohesion scores than those related to the other two types, with positive correlations found for all six indices related to the former and five indices related to the latter (i.e., all but the TTR of Expansion DCs). These findings show that higher-rated essays contained more Comparison and Expansion DC types, greater ratios and density of Comparison and Expansion DC types and tokens, and more diverse Comparison DCs. Furthermore, four indices related to Temporal DCs also showed significant correlations (i.e., all but the ratio and density of Temporal DC tokens), indicating that higher-rated essays also contained a greater number, ratio, and density of Temporal DC types as well as more diverse Temporal DCs. Finally, indices related to Contingency DCs showed the weakest correlations with cohesion scores overall, with positive correlations found only for the number of Contingency DC types, whereas the other five showed no meaningful correlations. All in all, these findings show that increased uses of the four subtypes of DCs did not negatively affect cohesion score in any way but positively affected cohesion score in several ways in our dataset.
The models trained on all TAACO and sense-aware indices outperformed those trained on either set of indices alone, indicating that the two sets of indices can complement each other in useful ways. In particular, the subcategories of connectives involved in the three TAACO indices with significant and meaningful correlations with cohesion scores are either different from (sentence linking connectives and basic connectives) or more fine-grained (positive causal connectives) than the subcategories differentiated in the sense-aware indices. These findings suggest that the sense-aware indices could be potentially enhanced with additional and/or more fine-grained ways for categorizing the discourse connectives.
In summary, our analysis has provided evidence for the value of addressing discourse relation sense ambiguity and integrating discourse relation sense information in connective-based indices of cohesion. The most important implication of the current study for future cohesion research is to systematically distinguish discourse and non-discourse uses of connective word forms and the specific discourse relation senses with which DCs are used in context. By extension, other types of cohesion indices, such as lexical/semantic overlap indices of local and global cohesion, may benefit from word sense disambiguation as well, as the overlap or repetition of two identical or potentially synonymous words with two different or unrelated senses does not contribute to cohesion. Theoretically, our findings on the predictive power of models trained with sense-aware connective-based indices for cohesion score add evidence to prior SLA literature on the role of local cohesion in L2 written production. Along with similar findings from previous studies, the differential degrees of correlations between the indices based on different types of connectives and cohesion score offer empirical support for the usefulness of taxonomies of discourse or conjunctive relations in analyzing local cohesion of L2 writing (Halliday & Hasan, 1976; Louwerse, 2002; Prasad et al., 2008). Meanwhile, given the discrepancy in the polarity of the correlational relationship with cohesion scores found for existing and sense-aware connective-based cohesion indices, it would appear useful to revalidate some of the findings reported in previous research on the correlational relationship between connective-based indices of local cohesion and cohesion, coherence, or quality ratings of adult L1 and L2 writing.
Our findings also confirm the importance of appropriate use of DCs expressing different discourse relations in writing pedagogy and assessment for young ELLs. In particular, the types and ratios of Expansion and Comparison connectives exhibited stronger correlations with cohesion score than those of Contingency and Temporal connectives, suggesting potentially greater relative importance of Expansion and Comparison connectives in achieving local cohesion in argumentative writing. It is also important to help learners to expand the repertoire of connectives of each type and to develop the ability to choose appropriate and diverse connectives to express precise discourse relations in context without relying on repetitive uses of a small set of basic ones.
Conclusion
This study proposed 34 new sense-aware connective-based cohesion indices that considered the discourse and non-discourse uses of connective word forms and the specific discourse relation senses of discourse connectives in context. To this end, we used the Explicit Discourse Connective Tagger to identify explicit connectives from the text and tag each explicit connective as either a Non-DC or a DC with one of four discourse relation senses (i.e., Comparison, Contingency, Expansion, Temporal). This allowed us to distinguish DCs more systematically from Non-DCs and determine the specific discourse relation sense expressed by each DC in context. Results showed that 16 of the 34 indices we proposed exhibited stronger correlations with cohesion ratings of argumentative essays produced by young ELLs than all 25 connective-based indices from TAACO, that models trained with the sense-aware indices showed stronger predictive power for cohesion score than those trained with the connective-based indices from TAACO, and that models trained with all TAACO and sense-aware indices exhibited greater predictive power for cohesion score than those trained using either set of indices alone. Our findings highlight the value of integrating sense-level information in developing cohesion indices. As one reviewer suggested, future research could train or fine-tune large language models using the PDTB to improve the accuracy and granularity (using the full hierarchical taxonomy adopted in the corpus) of discourse relation sense tagging. 
Future cohesion research could also apply sense-aware indices to re-examine the relationship of connective-based cohesion indices to cohesion, coherence, or quality ratings of writing produced by other learner populations and/or for other types of writing tasks and, more importantly, investigate how sense disambiguation could be integrated into other types of cohesion indices based on word forms, such as lexical and semantic overlap indices of local and global cohesion.
Supplementary material
The supplementary material for this article can be found at http://doi.org/10.1017/S0272263124000202.
Acknowledgments
This research was supported by a grant from the Center for Language Education and Cooperation at the Ministry of Education of China (No. 22YH04ZW), a grant from the National Language Commission of China (No. ZDA145-9), and a grant from the Beijing Federation of Social Science Circles (No. 21DTR037). It was also supported by the Fundamental Research Funds for the Central Universities of China.
Competing interest
The authors declare no competing interests.
Data availability statement
The experiment in this article earned Open Data and Materials badges for transparent practices. The data and materials are available at https://github.com/iris2hu/sense-aware-cohesion.