1. Introduction
Disclaimer: Parts of this section highlighted in italics are generated by ChatGPT to illustrate the need for facilitating research in the detection of neural texts. We guarantee that the generated text contains no misinformation and provide it solely for illustration purposes.
Modern text generation models (TGMs) excel at producing text that can be indistinguishable from human-written texts, judging by its fluency, coherence, and grammar (Zellers et al. Reference Zellers, Holtzman, Rashkin, Bisk, Farhadi, Roesner and Choi2019; Radford et al. Reference Radford, Wu, Child, Luan, Amodei and Sutskever2019). While advanced TGMs are useful for many real-world applications, such as text summarisation or machine translation, their risks are viewed as critical (Weidinger et al. Reference Weidinger, Mellor, Rauh, Griffin, Uesato, Huang, Cheng, Glaese, Balle and Kasirzadeh2021; Bommasani et al. Reference Bommasani, Hudson, Adeli, Altman, Arora, von Arx, Bernstein, Bohg, Bosselut and Brunskill2021). Here are some of the most significant ones:
1. Biases and Stereotypes: Large-scale TGMs are trained on vast amounts of data, which means that they can replicate existing biases and stereotypes that are present in the data. For example, if the training data contain gender bias or racial bias, the text generated by the model may also contain such biases.
2. Misinformation and Fake News: Since TGMs can create convincing and coherent sentences, they can also be used to generate false information and spread misinformation or fake news. This could have serious consequences, such as spreading rumours, influencing elections, and inciting violence.
3. Malicious Use: Large-scale TGMs could be used for malicious purposes, such as generating convincing phishing emails, creating fake reviews to manipulate consumer opinion, or even creating convincing fake personas to spread propaganda.
4. Ethical Considerations: The use of large-scale TGMs raises ethical considerations around the use of data, privacy, and consent. There are also concerns around the impact of these models on the job market, as they can automate tasks previously performed by humans.
The text highlighted in italics above is generated by ChatGPTFootnote a prompted to review risks associated with the rapid development of TGMs in an academic style. This text reads fluently and naturally, adhering to the academic writing norms to a certain degree. It contains non-trivial ideas such as referencing potential impacts on the job market and creating fake personas. Overall, this demonstrates how generated text can be smoothly integrated into an academic article, compromising its authenticity.
The field of artificial text detection (ATD) (Jawahar, Abdul-Mageed, and Lakshmanan Reference Jawahar, Abdul-Mageed and Lakshmanan2020) aims to develop resources and computational methods to mitigate the risks of misusing TGMs. With the advancement of TGMs, the problem has received particular interest in the community, since humans struggle to distinguish between natural and neural texts (Gehrmann, Strobelt, and Rush Reference Gehrmann, Strobelt and Rush2019; Ippolito et al. Reference Ippolito, Duckworth, Callison-Burch and Eck2020; Karpinska, Akoury, and Iyyer Reference Karpinska, Akoury and Iyyer2021; Uchendu et al. Reference Uchendu, Ma, Le, Zhang and Lee2021). Detection of artificial texts has been framed in multiple ways, featuring various task formulations and labelling schemes. The most standardised formulation is a binary classification problem with the goal of determining whether a text is automatically generated or not (Adelani et al. Reference Adelani, Mai, Fang, Nguyen, Yamagishi and Echizen2020; Bahri et al. Reference Bahri, Tay, Zheng, Brunk, Metzler and Tomkins2021). Uchendu et al. (Reference Uchendu, Le, Shu and Lee2020) studied the neural authorship attribution task, which aims to identify the TGM that generated a given text. Dugan et al. (Reference Dugan, Ippolito, Kirubarajan, Shi and Callison-Burch2023) formulate the boundary-detection task: detecting the point at which a natural text transitions into a neural one.
Although ATD is rapidly developing, there is still a need for creating resources for non-English languages that account for the diversity of the TGMs, natural language generation tasks, and text domains (Uchendu et al. Reference Uchendu, Mikhailov, Lee, Venkatraman, Shavrina and Artemova2022). In this paper, we introduce the Corpus of Artificial Texts (CoAT), a large-scale ATD corpus for Russian composed of human-written texts from publicly available resources and artificial texts produced by 13 TGMs, varying in the number of parameters, architecture choices, pre-training objectives, and downstream applications. Each TGM is fine-tuned for one or more of six natural language generation tasks, ranging from paraphrase generation to text summarisation. CoAT provides two task formulations and public leaderboards: (i) detection of artificial texts, i.e., classifying if a given text is machine-generated or human-written; (ii) authorship attribution, i.e., classifying the author of a given text among 14 candidates. The design of our corpus enables various experiment settings, ranging from analysing the dependence of the detector performance on the natural language generation task to analysing the robustness of detectors towards unseen TGMs and text domains.
Contributions
Our main contributions are the following: (i) We create CoAT, a large-scale corpus for artificial text detection in Russian (Section 3). (ii) We present a linguistic analysis of the corpus, focusing on the distribution of stylometric features in human-written and machine-generated texts (Section 4). (iii) We provide a detailed analysis of human annotators, non-neural, and Transformer-based artificial text detectors in five experiment settings (Section 5). (iv) We release CoAT,Footnote b the source code,Footnote c and the human evaluation project, and provide two public leaderboards on the Kaggle platformFootnote d.Footnote e
2. Related work
2.1 Datasets and benchmarks
The community has put much effort into creating datasets and benchmarks for ATD tasks that cover various domains and architectures of TGMs. The design generally includes collecting human-written texts from publicly available sources and generating synthetic texts with a specific decoding strategy by (i) prompting a pretrained TGM without domain adaptation (e.g., with a news article title; Uchendu et al. Reference Uchendu, Le, Shu and Lee2020, Reference Uchendu, Ma, Le, Zhang and Lee2021), (ii) prompting a fine-tuned TGM (i.e., after continuing the pre-training on texts from the target domain; Gupta et al. Reference Gupta, Nguyen, Yamagishi and Echizen2020), and (iii) using a TGM trained or fine-tuned for a particular text generation task (e.g., MT; Aharoni, Koppel, and Goldberg Reference Aharoni, Koppel and Goldberg2014).
The GPT-2 output dataset is one of the first to address the detection of texts produced by modern large-scale TGMs (Radford et al. Reference Radford, Narasimhan, Salimans and Sutskever2018). The datasetFootnote f consists of natural texts from WebText (Reddit) and texts produced by multiple versions of the GPT-2 model fine-tuned on WebText. Munir et al. (Reference Munir, Batool, Shafiq, Srinivasan and Zaffar2021) and Diwan et al. (Reference Diwan, Chakraborty and Shafiq2021) extracted generated text from the subreddit r/SubSimulatorGPT2. Users of this subreddit are GPT-2-based TGMs fine-tuned on the posts and comments from a specific subreddit. The TweepFake dataset (Fagni et al. Reference Fagni, Falchi, Gambini, Martella and Tesconi2021) contains tweets posted by $40$ accounts, including statistical and neural TGMs and human users. Adelani et al. (Reference Adelani, Mai, Fang, Nguyen, Yamagishi and Echizen2020) propose a dataset for mitigating the generation of fake product reviews using out-of-the-box TGMs and TGMs continuously pretrained on the target domain. Kashnitsky et al. (Reference Kashnitsky, Herrmannova, de Waard, Tsatsaronis, Fennell and Labbe2022), Rodriguez et al. (Reference Rodriguez, Hay, Gros, Shamsi and Srinivasan2022), and Liyanage et al. (Reference Liyanage, Buscaldi and Nazarenko2022) design datasets to detect artificially generated academic content and explore the robustness of trainable detectors towards unseen research domains. The risk of spreading neural fake news and misinformation has facilitated the creation of ATD resources in the news domain, such as the GROVER dataset (Zellers et al. Reference Zellers, Holtzman, Rashkin, Bisk, Farhadi, Roesner and Choi2019), the NeuralNews dataset (Tan, Plummer, and Saenko Reference Tan, Plummer and Saenko2020), and the “All the News” dataset (Gupta et al. Reference Gupta, Nguyen, Yamagishi and Echizen2020).
A few recent works explore multi-domain ATD. Bakhtin et al. (Reference Bakhtin, Gross, Ott, Deng, Ranzato and Szlam2019) collect and generate texts from news, multi-genre fictional books, and Wikipedia. Stiff and Johansson (Reference Stiff and Johansson2022) propose a dataset comprising texts from news articles, product reviews, forum posts, and tweets.
Table 1 provides an overview of existing works that have explored artificial text detection for languages other than English. Independent efforts have been made to develop datasets for Bulgarian (Temnikova et al. Reference Temnikova, Marinova, Gargova, Margova and Koychev2023), Chinese (Chen et al. Reference Chen, Jin, Jing and Xie2022), and Spanish (Sarvazyan et al. Reference Sarvazyan, González, Franco-Salvador, Rangel, Chulvi and Rosso2023). Wang et al. (Reference Wang, Mansurov, Ivanov, Su, Shelmanov, Tsvigun, Afzal, Mahmoud, Puccetti and Arnold2024a, Reference Wang, Mansurov, Ivanov, Su, Shelmanov, Tsvigun, Whitehouse, Mohammed Afzal, Mahmoud, Sasaki, Arnold, Aji, Habash, Gurevych and Nakov2024c) collect M4, a large-scale multilingual dataset, to test the generalisation abilities of artificial text detectors. Our work differs from related studies in that we (i) create one of the first multi-domain and multi-generator large-scale corpora of artificial texts for Russian, covering standard downstream natural language generation tasks and TGMs; (ii) present a follow-up work on a shared task for artificial text detection in Russian (Shamardina et al. Reference Shamardina, Mikhailov, Chernianskii, Fenogenova, Saidov, Valeeva, Shavrina, Smurov, Tutubalina and Artemova2022), increasing the corpus size and extending the experimental setup to analyse the linguistic properties of human-written and machine-generated texts and explore the robustness of various detectors towards the text domain and TGM size. The earlier CoAT version, namely, the RuATD subcorpus, has been included in M4 and used at SemEval-2024 Task 8 on multi-domain, multimodel, and multilingual ATD (Wang et al. Reference Wang, Mansurov, Ivanov, su, Shelmanov, Tsvigun, Mohammed Afzal, Mahmoud, Puccetti, Arnold, Whitehouse, Aji, Habash, Gurevych and Nakov2024b, Reference Wang, Mansurov, Ivanov, Su, Shelmanov, Tsvigun, Whitehouse, Mohammed Afzal, Mahmoud, Sasaki, Arnold, Aji, Habash, Gurevych and Nakov2024c).
2.2 Artificial text detectors
Feature-based detectors
Classical machine learning methods are widely employed for detecting generated texts. Linear classifiers over TF-IDF character, sub-word, and word N-grams can serve as lightweight and strong baseline detectors (Manjavacas et al. Reference Manjavacas, De Gussem, Daelemans and Kestemont2017; Solaiman et al. Reference Solaiman, Brundage, Clark, Askell, Herbert-Voss, Wu, Radford, Krueger, Kim, Kreps, McCain, Newhouse, Blazakis, McGuffie and Wang2019; Ippolito et al. Reference Ippolito, Duckworth, Callison-Burch and Eck2020). Badaskar et al. (Reference Badaskar, Agarwal and Arora2008) train detectors over morphological, syntactic, and discourse features, such as POS tags, syntax-based LM grammaticality scores, and coherence. Another group of features is based on stylometry, a branch of computational linguistics that relies on statistical methods for authorship attribution and analysis of literary style (Holmes Reference Holmes1994; Abbasi and Chen Reference Abbasi and Chen2008). Stylometric features are used to train detectors and characterise properties of machine-generated and human-written texts (Uchendu et al. Reference Uchendu, Le, Shu and Lee2020, Reference Uchendu, Ma, Le, Zhang and Lee2021). Specific types of stylometric features can capture issues related to TGMs (Fröhling and Zubiaga Reference Fröhling and Zubiaga2021): (i) lack of syntactic and lexical diversity (POS tags, named entities, and coreference chains), (ii) repetitiveness (N-gram overlap of words and POS tags and counts of stopwords and unique words), (iii) lack of coherence (the appearance and grammatical roles of named entities), and (iv) lack of purpose (lexical features that represent spatial properties, sentiment, opinion, and logic).
The feature-based detectors are interpretable, cost-effective, and helpful when the dataset size is small (Uchendu, Le and Lee Reference Uchendu, Le and Lee2023). Stylometric detectors are useful for recognising texts generated with certain decoding strategies (Fröhling and Zubiaga Reference Fröhling and Zubiaga2021) but are significantly inferior in performance to Transformer-based detectors (Schuster et al. Reference Schuster, Schuster, Shah and Barzilay2020; Diwan et al. Reference Diwan, Chakraborty and Shafiq2021; Jones, Nurse and Li Reference Jones, Nurse and Li2022).
Transformer-based detectors
The current state-of-the-art approach is fine-tuning a pretrained Transformer LM for the ATD classification task. Zellers et al. (Reference Zellers, Holtzman, Rashkin, Bisk, Farhadi, Roesner and Choi2019) train a linear layer over the hidden representations from the GROVER and GPT-2 TGMs. RoBERTa (Liu et al. Reference Liu, Ott, Goyal, Du, Joshi, Chen, Levy, Lewis, Zettlemoyer and Stoyanov2019) has demonstrated outstanding performance across many TGM configurations and domains (Adelani et al. Reference Adelani, Mai, Fang, Nguyen, Yamagishi and Echizen2020; Fagni et al. Reference Fagni, Falchi, Gambini, Martella and Tesconi2021). Detectors built on top of TGMs identify texts generated by earlier TGM versions better than texts from more recent ones, which confirms that newer TGMs generate more human-like texts. Kushnareva et al. (Reference Kushnareva, Cherniavskii, Mikhailov, Artemova, Barannikov, Bernstein, Piontkovskaya, Piontkovski and Burnaev2021) introduce a topological data analysis (TDA) method for ATD, which combines properties of the feature-based and Transformer-based detectors. The TDA features are extracted from BERT attention matrices, which are represented as weighted attention graphs, and are used to train a linear detector. The features include standard graph properties, descriptive characteristics of barcodes, and distances to attention patterns (Clark et al. Reference Clark, Khandelwal, Levy and Manning2019). The TDA-based detectors outperform TF-IDF-based and neural detectors and show better robustness towards the size of unseen GPT-style TGMs. Transformer-based detectors are highly effective in ATD tasks and generalise to out-of-domain sets but are computationally expensive.
Zero-shot detectors
An alternative set of methods involves using probability-based measures along with predetermined thresholds (Solaiman et al. Reference Solaiman, Brundage, Clark, Askell, Herbert-Voss, Wu, Radford, Krueger, Kim, Kreps, McCain, Newhouse, Blazakis, McGuffie and Wang2019; Mitchell et al. Reference Mitchell, Lee, Khazatsky, Manning and Finn2023). These methods enable the human-in-the-loop approach, in which a user can recognise whether a text is machine-generated with the assistance of pretrained LMs. The GLTR tool (Gehrmann et al. Reference Gehrmann, Strobelt and Rush2019) facilitates interaction between humans and models by presenting statistical characteristics of a text inferred by the model, thereby enhancing the ability of humans to identify artificial texts. MAUVE (Pillutla et al. Reference Pillutla, Swayamdipta, Zellers, Thickstun, Welleck, Choi and Harchaoui2021) is a statistical detector that determines the difference in distribution between human and neural texts by utilising the KL-divergence. When MAUVE highlights differences between human and neural texts, human identification of texts generated by GROVER and GPT-2 strongly improves. Dugan et al. (Reference Dugan, Ippolito, Kirubarajan and Callison-Burch2020) propose RoFT (Real or Fake Text), a tool that assists in distinguishing between human-written and machine-generated texts. Their work highlights that TGMs have the ability to deceive humans with just a few sentences. Gallé et al. (Reference Gallé, Rozen, Kruszewski and Elsahar2021) explore unsupervised ATD, employing repeated higher-order N-grams. Their findings indicate that certain well-formed phrases occur more frequently in machine-generated texts compared to those written by humans. Although zero-shot detectors tend to underperform compared to the other established methods, they are beneficial in aiding humans to detect machine-generated texts.
3. Design
CoAT is composed of 246k human-written and artificial texts. The corpus creation methodology includes three main stages: (i) human-written text collection, (ii) artificial text generation, and (iii) post-processing and filtering.
3.1 Human-written text collection
We collect human-written data from six domains that cover normative Russian, general domain texts, social media posts, texts from various historical periods, bureaucratic texts with complex discourse structure and embedded named entities, and other domains specified in the task-specific datasets, such as subtitles and web-texts. It is important to note that, in addition to linguistic and stylometric features, texts vary in length (e.g., sentence-level vs. document-level) and in peculiarities inherent to the downstream tasks, which are described in more detail in Section 3.2.
Russian National Corpus. We use the diachronic sub-corpora of the Russian National Corpus (RNC),Footnote g which cover three historical periods of the modern Russian language (“pre-Soviet,” “Soviet,” and “post-Soviet”).
Social media. We collect posts published between 2010 and 2020 using the X (Twitter) API by querying generic, frequently used hashtags. These hashtags are not tied to specific domains or topics and include terms such as days of the week, months, seasons, holidays, or the names of the most popular cities in Russia. These texts are generally short, written in an informal style, and may include emojis and obscene lexis. We anonymise the texts by excluding the user handles and IDs.
Wikipedia. We use paragraphs from the top 100 most viewed Russian Wikipedia pages spanning the period of 2016–2021 according to the PageViewsFootnote h statistics.
News articles. The news segment covers different news sources in the Taiga corpus (Shavrina and Shapovalova Reference Shavrina and Shapovalova2017) and the corus library.Footnote i Since these corpora are publicly available, we additionally parse more recent news articles, published at the end of 2021, from various Russian news websites to prevent potential data leakage: at the moment of creating the corpus, the test set contains only articles parsed directly from the news websites rather than samples from the publicly available corpora.
Digitalised personal diaries. We use texts from the Prozhito corpus (Melnichenko and Tyshkevich Reference Melnichenko and Tyshkevich2017), which includes digitalised personal diaries written during the 20th century.
Strategic documents. Here, we use strategic documents from the Ministry of Economic Development of the Russian Federation (Ivanin et al. Reference Ivanin, Artemova, Batura, Ivanov, Sarkisyan, Tutubalina and Smurov2020). The documents are written in a bureaucratic style, rich in embedded entities, and have complex syntactic and discourse structure.
Task-specific datasets. We collect gold standard references from the Wikimatrix (Schwenk et al. Reference Schwenk, Chaudhary, Sun, Gong and Guzmán2021) and Tatoeba (Tiedemann Reference Tiedemann2012) machine translation datasets since they are generally written and/or validated by human annotators (Artetxe and Schwenk Reference Artetxe and Schwenk2019; Scialom et al. Reference Scialom, Dray, Lamprier, Piwowarski and Staiano2020; Hasan et al. Reference Hasan, Bhattacharjee, Islam, Mubasshir, Li, Kang, Rahman and Shahriyar2021). To ensure that low-quality instances are not included in CoAT, we filter these datasets based on the sentence length and remove duplicates before translating them into Russian.
3.2 Artificial text generation
We use human-written texts as the input to 13 TGMs varying in their number of parameters, architecture choices, and pre-training objectives. Each model is fine-tuned for one or more of the following natural language generation tasks: machine translation, paraphrase generation, text simplification, and text summarisation. In addition, we consider back-translation and zero-shot generation approaches.
Machine translation & back-translation. We use three machine translation models via the EasyNMT framework:Footnote j OPUS-MT (Tiedemann and Thottingal Reference Tiedemann and Thottingal2020), M-BART50 (Tang et al. Reference Tang, Tran, Li, Chen, Goyal, Chaudhary, Gu and Fan2020), and M2M-100 (Fan et al. Reference Fan, Bhosale, Schwenk, Ma, El-Kishky, Goyal, Baines, Celebi, Wenzek and Chaudhary2021). We select these models for their near state-of-the-art performance in English-Russian translation, the ease of use of the EasyNMT framework, and the diversity they offer in machine translation approaches: OPUS-MT for one-to-one translation, and M-BART50 and M2M-100 for many-to-many translation, with M-BART50 featuring a pretrained backbone. We use subsets of the Tatoeba (Artetxe and Schwenk Reference Artetxe and Schwenk2019) and WikiMatrix (Schwenk et al. Reference Schwenk, Chaudhary, Sun, Gong and Guzmán2021) datasets to obtain translations for three language pairs: English-Russian, French-Russian, and Spanish-Russian. In the back-translation setting, the input sentence is translated into one of the target languages and then back into Russian.
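The sketch below illustrates how such translations and back-translations can be produced with EasyNMT. It is a minimal illustration only: the model identifier strings follow the library's naming rather than our released scripts, and the pivot language is an assumption.

from easynmt import EasyNMT

model = EasyNMT("opus-mt")  # assumed identifier; "mbart50_m2m" and "m2m_100_418M" are the other options

def back_translate(sentences, pivot_lang="en"):
    # Russian -> pivot language -> Russian
    pivot = model.translate(sentences, source_lang="ru", target_lang=pivot_lang)
    return model.translate(pivot, source_lang=pivot_lang, target_lang="ru")

# Forward machine translation, e.g., for English-Russian pairs from Tatoeba/WikiMatrix:
# russian_outputs = model.translate(english_sources, source_lang="en", target_lang="ru")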
Paraphrase generation. Paraphrases are generated with models available in the russian-paraphrasers library (Fenogenova Reference Fenogenova2021a): ruGPT2-Large,Footnote k ruGPT3-Large (Zmitrovich et al. Reference Zmitrovich, Abramov, Kalmykov, Kadulin, Tikhonova, Taktasheva, Astafurov, Baushenko, Snegirev, Shavrina, Markov, Mikhailov and Fenogenova2024), ruT5-Base-Multitask,Footnote l and mT5 (Xue et al. Reference Xue, Constant, Roberts, Kale, Al-Rfou, Siddhant, Barua and Raffel2021) in its Small and Large versions.
Text simplification. We fine-tune ruGPT3-Small (Zmitrovich et al. Reference Zmitrovich, Abramov, Kalmykov, Kadulin, Tikhonova, Taktasheva, Astafurov, Baushenko, Snegirev, Shavrina, Markov, Mikhailov and Fenogenova2024), ruGPT3-Medium (Zmitrovich et al. Reference Zmitrovich, Abramov, Kalmykov, Kadulin, Tikhonova, Taktasheva, Astafurov, Baushenko, Snegirev, Shavrina, Markov, Mikhailov and Fenogenova2024), ruGPT3-Large, mT5-Large, and ruT5-Large (Zmitrovich et al. Reference Zmitrovich, Abramov, Kalmykov, Kadulin, Tikhonova, Taktasheva, Astafurov, Baushenko, Snegirev, Shavrina, Markov, Mikhailov and Fenogenova2024) for text simplification on a filtered version of the RuSimpleSentEval-2022 dataset (Sakhovskiy et al. Reference Sakhovskiy, Izhevskaya, Pestova, Tutubalina, Malykh, Smurov and Artemova2021; Fenogenova Reference Fenogenova2021b). Fine-tuning of each model is run for $4$ epochs with a batch size of $4$ , learning rate of $10^{-5}$ , and weight decay of $10^{-2}$ .
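A minimal fine-tuning sketch for the sequence-to-sequence models is given below. The hyperparameters match those listed above, whereas the checkpoint name, the maximum sequence lengths, and the data handling are assumptions for illustration.

from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer)

model_name = "ai-forever/ruT5-large"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# `pairs` is a hypothetical list of (complex_sentence, simplified_sentence) tuples
# from the filtered RuSimpleSentEval-2022 data.
def preprocess(batch):
    features = tokenizer(batch["source"], truncation=True, max_length=128)
    features["labels"] = tokenizer(batch["target"], truncation=True, max_length=128)["input_ids"]
    return features

train_dataset = Dataset.from_dict(
    {"source": [s for s, _ in pairs], "target": [t for _, t in pairs]}
).map(preprocess, batched=True, remove_columns=["source", "target"])

args = Seq2SeqTrainingArguments(
    output_dir="rut5-simplification",
    num_train_epochs=4,              # hyperparameters reported above
    per_device_train_batch_size=4,
    learning_rate=1e-5,
    weight_decay=1e-2,
)
trainer = Seq2SeqTrainer(
    model=model, args=args, train_dataset=train_dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()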
Text summarisation. We use two abstractive summarisation models fine-tuned on the Gazeta dataset (Gusev Reference Gusev2020): ruT5-baseFootnote m and M-BART.Footnote n
Open-ended generation. We generate texts in a zero-shot manner by prompting the model with the beginning of a text and limiting the output to a maximum of 500 tokens. The models include ruGPT3-Small, ruGPT3-Medium, and ruGPT3-Large.
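A hedged sketch of this setup is shown below; only the 500-token cap comes from the description above, while the checkpoint name, the prompt, and the sampling parameters are assumptions.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "ai-forever/rugpt3large_based_on_gpt2"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "В начале 2021 года"  # hypothetical prompt: the beginning of a human-written text
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=500,  # cap on the generated continuation
        do_sample=True,      # assumed decoding strategy
        top_p=0.95,
        pad_token_id=tokenizer.eos_token_id,
    )
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))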
3.3 Post-processing and filtering
Each generated text undergoes a post-processing and filtering procedure based on a combination of language processing tools and heuristics. First, we discard text duplicates, copied inputs, and empty outputs, and remove special tokens (e.g., <s>, </s>, and <pad>). Next, we discard texts containing obscene lexis according to the corpus of Russian obscene words.Footnote o We keep translations classified as Russian by the language detection modelFootnote p with a confidence of more than $90$ %. Finally, we empirically define text length intervals for each natural language generation task based on a manual analysis of length distributions in razdel Footnote q tokens. The texts are filtered by the following token ranges: 5-to-25 (machine translation, back-translation, paraphrase generation), 10-to-30 (text simplification), 15-to-60 (text summarisation), and 85-to-400 (open-ended generation). This removes the possibility of using length as a feature to distinguish texts and reduces the size of the corpus by approximately 30%.
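A minimal sketch of the length filter is given below, using the razdel tokeniser and the token ranges listed above; the task-name keys are hypothetical identifiers, not names from the released code.

from razdel import tokenize

LENGTH_RANGES = {
    "machine_translation": (5, 25),
    "back_translation": (5, 25),
    "paraphrase_generation": (5, 25),
    "text_simplification": (10, 30),
    "text_summarisation": (15, 60),
    "open_ended_generation": (85, 400),
}

def keep_text(text: str, task: str) -> bool:
    # Keep a text only if its razdel token count falls within the task-specific range.
    n_tokens = sum(1 for _ in tokenize(text))
    lo, hi = LENGTH_RANGES[task]
    return lo <= n_tokens <= hi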
4. Corpus analysis
4.1 General statistics
Number of texts
Tables 2 and 3 summarise the distribution of texts in CoAT by natural language generation task, model, and domain. The number of human-written and machine-generated texts is balanced within each natural language generation task and domain. Texts from the Russian National Corpus are the most common in CoAT (20%). News articles make up 19.6% of the texts, followed by strategic documents (18.5%) and texts from Wikipedia (17.5%). Digitalised personal diaries comprise 10%, while texts from social media and machine translation datasets account for 7% and 6%, respectively.
Length and frequency
Table 4 presents a statistical analysis of CoAT based on frequency and lexical richness metrics. We compute the token frequency in each text as the number of frequently used tokens (i.e., tokens whose number of instances per million in the Russian National Corpus is higher than one) divided by the total number of tokens in the text. The average text length in razdel tokens is similar for the sentence-level natural language generation tasks (back-translation, machine translation, and paraphrase generation), while texts produced by text simplification and summarisation models average $20.59$ and $31.45$ tokens, respectively. The overall average text length is $49.64$ tokens. The distribution of high-frequency tokens in human-written and machine-generated texts is similar within each generation task, comprising 88% high-frequency tokens on average.
Lexical richness
We evaluate the lexical richness of the CoAT texts using three measures from the lexicalrichness Footnote r library: word count, term count, and the corrected type-token ratio (CTTR). The corrected type-token ratio is calculated as $t/\sqrt{2 * w}$ , where $w$ is the total number of words, including functional words, and $t$ is the number of unique terms in the vocabulary. We observe that the ratio of the measures between natural and artificial texts depends on the task. In contrast to the other natural language generation tasks, neural texts generated using the open-ended generation approach are generally longer than the human-written ones and receive higher richness scores. The reason is that the models try to generate texts up to the 500-token limit and may produce non-existent words, degenerate text segments, or rare words.
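The three measures can be computed per text with the lexicalrichness library, as in the sketch below; the attribute names follow the library's documented API, while the wrapper function itself is purely illustrative.

from lexicalrichness import LexicalRichness

def richness(text: str) -> dict:
    lex = LexicalRichness(text)
    return {
        "words": lex.words,  # total word count w
        "terms": lex.terms,  # unique term count t
        "cttr": lex.cttr,    # corrected type-token ratio t / sqrt(2 * w)
    }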
4.2 Linguistic analysis
Stylometric features
Stylometry helps characterise the author’s style based on various linguistic features, which are commonly applied in ATD and in human and neural authorship attribution tasks (He et al. Reference He and Rasheed2004; Lagutina et al. Reference Lagutina, Lagutina, Boychuk, Vorontsova, Shliakhtina, Belyaeva, Paramonov and Demidov2019; Uchendu et al. Reference Uchendu, Cao, Wang, Luo and Lee2019, Reference Uchendu, Le, Shu and Lee2020). This section analyses the stylometric properties of human-written and machine-generated texts.
Following the motivation by Fröhling and Zubiaga (Reference Fröhling and Zubiaga2021), we manually inspect a subset of the artificial texts to define stylometric features that potentially capture text generation errors (see Table 5). The features can be informally categorised as follows:
(1) Surface and frequency features: (i) the text’s length in characters (Length), (ii) the fraction of punctuation marks in the text (PUNCT), (iii) the fraction of Cyrillic symbols in the text (Cyrillic), and (iv) the fraction of high-frequency words (IPM).
(2) Count-based morphological features: the fraction of (i) prepositions (PREP) and (ii) conjunctions in the text. The features are computed with pymorphy2 (Korobov Reference Korobov2015), a morphological analyser for Russian and Ukrainian; a short computation sketch is given after this list.
(3) Readability measures: (i) Flesch-Kincaid Grade (FKG), (ii) Flesch Reading-Ease (FRE), and (iii) LIX adapted to the Russian language. We use the ruTS library (Shkarin Reference Shkarin2023) to compute the features.
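The sketch below illustrates how feature groups (1) and (2) can be computed. The tokenisation, the Cyrillic check, and the helper function are simplifications for illustration, and the readability measures of group (3) are left to the ruTS library named above.

import string
import pymorphy2

morph = pymorphy2.MorphAnalyzer()

def stylometric_features(text: str) -> dict:
    words = [w.strip(string.punctuation) for w in text.split()]
    words = [w for w in words if w]
    pos_tags = [morph.parse(w)[0].tag.POS for w in words]
    n_chars = max(len(text), 1)
    return {
        "Length": len(text),
        "PUNCT": sum(ch in string.punctuation for ch in text) / n_chars,
        "Cyrillic": sum("а" <= ch.lower() <= "я" or ch.lower() == "ё" for ch in text) / n_chars,
        "PREP": pos_tags.count("PREP") / max(len(words), 1),
        "CONJ": pos_tags.count("CONJ") / max(len(words), 1),
    }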
Statistical analysis
Table 6 presents the summary of the stylometric feature values for the human-written and machine-generated texts and for each text generator independently.Footnote s Length is largely determined by the natural language generation task; texts produced by machine translation and paraphrase generation models are naturally shorter. In contrast, ruGPT3-Small, ruGPT3-Medium, and ruGPT3-Large generate longer texts, as expected in the zero-shot generation setup. At the same time, Cyrillic varies between the models, which can be attributed to translations, decoding confusions, copying parts of the input in another language, and hallucinations. We observe that the percentage of high-frequency words (IPM) in the texts is similar among the generators. The morphological and readability features differ on average between the TGMs and humans and between the TGMs. The results are supported by the Mann-Whitney U test, used to compare the distributions of the stylometric features in human-written and machine-generated texts. The distributions differ significantly for all features except IPM and PREP at $\alpha =0.05$ .
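A sketch of this significance test is given below; it assumes two arrays of per-text feature values (e.g., FKG scores for the human-written and machine-generated subsets) and uses the two-sided test from SciPy.

from scipy.stats import mannwhitneyu

def feature_differs(human_values, machine_values, alpha=0.05) -> bool:
    # Two-sided Mann-Whitney U test between the two feature distributions.
    _, p_value = mannwhitneyu(human_values, machine_values, alternative="two-sided")
    return p_value < alpha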
Analysis of the feature space
We apply principal component analysis (PCA; Pearson Reference Pearson1901) to the stylometric features computed on the weighted corpus subset. Figure 1 illustrates the two-dimensional distribution of the features by generative model. A large overlapping portion among the texts indicates that the stylometric features alone may not be sufficient to solve the ATD tasks.
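The projection can be reproduced along the lines of the sketch below; the feature matrix and label array are hypothetical inputs, and the standardisation step is an assumption rather than a detail reported above.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# `features` is a hypothetical (n_texts, n_features) matrix of stylometric values;
# `model_labels` is an array with the generator name (or "human") for each text.
X = StandardScaler().fit_transform(features)
components = PCA(n_components=2).fit_transform(X)

for label in np.unique(model_labels):
    mask = model_labels == label
    plt.scatter(components[mask, 0], components[mask, 1], s=4, label=label)
plt.legend()
plt.show()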
Discussion
The linguistic analysis indicates that the distributions of most of the considered stylometric features underlying the human-written and machine-generated texts are not the same (see Figure 2). However, a substantial portion of the texts overlaps, meaning that the properties of the texts are similar among the generators. We conclude that, due to this overlap, it might be challenging to distinguish between human-written and machine-generated texts and to identify the author of a given text using stylometric features alone.
5. Experiments
5.1 Method
Non-neural detectors
We use two non-neural text detectors via the scikit-learn library (Pedregosa et al. Reference Pedregosa, Varoquaux, Gramfort, Michel, Thirion, Grisel, Blondel, Prettenhofer, Weiss and Dubourg2011): a data-agnostic classifier, referred to as “Majority,” that predicts the most frequent class label in the training data, and a logistic regression classifier trained on TF-IDF features computed on word N-grams with the N-gram range $\in [1; 3]$ , denoted as “Linear”.
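Both baselines can be assembled in a few lines of scikit-learn, as sketched below; the N-gram range follows the description above, while the remaining vectoriser and classifier settings are assumptions.

from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# "Majority": always predicts the most frequent class in the training data.
majority_detector = DummyClassifier(strategy="most_frequent")

# "Linear": logistic regression over TF-IDF word N-grams with N in [1; 3].
linear_detector = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="word", ngram_range=(1, 3))),
    ("clf", LogisticRegression(max_iter=1000)),
])
# linear_detector.fit(train_texts, train_labels)
# predictions = linear_detector.predict(test_texts)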
Transformer-based detectors
We experiment with four monolingual and cross-lingual Transformer LMs: ruBERT-base (Zmitrovich et al. Reference Zmitrovich, Abramov, Kalmykov, Kadulin, Tikhonova, Taktasheva, Astafurov, Baushenko, Snegirev, Shavrina, Markov, Mikhailov and Fenogenova2024; 178M), ruRoBERTa-large (Zmitrovich et al. Reference Zmitrovich, Abramov, Kalmykov, Kadulin, Tikhonova, Taktasheva, Astafurov, Baushenko, Snegirev, Shavrina, Markov, Mikhailov and Fenogenova2024; 355M), XLM-R-base (Conneau et al. Reference Conneau, Khandelwal, Goyal, Chaudhary, Wenzek, Guzmán, Grave, Ott, Zettlemoyer and Stoyanov2020; 278M), and RemBERT (Chung et al. Reference Chung, Fevry, Tsai, Johnson and Ruder2020; 575M).
Performance metrics
The accuracy score is used as the target metric in the binary classification problems (Sections 5.2, 5.3, and 5.5), and macro-averaged $F_1$ is used as the target metric in the multi-class classification problem (Section 5.4). The results are averaged over three runs with different random seeds.
Training details
We maximise the validation set performance by running a grid search over a set of hyperparameters. We tune the $L_2$ regularisation coefficient $C \in \{0.01, 0.1, 1.0\}$ of the logistic regression model. We use the Transformer LMs’ weights and the fine-tuning and evaluation codebase from the HuggingFace Transformers library (Wolf et al. Reference Wolf, Debut, Sanh, Chaumond, Delangue, Moi, Cistac, Rault, Louf, Funtowicz, Davison, Shleifer, von Platen, Ma, Jernite, Plu, Xu, Le Scao, Gugger, Drame, Lhoest and Rush2020). The detectors are fine-tuned for $5$ epochs with a learning rate of $10^{-5}$ , a fixed weight decay of $10^{-4}$ , and batch sizes of $32$ for RemBERT and $64$ for the other LMs.
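For the linear detector, the grid search over $C$ on the validation set can be sketched as follows; the variable names are hypothetical, and only the grid itself comes from the description above.

from sklearn.base import clone
from sklearn.metrics import accuracy_score

best_c, best_accuracy = None, -1.0
for c in (0.01, 0.1, 1.0):
    candidate = clone(linear_detector).set_params(clf__C=c)  # pipeline from the sketch above
    candidate.fit(train_texts, train_labels)
    accuracy = accuracy_score(val_labels, candidate.predict(val_texts))
    if accuracy > best_accuracy:
        best_c, best_accuracy = c, accuracy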
Splits
CoAT is split into train, validation, and private test sets in the 70/10/20 ratio in a stratified fashion (172k/24k/49k examples). We use these sets to create train, validation, and private test subsets for each experiment described below: (i) artificial text detection (Section 5.2), (ii) artificial text detection by natural language generation task (Section 5.3), (iii) authorship attribution (Section 5.4), (iv) robustness towards unseen TGMs (Section 5.5.1), and (v) robustness towards unseen text domains (Section 5.5.2).
In Sections 5.2, 5.3, and 5.5, we train or fine-tune binary classifiers to distinguish between text written by a human and text generated with a TGM. In Section 5.4, the machine-generated label in the artificial text detection task is broken into 13 model names in the authorship attribution task. In each experiment configuration, we use the corresponding subsets of the CoAT train, validation, and private test set splits, which are balanced by the number of examples per target class, natural language generation task, TGM, and text domain.
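A simplified sketch of the stratified 70/10/20 split is shown below; it assumes a pandas dataframe with a single stratification column, whereas the actual subsets are additionally balanced by task, TGM, and domain as described above.

from sklearn.model_selection import train_test_split

# `corpus_df` is a hypothetical dataframe with a "label" column (H or M).
train_df, rest_df = train_test_split(
    corpus_df, test_size=0.3, stratify=corpus_df["label"], random_state=42)
val_df, test_df = train_test_split(
    rest_df, test_size=2 / 3, stratify=rest_df["label"], random_state=42)
# roughly 70% / 10% / 20% of the corpus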
5.2 Artificial text detection
Task formulation
This experiment aims to evaluate the performance of detectors in determining if a given text is automatically generated or written by a human. This task is framed as a binary classification problem with two labels: H (human-written) and M (machine-generated).
Splits
CoAT is split into train, validation, and private test sets in the 70/10/20 ratio in a stratified fashion (172k/24k/49k examples).
Human baseline
We conduct a human evaluation on the artificial text detection task using a stratified subset of 5k samples from the test set. The evaluation is run via Toloka (Pavlichenko, Stelmakh, and Ustalov Reference Pavlichenko, Stelmakh and Ustalov2021), a crowd-sourcing platform for data labelling.Footnote t The annotation setup follows the conventional crowd-sourcing guidelines for the ATD task and accounts for methodological limitations discussed in Ippolito et al. (Reference Ippolito, Duckworth, Callison-Burch and Eck2020); Clark et al. (Reference Clark, August, Serrano, Haduong, Gururangan and Smith2021). We provide a full annotation instruction in Appendix A and an example of the Toloka web interface in Figure 3.
The human evaluation project consists of an unpaid training stage and the main annotation stage with honeypot tasks for tracking annotators’ performance. The honeypot tasks are manually picked quality verification examples from the CoAT training set, which are mixed in with the main unlabelled examples. We compute the annotation performance by comparing the annotators’ votes with the honeypot examples’ gold labels. Before starting, each annotator receives a detailed instruction describing the task and showing annotation examples. The instruction is available at any time during the training and main annotation stages. Access to the project is granted to annotators ranked in the top $70$ % according to the Toloka rating system. Each annotator must finish the training stage by answering at least 27 out of 32 examples correctly. Trained annotators are paid at an average rate of $2.55/h, which is twice the hourly minimum wage in Russia. We aggregate the majority vote labels via a dynamic overlap of three to five trained annotators after (i) discarding votes from annotators whose performance on the honeypot tasks is below 50% and (ii) filtering out submissions with less than 15 s of response time per five texts.
Results
Table 7 presents the results of this experiment. Transformer-based detectors outperform non-neural detectors by a wide margin, with an average accuracy of 0.864. The highest accuracy is achieved by ruRoBERTa, followed by RemBERT, ruBERT, and XLM-RoBERTa. Notably, the monolingual ruRoBERTa and ruBERT yield significant improvements in accuracy despite having fewer parameters than the cross-lingual XLM-RoBERTa and RemBERT. At the same time, human annotators significantly underperform the detectors. The human evaluation yields an overall accuracy of 0.66, which is 0.07 points lower than that of non-neural detectors and 0.2 points lower than that of neural detectors.
Effect of length
We divide the test set into five groups of equal size based on text length to examine the effect of text length on performance. As depicted in Figure 4, the macro- $F_1$ scores of learnable detectors improve monotonically as the text length increases. For the shortest texts, non-neural and neural detectors achieve macro- $F_1$ scores of 0.6 and 0.75, respectively. The scores increase significantly, resulting in near-perfect predictions of 0.95 to 0.99 for the longest texts. The differences between the four Transformer-based detectors are most prominent for shorter texts but become less significant as the text length increases.
Humans exhibit comparable or better performance than non-neural detectors on texts of up to 31 words. However, humans seem to face difficulty in classifying texts that fall within the range of 32–92 words, while performing best on the longest texts. A possible explanation for this consistent human performance within the 0.6 to 0.75 macro- $F_1$ range is that humans tend to rely more on surface-level features, which may have a similar distribution across all text length groups.
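The length-bucket analysis can be reproduced along the lines of the sketch below; the dataframe and column names are hypothetical, and length is measured in whitespace-separated words as in the discussion above.

import pandas as pd
from sklearn.metrics import f1_score

# `test_df` is a hypothetical dataframe with "text", "gold", and "prediction" columns.
test_df["n_words"] = test_df["text"].str.split().str.len()
test_df["length_bin"] = pd.qcut(test_df["n_words"], q=5, duplicates="drop")  # five equal-size groups

per_bin_f1 = test_df.groupby("length_bin").apply(
    lambda group: f1_score(group["gold"], group["prediction"], average="macro"))
print(per_bin_f1)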
Discussion
The results demonstrate that state-of-the-art Transformer-based models can be relatively successful in distinguishing human-written texts from machine-generated ones for the Russian language. However, there is a rather stark contrast between the best scores obtained on the CoAT test set in the binary setup (0.86 accuracy) and scores obtained for a similar setup in English (0.970 accuracy; see Uchendu et al. (Reference Uchendu, Le, Shu and Lee2020) for reference). We attribute this disparity not to the language discrepancy but primarily to the nature of the texts. In the English setup, the average text length was 432 words, which is nearly nine times longer than CoAT’s 50-word average. Our findings align with the works by Munir et al. (Reference Munir, Batool, Shafiq, Srinivasan and Zaffar2021) and Stiff and Johansson (Reference Stiff and Johansson2022), which report that shorter texts are the most challenging for both human annotators and computational methods.
The low results in human evaluation are consistent with recent works by Karpinska et al. (Reference Karpinska, Akoury and Iyyer2021) and Uchendu et al. (Reference Uchendu, Ma, Le, Zhang and Lee2021), which underpin the difficulty of the generated text detection task for crowd-sourcing annotators. These works advise hiring experts trained to evaluate written texts or conducting multiple crowd-sourcing evaluation setups with extensive training stages.
5.3 Artificial text detection by task
Task formulation
In this experiment, we create six datasets, one for each natural language generation task. The detectors are independently trained and evaluated on the human-written and machine-generated texts from the same task. This setup allows for estimating the complexity of the tasks: the lower the performance scores, the more natural the TGM outputs are for a particular task.
Splits
CoAT is split into train, validation, and private test sets in the 70/10/20 ratio in a stratified fashion: 9.8k/1.4k/2.7k (machine translation), 12.4k/1.7k/3.4k (back-translation), 31.8k/4.4k/8.9k (paraphrase generation), 26.6k/3.7k/7.3k (text summarisation), 28k/3.8k/7.9k (text simplification), and 63k/9.2k/17.9k (open-ended generation).
Results
The results of our experiment are presented in Table 8. Averaged macro- $F_1$ scores are presented in Table 9. We observe that the detectors’ performance rankings are consistent with their performance in the artificial text detection setup (Section 5.2). ruRoBERTa consistently outperforms the other detectors with an average macro- $F_1$ of 0.814, while the other models show moderate performance. Analysing the results by natural language generation task, we find that machine-translated texts are the most difficult to detect: the best detector performs only slightly better than random prediction, with an accuracy of 0.621 and a macro- $F_1$ of 0.612. Back-translation, Paraphrase Generation, and Text Simplification are of intermediate difficulty, resulting in an accuracy of 0.768 and a macro- $F_1$ of 0.765 or above. Finally, we find that Text Summarisation and Open-ended Generation are much easier to detect. In fact, the detectors make near-perfect predictions for Open-ended Generation, with accuracies and macro- $F_1$ scores reaching up to 0.99.
Discussion
We suggest that the diversity of the TGMs’ outputs and the degree of control over the outputs affect the task difficulty. The outputs of Machine Translation and Paraphrase Generation models are semantically constrained by the inputs; thus, these TGMs are less likely to produce implausible or repetitive outputs. Open-ended Generation models suffer from hallucinations and repetitions, which at scale can be learnt by detectors.
5.4 Authorship attribution
Task formulation
The authorship attribution task aims to determine the author of a given text. The task is framed as a multi-class classification problem with 14 target classes: a human author and 13 TGMs. In this experiment, we use the same dataset as in Section 5.2, but instead of binary labels we use the source TGM labels as prediction targets. In this setup, the dataset is imbalanced: 50% of the samples are human-written, and the other 50% comprise the outputs of the 13 TGMs. We therefore rely on macro- $F_1$ as the main performance metric.
Splits
CoAT is split into train, validation, and private test sets in the 70/10/20 ratio in a stratified fashion (172k/24k/49k examples).
Results
Our findings, shown in Table 10, demonstrate that ruRoBERTa significantly outperforms the other models with a macro- $F_1$ score of 0.521. RemBERT and ruBERT perform similarly with macro- $F_1$ scores of 0.496 and 0.476, while XLM-RoBERTa achieves a lower macro- $F_1$ score of 0.451. These findings align with the results of artificial text detection (Section 5.2) and artificial text detection by task (Section 5.3). In terms of TGMs, the neural detectors achieve the highest average performance of 0.75 $F_1$ in detecting ruT5-Base, while the lowest average performance of 0.112 $F_1$ is observed in detecting ruT5-Large. However, the TGM size does not significantly affect the performance of neural detectors within the ruGPT3 family, as the difference in $F_1$ between detecting ruGPT3-Small and ruGPT3-Large is only 0.01. In summary, neural detectors exhibit significantly higher accuracy in identifying human-written texts (with an average $F_1$ score of 0.866) than in determining the source TGM.
Discussion
Authorship attribution is important when legal requirements demand revealing the text generation model, e.g., for transparency, intellectual property claims, or reproducing text generation. The task of authorship attribution is more challenging than the binary artificial text detection task, as observed in recent works (Uchendu et al. Reference Uchendu, Le, Shu and Lee2020, Reference Uchendu, Ma, Le, Zhang and Lee2021). The low performance scores indicate that it is challenging to differentiate between the TGMs, as they may lack distinct features that set them apart from each other, which aligns with the results of the corpus linguistic analysis in Section 4.
5.5 Robustness
5.5.1 Size of unseen GPT-based text generation models
Task formulation
This experiment setting tests the detectors’ robustness towards the size of unseen GPT-based TGMs. The detectors are trained or fine-tuned on the human-written texts and texts generated by one of the GPT-based models (ruGPT3-Small, ruGPT3-Medium, and ruGPT3-Large) as described in Section 5.1 and are further applied to detect texts from the out-of-domain GPT-based TGMs, i.e., those that are not present in the train set. Consider an example for the ruGPT3-Small model, where we train the detectors on a mixture of human-written texts and texts generated only by ruGPT3-Small, and evaluate the detectors on three mixtures of human-written texts and texts generated by ruGPT3-Small, ruGPT3-Medium, and ruGPT3-Large.
Splits
CoAT is split into train, validation, and private test sets in the 70/10/20 ratio in a stratified fashion: 26.2k/3.7k/7.7k (ruGPT3-small), 26k/3.7k/7.6k (ruGPT3-medium), and 28.4k/4k/8.2k (ruGPT3-large).
Results
The results are presented in Figure 5. We observe that the outputs of larger models are more challenging to identify, as all figures show an evident drop in performance. However, training on larger models’ outputs improves the detectors’ robustness, as the lines in the left-most plot become flatter. Analysing the results by detector, we find that all neural detectors achieve a strong performance of more than 90% accuracy. ruRoBERTa performs on par with RemBERT while having fewer parameters, and the linear detector achieves moderate performance, falling behind the Transformers by a large margin.
Discussion
Overall, the results align with the works of Solaiman et al. (Reference Solaiman, Brundage, Clark, Askell, Herbert-Voss, Wu, Radford, Krueger, Kim, Kreps, McCain, Newhouse, Blazakis, McGuffie and Wang2019) and Kushnareva et al. (Reference Kushnareva, Cherniavskii, Mikhailov, Artemova, Barannikov, Bernstein, Piontkovskaya, Piontkovski and Burnaev2021). The ATD task can become more challenging as TGMs are scaled up, since the TGM size directly affects detector performance. However, the detectors perform best across different TGM sizes when trained on outputs from the largest models.
5.5.2 Unseen text domain
Task formulation
Here, we analyse whether the detectors are robust in classifying TGMs’ outputs from unseen text domains. Similar to Section 5.5.1, we train or fine-tune the detectors on the human-written and machine-generated texts from one domain and evaluate them on the texts from the other domains. Consider an example for the news domain, where we train the detectors on a mixture of human-written and machine-generated texts from only the news domain, and evaluate the detectors on six mixtures of human-written and machine-generated texts from the domains of RNC, social media, Wikipedia, digitalised diaries (Prozhito), strategic documents (Minek), and news.
Splits
CoAT is split into train, validation, and private test sets in the 70/10/20 ratio in a stratified fashion: 30.8k/4.4k/10.2k (strategic documents), 34.7k/4.9k/8.5k (news articles), 17.2k/2.4k/4.9k (digitalised diaries), 36.3k/5.2k/8.5k (Russian National Corpus), 12.5k/1.7k/2.9k (social media), and 28.8k/4k/9.9k (Wikipedia).
Results
Figure 6 presents the results of this experiment. Overall, the detectors demonstrate similar transferred performance across all domains. The in-domain performance is up to 90% regardless of the detector’s parameter count and architecture. The detection is more reliable when training on texts from News and Wikipedia, as indicated by less spiky patterns in the corresponding figures showing results for the models trained on these subsets. However, training on texts from Minek, Prozhito, and Social Media may result in near-random performance on the out-of-domain test sets, as the corresponding figures show a single prominent peak. We also observe that ruRoBERTa achieves higher transferred accuracy than the other detectors, as seen from the purple dash-dotted line lying on top of the other lines in almost all patterns.
Discussion
While TGMs are rapidly proliferating in different areas of life, the related ATD research primarily focuses on one particular domain (see Section 2.1). Single-domain evaluation limits the analysis of the detectors’ weaknesses. Several works report that existing detectors exhibit inconsistent multi-domain performance (Bakhtin et al. Reference Bakhtin, Gross, Ott, Deng, Ranzato and Szlam2019; Kushnareva et al. Reference Kushnareva, Cherniavskii, Mikhailov, Artemova, Barannikov, Bernstein, Piontkovskaya, Piontkovski and Burnaev2021). To our knowledge, our work is among the first to analyse the detectors’ cross-domain generalisation. We empirically show that the detectors fail to transfer when trained on specific domains, such as strategic financial documents and social media posts.
6. Conclusion and future work
This work proposes CoAT, a large-scale corpus for neural text detection in Russian. CoAT comprises more than 246k human-written and machine-generated texts, covering various generative language models, natural language generation tasks, and text domains. Our corpus promotes the development of multi-domain artificial text detectors to warn humans about potentially generated content on news and social media platforms, such as fake news, generated product reviews, and propaganda spread with bots. We present a linguistic analysis of our corpus and extensively evaluate the feature-based and Transformer-based artificial text detectors. The key empirical results indicate that humans struggle to detect the generated text. At the same time, the detectors fail to transfer when trained on outputs from smaller TGMs and specific text domains.
In this paper, we explore multiple experimental setups in which we find the following:
(i) fine-tuning state-of-the-art Transformer-based models to determine whether the text was written by a human or generated by a machine leads to satisfactory results but leaves room for further improvement.
(ii) it is more difficult to detect texts generated by conditioned text generation models compared to open-ended generation.
(iii) determining the source text generation model is more difficult than determining whether the text was machine-generated.
(iv) fine-tuned detectors are not robust towards the size of the text generation model. Larger models are more difficult to detect.
(v) fine-tuned detectors are not robust towards the unseen text domain.
These observations underscore the challenge of implementing a trustworthy detector in real-life applications, where there is no information available about the domain and potential text generation model.
In our future work, we aim to explore ATD tasks in the multilingual setting. Another direction is to analyse the effect of the human evaluation design on the performance, e.g., the varying number of training examples and providing the input texts in the sequence-to-sequence tasks.
7. Ethical considerations
Crowd-sourcing annotation
Responses of human annotators are collected and stored anonymously. The average annotation pay is double the hourly minimum wage in Russia. The annotators are warned about potentially sensitive topics in data (e.g., politics, culture, and religion).
Social and ethical risks
The scope of risks associated with the misuse of generative language models is widely discussed in the community (Weidinger et al. Reference Weidinger, Mellor, Rauh, Griffin, Uesato, Huang, Cheng, Glaese, Balle and Kasirzadeh2021; Bommasani et al. Reference Bommasani, Hudson, Adeli, Altman, Arora, von Arx, Bernstein, Bohg, Bosselut and Brunskill2021). This problem has been addressed from the perspective of responsible artificial intelligence development: researchers and developers create new regulations and licenses (Contractor et al. Reference Contractor, McDuff, Haines, Lee, Hines, Hecht, Vincent and Li2022), require outputs from the TGMs to be marked as “generated,” and propose “watermarking” techniques to determine generated content (Kirchenbauer et al. Reference Kirchenbauer, Geiping, Wen, Katz, Miers and Goldstein2023). While our goal is to propose a novel large-scale ATD resource for the Russian language, we understand that the results of our work can be used maliciously, e.g., to reduce the performance of the detectors. However, we believe that CoAT will contribute to the development of more generalisable detectors for non-English languages.
8. Limitations
Data collection
Learnable artificial text detection methods require human-written and machine-generated texts. The design of ATD resources is inherently limited by the availability of diverse text domains and generative language models. In particular, such resources may cover a decreasing share of available models due to the rapid development of the field of natural language generation. While we have addressed the diversity of the corpus in terms of TGMs and text domains, it is crucial to continuously update CoAT to keep it up-to-date with recent TGMs and to conduct additional evaluation experiments.
Decoding strategies
The choice of the decoding strategy affects the quality of generated texts (Ippolito et al. Reference Ippolito, Kriz, Sedoc, Kustikova and Callison-Burch2019) and the performance of artificial text detectors (Holtzman et al. Reference Holtzman, Buys, Du, Forbes and Choi2020). The design of CoAT does not account for the diversity of decoding strategies, limiting the scope of the detectors’ evaluation. We leave this aspect for future work.
Acknowledgments
We acknowledge the computational resources of the HPC facilities at HSE University. We thank Elena Tutubalina, Daniil Cherniavskii, and Ivan Smurov for their contribution to the early stages of the project. We also appreciate the inclusion of the earlier CoAT version in the SemEval-2024 Task 8 corpus (Wang et al. Reference Wang, Mansurov, Ivanov, su, Shelmanov, Tsvigun, Mohammed Afzal, Mahmoud, Puccetti, Arnold, Whitehouse, Aji, Habash, Gurevych and Nakov2024b) for developing generalisable multi-domain, multimodel, and multilingual detectors.
Competing interests
The author(s) declare none.
Appendix A. Annotation protocols
Task overview. Choose between two judgements on the given text:
• The text is written by a human;
• The text is generated by an AI system.
Detailed task description. Follow the steps below:
• Carefully read the given text;
• Think about who could write this text;
• If you suppose that the text is written by a human, choose the Human option;
• If you suppose that the text is generated by an AI system, choose the AI option;
Examples.
Input text: A eto moja semja moja semja moja semja (And this is my family my family my family).
Choose the AI option. The text contains obvious unnatural repetitions.
Input text: The cat has managed to keep the behaviour pattern inherent in its wild ancestors. She is almost as good at hunting as a wild cat, but at the same time she is able to peacefully coexist with a person, show him emotional attachment, tenderness, or even show playful behaviour. (Koshka sumela sohranit’ model’ povedenija, prisushhuju ejo dikim predkam. Ona pochti tak zhe horosho ohotitsja, kak dikaja koshka, no v to zhe vremja sposobna mirno sosushhestvovat’ s chelovekom, projavljat’ k nemu jemocional’nuju privjazannost’, nezhnost’ ili dazhe vykazyvat’ igrivoe povedenie.)
Choose the Human option. The text sounds plausible and does not contain semantic violations.
Tips. You will get texts from multiple sources and in multiple genres. Texts may look like samples from newspapers, research papers, or social media. The following features help recognise generated texts:
• Inconsistent facts and incoherent writing;
• Violation of common sense and world knowledge;
• Unnecessary repetitions and abrupt endings.
The following features are NOT helpful and can be present in both human and AI texts:
• Spelling errors;
• Style fluency. Modern AI systems mimic humans well and can write human-like texts in any genre. It is easy to be fooled by an AI system that is able to write a research paper!
Appendix B. Human performance
Annotators show mixed results depending on the task. Generally, larger models are harder to detect than smaller ones. However, there is significant variation in accuracy when detecting different models.