Book contents
- The Cambridge Handbook of Arabic Linguistics
- Cambridge Handbooks in Language and Linguistics
- The Cambridge Handbook of Arabic Linguistics
- Copyright page
- Contents
- Figures
- Tables
- Notes on Contributors
- Acknowledgements
- Abbreviations
- Introduction
- Part I Arabic Applied Linguistics
- Part II Arabic Variation and Sociolinguistics
- Part III Theoretical and Descriptive Studies
- Part IV Arabic Computational and Corpus Linguistics
- 18 Arabic Computational Linguistics
- 19 Arabic Corpus Linguistics and Related Tools
- 20 The Utility of Arabic Corpus Linguistics
- Part V Arabic Linguistics and New Media Studies
- Part VI Arabic Linguistics in Literature and Translation
- Index
- References
19 - Arabic Corpus Linguistics and Related Tools
An Overview and Some Critical Observations
from Part IV - Arabic Computational and Corpus Linguistics
Published online by Cambridge University Press: 23 September 2021
Summary
Mark Van Mol provides a critical review of the issues involved in constructing usable Arabic corpora and of the solutions that programmers have attempted. One such issue is whether a corpus is made freely available or placed behind a paywall. This distinction often correlates with corpus size as well: freely available corpora are generally larger and untagged for parts of speech (POS), whereas those behind paywalls are smaller and POS-tagged. The reason is clear: POS tagging requires large amounts of painstaking labour, whereas large amounts of text can be harvested from the Internet with web scraping applications in seconds. As for corpus size, differing units of measurement make corpora difficult to compare. Size may be expressed as the number of articles, hours, tokens, kilobytes, megabytes, sentences, words, and sometimes paragraphs that the corpus encompasses. One reason for this variety is that defining the searchable units of Arabic texts presents complications. Such considerations pertain directly to questions of corpus representativeness, which in turn raise the question of the nature of the phenomenon under scrutiny: whether the corpora are intended to represent Classical Arabic, modern written Arabic, or Arabic dialects.
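The point about incomparable size measures can be illustrated with a minimal Python sketch. It is not taken from the chapter: the sample sentence, the function name `corpus_sizes`, and the unit labels are all assumptions made for illustration. It measures one short, undiacritized Arabic text in several of the units mentioned above and shows that each unit yields a different "size".

```python
import re

# Hypothetical two-sentence sample (not from the chapter), undiacritized.
SAMPLE = "وكتب الولد الدرس. هل فهمه؟"


def corpus_sizes(text: str) -> dict:
    """Measure the same text in several different size units."""
    # Naive sentence split on Latin and Arabic sentence-final punctuation.
    sentences = [s for s in re.split(r"[.!?؟]+", text) if s.strip()]
    return {
        # Storage size: Arabic letters take two bytes each in UTF-8.
        "bytes": len(text.encode("utf-8")),
        # Unicode code points, including spaces and punctuation.
        "characters": len(text),
        # Whitespace-delimited words, punctuation still attached.
        "words": len(text.split()),
        # Word runs plus each punctuation mark as its own token.
        "punct_tokens": len(re.findall(r"\w+|[^\w\s]", text)),
        "sentences": len(sentences),
    }


if __name__ == "__main__":
    for unit, count in corpus_sizes(SAMPLE).items():
        print(f"{unit}: {count}")
```

None of these counts separates clitics: the first word of the sample, for instance, is a conjunction attached to a verb, so a morphologically segmented token count would be higher again. The sketch also assumes undiacritized input; short-vowel marks are combining characters that `\w` would likely not match, so vocalized text would need normalization first.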
- Type: Chapter
- Information: The Cambridge Handbook of Arabic Linguistics, pp. 446-472. Publisher: Cambridge University Press. Print publication year: 2021