
Experimenting with a computer essay-scoring program based on ESL student writing scripts

Published online by Cambridge University Press:  01 May 2009

David Coniam
Affiliation:
Dept of Curriculum and Instruction, Faculty of Education, The Chinese University of Hong Kong, Sha Tin, Hong Kong (email: [email protected])

Abstract

This paper describes a study of the computer essay-scoring program BETSY. Although the use of computers to rate written scripts has been criticised in some quarters for a lack of transparency, or for a poor fit with how human raters rate written scripts, a number of essay-rating programs are available commercially, many of which claim reliability comparable with that of human raters. Much of the validation of such programs has focused on native-speaking tertiary-level students writing in subject content areas. In contrast, the data for this study are drawn from a representative sample of scripts from an English as a second language (ESL) Year 11 public examination in Hong Kong. The scripts (900 in total) are taken from a writing test consisting of three topics (300 scripts per topic), each representing a different genre. Results show good correlations between human raters’ scores and those produced by BETSY. The rater discrepancy rate, the proportion of scripts that need to be re-marked because two raters disagree, emerged at levels broadly comparable with those derived from discrepancies between paired human raters. Little difference was apparent in test takers’ ratings across the three genres. The paper concludes that while computer essay-scoring programs may appear to rate inside a ‘black box’, with a concomitant lack of transparency, they do have potential to act as a third rater and as a time-saving assessment tool. As the technology develops and rating becomes more transparent, their acceptability should grow.
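BETSY treats essay grading as Bayesian text classification (Rudner and Liang, 2002): a script is assigned to the score band with the highest posterior probability given its features. The sketch below is only a minimal illustration of that general idea, not the BETSY implementation; the word-frequency features, Laplace smoothing, score bands and function names are assumptions made for the example.

```python
# Minimal sketch of Bayesian essay classification in the spirit of BETSY
# (Rudner & Liang, 2002). NOT the BETSY implementation: features, smoothing
# and band labels here are illustrative assumptions only.
from collections import Counter, defaultdict
import math

def train(scored_essays):
    """scored_essays: list of (text, band) pairs, e.g. band in 1..6.
    Returns per-band log priors and per-band word log likelihoods."""
    band_docs = defaultdict(list)
    for text, band in scored_essays:
        band_docs[band].append(text.lower().split())
    vocab = {w for docs in band_docs.values() for doc in docs for w in doc}
    total_docs = sum(len(docs) for docs in band_docs.values())
    priors, likelihoods = {}, {}
    for band, docs in band_docs.items():
        priors[band] = math.log(len(docs) / total_docs)
        counts = Counter(w for doc in docs for w in doc)
        denom = sum(counts.values()) + len(vocab) + 1      # Laplace smoothing
        likelihoods[band] = {w: math.log((counts[w] + 1) / denom) for w in vocab}
        likelihoods[band]["__unk__"] = math.log(1 / denom)  # unseen words
    return priors, likelihoods

def score(text, priors, likelihoods):
    """Return the score band with the highest posterior for a new script."""
    words = text.lower().split()
    best_band, best_logp = None, float("-inf")
    for band, prior in priors.items():
        logp = prior + sum(likelihoods[band].get(w, likelihoods[band]["__unk__"])
                           for w in words)
        if logp > best_logp:
            best_band, best_logp = band, logp
    return best_band
```

Classifying scripts into discrete bands, rather than predicting a continuous score, is what makes a Bayesian classifier a natural fit for rating scales of the kind human raters use; agreement with human raters can then be examined through correlations and a discrepancy (re-mark) rate, as in the study reported here.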

Type
Regular papers
Copyright
Copyright © European Association for Computer Assisted Language Learning 2009

References

Alderson, J. C. (2000) Technology in testing: The present and the future. System, 28: 593–603.
Attali, Y. and Burstein, J. (2006) Automated essay scoring with e-rater v.2. Journal of Technology, Learning, and Assessment, 4(3).
Burstein, J. (2003) The e-rater scoring engine: Automated essay scoring with natural language processing. In: Shermis, M. D. and Burstein, J. (eds.), Automated essay scoring: A cross-disciplinary perspective. Mahwah, NJ: Lawrence Erlbaum Associates, 113–122.
Chalhoub-Deville, M. and Deville, C. (1999) Computer adaptive testing in second language contexts. Annual Review of Applied Linguistics, 19: 273–299.
Chapelle, C. A. and Douglas, D. (2006) Assessing Language through Computer Technology. Cambridge, UK: Cambridge University Press.
Cheung, W. K., Mørch, A. I., Wong, K. C., Lee, C., Liu, J. and Lam, M. H. (2005) Grounding collaborative learning in semantics-based critiquing. In: Lau, R. W. H., Li, Q., Cheung, R. and Liu, W. (eds.), Advances in Web-based Learning – ICWL 2005. New York: Springer, 244–255.
Chung, G. K. and O’Neil, H. F., Jr. (1997) Methodological Approaches to Online Scoring of Essays. ERIC Document Reproduction Service No. ED 418 101.
Clapham, C. (2000) Assessment and testing. Annual Review of Applied Linguistics, 20: 147–161.
Coniam, D. (1998) Voice recognition software accuracy with second language speakers of English. System, 27(1): 1–16.
Coniam, D. (1999) Second language proficiency and word frequency in English. Asian Journal of English Language Teaching, 9: 59–74.
Coniam, D. (2005) Raw scores as examination results: How far can they be relied upon? Paper presented at the ALTE Second International Conference, Berlin, 19–21 May 2005.
Dikli, S. (2006) An overview of automated scoring of essays. Journal of Technology, Learning, and Assessment, 5(1). http://escholarship.bc.edu/jtla/
Drechsel, J. (1999) Writing into silence: Losing voice with writing assessment technology. Teaching English in the Two-Year College, 26(4): 380–387.
Foltz, P. W., Kintsch, W. and Landauer, T. K. (1998) The measurement of textual coherence with Latent Semantic Analysis. Discourse Processes, 25(2&3): 285–307.
Foltz, P. W., Laham, D. and Landauer, T. K. (1999) The Intelligent Essay Assessor: Applications to educational technology. Interactive Multimedia Electronic Journal of Computer-Enhanced Learning, 1(2). http://imej.wfu.edu/articles/1999/2/04/index.asp
Hatch, E. and Lazaraton, A. (1991) The Research Manual: Design and Statistics for Applied Linguistics. Boston, MA: Heinle and Heinle.
Hay, J. (1982) General impression marking: Some caveats. English in Australia, 59: 50–58.
Hong Kong Examinations and Assessment Authority (HKEAA) (2006) Language Proficiency Assessment for Teachers (English Language) 2006: Assessment Report. http://eant01.hkeaa.edu.hk/hkea/redirector.asp?p_direction=body&p_clickurl=otherexam%5Fbycategory%2Easp
Hong Kong Examinations and Assessment Authority (HKEAA) (2007) HKCEE English language examination report and question papers. Hong Kong: Hong Kong Examinations and Assessment Authority.
Hughes, A. (2003) Testing for language teachers. Cambridge, UK: Cambridge University Press.
Hunt, K. W. (1970) Syntactic maturity in school children and adults. Monographs of the Society for Research in Child Development, 35(1). Chicago: University of Chicago Press.
Jamieson, J. (2005) Research in language assessment: Trends in computer-based second language assessment. Annual Review of Applied Linguistics, 25: 228–242.
Landauer, T. K., Foltz, P. W. and Laham, D. (1998) Introduction to Latent Semantic Analysis. Discourse Processes, 25: 259–284.
Legislative Council Panel on Education (2005) LC Paper No. CB(2)323/05-06(01): Grant to support the modernization and development of the Hong Kong Examinations and Assessment Authority’s examination systems. http://www.legco.gov.hk/yr05-06/english/panels/ed/papers/ed1114cb2-323-1e.pdf
Linacre, J. M. (1994) FACETS: Rasch Measurement Computer Program. Chicago: MESA Press.
Linacre, J. M. (1997) Communicating examinee measures as expected ratings. Rasch Measurement Transactions, 11(1): 550–551. Retrieved October 11, 2007, from http://www.rasch.org/rmt/rmt111m.htm
McNamara, T. (1996) Measuring second language performance. New York: Longman.
Nerbonne, J. (2003) Natural language processing in computer-assisted language learning. In: Mitkov, R. (ed.), The Oxford Handbook of Computational Linguistics. Oxford: Oxford University Press, 670–698.
Page, E. B. (2003) Project Essay Grade: PEG. In: Shermis, M. D. and Burstein, J. (eds.), Automated essay scoring: A cross-disciplinary perspective. Mahwah, NJ: Lawrence Erlbaum Associates, 43–54.
Page, E. B., Poggio, J. P. and Keith, T. Z. (1997) Computer analysis of student essays: Finding trait differences in the student profile. AERA/NCME Symposium on Grading Essays by Computer.
Powers, D. E., Burstein, J. C., Chodorow, M., Fowles, M. E. and Kukich, K. (2001) Stumping e-rater: Challenging the validity of automated essay scoring (GRE Board Professional Rep. No. 98-08bP, ETS Research Rep. No. 01-03). Princeton, NJ: Educational Testing Service.
Powers, D. E., Burstein, J. C., Chodorow, M. S., Fowles, M. E. and Kukich, K. (2002) Comparing the validity of automated and human scoring of essays. Journal of Educational Computing Research, 26(4): 407–425.
Rudner, L. M. and Liang, T. (2002) Automated essay scoring using Bayes’ theorem. The Journal of Technology, Learning, and Assessment, 1(2). http://www.jtla.org
Valenti, S., Neri, F. and Cucchiarelli, A. (2003) An overview of current research on automated essay grading. Journal of Information Technology Education, 2: 319–330.
Warschauer, M. and Healey, D. (1998) Computers and language learning: An overview. Language Teaching, 31: 57–71.
Warschauer, M. and Ware, P. (2006) Automated writing evaluation: Defining the classroom research agenda. Language Teaching Research, 10(2): 1–24.
Weigle, S. C. (2002) Assessing writing. Cambridge: Cambridge University Press.