
Recent developments in language assessment and the case of four large-scale tests of ESOL ability

Published online by Cambridge University Press: 01 January 2009

Stephen Stoynoff*
Affiliation: Minnesota State University, Mankato ([email protected])

Abstract

This review article surveys recent developments and validation activities related to four large-scale tests of L2 English ability: the iBT TOEFL, the IELTS, the FCE, and the TOEIC. In addition to describing recent changes to these tests, the paper reports on the validation research conducted on each measure. The results of this research constitute some of the evidence available to support claims that these tests are suitable for their intended purposes. The discussion is organized around a framework that considers test purpose, selected test method characteristics, and important aspects of test usefulness.

Type: State-of-the-Art Article
Copyright © Cambridge University Press 2008

