Book contents
- Frontmatter
- Contents
- List of contributors
- 1 Introduction
- Part I Joint construction
- Part II Reference
- Part III Handling uncertainty
- Part IV Engagement
- Part V Evaluation and shared tasks
- 12 Eye tracking for the online evaluation of prosody in speech synthesis
- 13 Comparative evaluation and shared tasks for NLG in interactive systems
- Author index
- Subject index
- References
13 - Comparative evaluation and shared tasks for NLG in interactive systems
from Part V - Evaluation and shared tasks
Published online by Cambridge University Press: 05 July 2014
Summary
Introduction
Natural Language Generation (NLG) has strong evaluation traditions, particularly in user evaluation of NLG-based application systems, as conducted for example in the M-PIRO (Isard et al., 2003), COMIC (Foster and White, 2005), and SumTime (Reiter and Belz, 2009) projects. There are also examples of embedded evaluation of NLG components against non-NLG baselines, e.g., the DIAG (Di Eugenio et al., 2002), STOP (Reiter et al., 2003b), and SkillSum (Williams and Reiter, 2008) evaluations, and of evaluation of different versions of the same component, e.g., in the ILEX (Cox et al., 1999), SPoT (Rambow et al., 2001), and CLASSiC (Janarthanam et al., 2011) projects. Starting with Langkilde and Knight's work (Knight and Langkilde, 2000), automatic evaluation against reference texts also began to be used, especially in surface realization. What was missing, until 2006, were comparative evaluation results for directly comparable, but independently developed, NLG systems.
In 1981, Spärck Jones wrote that information retrieval (IR) lacked consolidation and the ability to progress collectively, and that this was substantially because there was no commonly agreed framework for describing and evaluating systems (Sparck Jones, 1981, p. 245). Since then, various sub-disciplines of natural language processing (NLP) and speech technology have consolidated results and progressed collectively by developing common task definitions and evaluation frameworks, in particular in the context of shared-task evaluation campaigns (STECs), and have achieved successful commercial deployment of a range of technologies (e.g., speech recognition software, document retrieval, and dialogue systems).
Natural Language Generation in Interactive Systems, pp. 302-350. Publisher: Cambridge University Press. Print publication year: 2014.