Published online by Cambridge University Press: 01 January 2009
Evaluating interactive question answering (QA) systems with real users can be challenging because traditional evaluation measures, which are based on the relevance of returned items, are difficult to employ: relevance judgments can be unstable in multi-user evaluations. The work reported in this paper evaluates the effectiveness of three questionnaires in distinguishing among a set of interactive QA systems: a Cognitive Workload Questionnaire (NASA TLX) and Task and System Questionnaires customized to a specific interactive QA application. The questionnaires were evaluated with four systems, seven analysts, and eight scenarios during a 2-week workshop. Overall, results demonstrate that all three questionnaires are effective at distinguishing among systems, with the Task Questionnaire being the most sensitive. Results also provide initial support for the validity and reliability of the questionnaires.