INVESTIGATING INTER-RATER RELIABILITY OF QUALITATIVE TEXT ANNOTATIONS IN MACHINE LEARNING DATASETS

N. El Dehaibi; E. F. MacDonald

doi:10.1017/dsd.2020.153

Part of: Design Theory and Research Methods

Published online by Cambridge University Press: 11 June 2020

N. El Dehaibi*, Stanford University, United States of America
E. F. MacDonald, Stanford University, United States of America
* Corresponding author

Abstract

An important step when designers use machine learning models is annotating user-generated content. In this study, we investigate inter-rater reliability measures of qualitative annotations for supervised learning. We work with previously annotated product reviews from Amazon in which phrases related to sustainability are highlighted. We measure the inter-rater reliability of the annotations using four variations of Krippendorff's U-alpha. Based on the results, we offer suggestions to designers on measuring the reliability of qualitative annotations for machine learning datasets.

Keywords

artificial intelligence (AI), big data analysis, qualitative annotations, design methods

Type: Article
Information: Proceedings of the Design Society: DESIGN Conference, Volume 1, May 2020, pp. 21–30

DOI: https://doi.org/10.1017/dsd.2020.153
Creative Commons: This is an Open Access article, distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives licence (http://creativecommons.org/licenses/by-nc-nd/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is unaltered and is properly cited. The written permission of Cambridge University Press must be obtained for commercial re-use or in order to create a derivative work.
Copyright: The Author(s), 2020. Published by Cambridge University Press
