Measures of Agreement with Multiple Raters: Fréchet Variances and Inference

Jonas Moss

doi:10.1007/s11336-023-09945-2

Published online by Cambridge University Press: 27 December 2024

Jonas Moss*: Affiliation:
BI Norwegian Business School
*: Correspondence should be made to Jonas Moss, Department of Data Science and Analytics, BI Norwegian Business School, Oslo, Norway. Email: [email protected]

Abstract

Most measures of agreement are chance-corrected. They differ in three dimensions: their definition of chance agreement, their choice of disagreement function, and how they handle multiple raters. Chance agreement is usually defined in a pairwise manner, following either Cohen’s kappa or Fleiss’s kappa. The disagreement function is usually a nominal, quadratic, or absolute value function. But how to handle multiple raters is contentious, with the main contenders being Fleiss’s kappa, Conger’s kappa, and Hubert’s kappa, the variant of Fleiss’s kappa where agreement is said to occur only if every rater agrees. More generally, multi-rater agreement coefficients can be defined in a g-wise way, where the disagreement weighting function uses g raters instead of two. This paper contains two main contributions. (a) We propose using Fréchet variances to handle the case of multiple raters. The Fréchet variances are intuitive disagreement measures and turn out to generalize the nominal, quadratic, and absolute value functions to the case of more than two raters. (b) We derive the limit theory of g-wise weighted agreement coefficients, with chance agreement of the Cohen-type or Fleiss-type, for the case where every item is rated by the same number of raters. Trying out three confidence interval constructions, we end up recommending calculating confidence intervals using the arcsine transform or the Fisher transform.

Keywords

agreement, inter-rater reliability, AC1, Cohen kappa

Type: Original Research
Information: Psychometrika, Volume 89, Issue 2, June 2024, pp. 517-541

DOI: https://doi.org/10.1007/s11336-023-09945-2
Creative Commons: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Copyright: Copyright © 2024 The Author(s)

1. Introduction

The most popular measures of inter-rater agreement involve correction for chance agreement. These can be written in the form

(1.1)

\begin{matrix} \frac{p_{a} - p_{ca}}{1 - p_{ca}} = 1 - \frac{p_{d}}{p_{cd}}, \end{matrix}

where $p_{a}$ ($p_{d}$) is the percentage agreement (disagreement) between the raters and $p_{ca}$ ($p_{cd}$) is the chance agreement (disagreement) between the raters. Such measures are frequently called chance-corrected measures of agreement. Well-known examples of coefficients in this class are Cohen’s (1960) kappa and its weighted variant (Cohen, 1968), its multi-rater variant Conger’s kappa (Conger, 1980; Light, 1971), Krippendorff’s (1970) alpha, Scott’s (1955) pi, and Fleiss’s (1971) kappa. Some of these coefficients are defined only for two raters. The rest are defined in a pairwise manner, in the sense that they measure agreement between two raters at a time. However, not every proposed measure of agreement is defined on pairs of raters. The most famous exception is Hubert’s kappa (Hubert, 1977), which was recently studied in detail by Martín Andrés and Álvarez Hernández (2020). Other agreement coefficients include the $AC_{1}$ coefficient (Gwet, 2008), the recent coefficient of van Oest (2019), and a multitude of intraclass correlation coefficients (Gwet, 2014).
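The equality of the two expressions in (1.1) can be checked in one line. The sketch below assumes that disagreement is the complement of agreement, that is, $p_{d} = 1 - p_{a}$ and $p_{cd} = 1 - p_{ca}$; this reading is suggested by the text but is stated here as an assumption.

```latex
\begin{aligned}
1 - \frac{p_{d}}{p_{cd}}
  &= 1 - \frac{1 - p_{a}}{1 - p_{ca}}
   = \frac{(1 - p_{ca}) - (1 - p_{a})}{1 - p_{ca}}
   = \frac{p_{a} - p_{ca}}{1 - p_{ca}} .
\end{aligned}
```

For instance, with $p_{a} = 0.8$ and $p_{ca} = 0.5$, both sides equal $0.6$.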

There is no consensus on how multi-rater agreement coefficients should be defined. Broadly speaking, two options are considered: pairwise coefficients and consensus coefficients. The pairwise coefficients measure the agreement between pairs of raters (Conger, 1980), while the consensus coefficients measure the simultaneous agreement between all raters. In particular, consensus coefficients support the notion that “agreement occurs if and only if all raters agree on the categorization of an object” (Hubert, 1977). Both pairwise and consensus-based definitions of agreement are variants of g-wise measures of agreement (Conger, 1980), where agreement is measured among g-tuples of raters. The case where $2 < g < R$ has received little attention in the literature (Warrens, 2012), and non-trivial ways to measure agreement are hard to invent in this case. However, we introduce a promising and general framework for handling g-wise measures of agreement based on the concept of Fréchet variances (Dubey and Müller, 2019). The Fréchet variances generalize the variance, and the measures of agreement based on them generalize the nominal, linearly weighted, and quadratically weighted pairwise measures of agreement in a natural way. They are easily interpretable, as one measures how much the raters disagree with the generalized mean rater and then adjusts for chance. For nominal data in particular, they measure how many raters disagree with the modal rater, with a resulting agreement measure less extreme than Hubert’s kappa.

We need inferential theory for the g-wise agreement coefficients to make them useful. Much work has been done on inference for agreement coefficients, but, to our knowledge, inference for g-wise agreement coefficients has yet to be studied. Assuming multivariate normality of the ratings, Lin (1989, Section 3) derived the asymptotic distribution of Cohen’s kappa with quadratic weights. Fleiss (1971) introduced a formula for the standard error of Fleiss’s kappa, but later showed that it was incorrect. Using the properties of the multinomial distribution and the delta method, Schouten (1980) found the asymptotic variance of the weighted Fleiss’s kappa in the case when the number of categories is finite. Almost forty years later, Gwet (2021) found a consistent estimator of the variance for the unweighted Fleiss’s kappa. Below, we extend these results to the weighted g-wise Fleiss’s kappa for any number of categories. In addition, we mention that bootstrap inference for Fleiss’s kappa and Krippendorff’s alpha was studied by Zapf et al. (2016).

We begin the paper by providing the definitions of two kinds of chance-corrected agreement coefficients. Then, in Sect. 2, we establish connections between the multi-rater Cohen’s kappa, Fleiss’s kappa, Conger’s kappa, Krippendorff’s alpha, and Hubert’s kappa. We restrict ourselves to the context where every rater rates every item. In Sect. 3, we discuss the Fréchet variances mentioned above. Then we spell out the basic limit theory for this class of agreement coefficients in Sect. 4, extending the results of Schouten (1980), Schouten (1982), and O’Connell and Dobson (1984) to vector-valued items and g-wise coefficients. We do this using the theory of U-statistics (Lee, 2019), but there are other ways to arrive at the same results. Then, in Sect. 5, we provide practical recommendations regarding the choice of confidence interval, obtained by comparing three confidence interval constructions: basic, arcsine transformed, and Fisher transformed. Using a simulation study, we find that the arcsine and Fisher intervals outperform the basic interval when n is small.

2. Measures of Agreement

Let $d(x_{1}, \dots, x_{g})$ be a disagreement function, a positive and symmetric function of g arguments that equals 0 when all the $x_{i}$ are equal, i.e., $d(x, \dots, x) = 0$. The disagreement function quantifies the disagreement between the ratings $x_{1}, \dots, x_{g}$, where 0 is understood as complete agreement.

Most disagreement functions take two arguments. While there are infinitely many disagreement functions, the best-known belong to the class of $l_{p}$ quasi-norms, $p = 0, 1, 2$, potentially raised to the pth power. The $l_{p}$ quasi-norms, $p \in [0, \infty]$, in $\mathbb{R}^{k}$ are defined as

(2.1)

\begin{matrix} \| x \|_{p} = \left( \sum_{i = 1}^{k} | x_{i} |^{p} \right)^{1 / p} . \end{matrix}

Here $\| x \|_{0} = \sum_{i = 1}^{k} 1 [x_{i} \neq 0]$ and $\| x \|_{\infty} = \sup_{i} | x_{i} |$, as can be verified by taking the limit of $\| x \|_{p}^{p}$ as $p \to 0$ and of $\| x \|_{p}$ as $p \to \infty$, respectively. It is well known that $\| x \|_{p}$ are proper norms if and only if $p \geq 1$, as the triangle inequality is violated when $1 > p \geq 0$.
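As a small numerical illustration of (2.1) and its limiting cases, the following Python sketch evaluates the $l_{p}$ quasi-norm for $p = 0, 1, 2, \infty$ and checks one instance where the triangle inequality fails for $p = 1/2$. The function name `lp_quasi_norm` and the example vectors are illustrative choices, not part of the original text.

```python
import math

def lp_quasi_norm(x, p):
    """Return the l_p quasi-norm of a sequence of numbers x, for p in [0, inf]."""
    if p == 0:
        # Limiting case p = 0: count the non-zero coordinates.
        return float(sum(1 for xi in x if xi != 0))
    if math.isinf(p):
        # Limiting case p = infinity: largest absolute coordinate.
        return float(max(abs(xi) for xi in x))
    return sum(abs(xi) ** p for xi in x) ** (1.0 / p)

# Hypothetical example vector in R^3.
x = [3.0, 0.0, -4.0]
norms = {p: lp_quasi_norm(x, p) for p in (0, 1, 2, math.inf)}

# Triangle inequality check for p = 1/2, using the unit vectors e1 and e2.
e1, e2 = [1.0, 0.0], [0.0, 1.0]
lhs = lp_quasi_norm([1.0, 1.0], 0.5)                     # ||e1 + e2||_{1/2}
rhs = lp_quasi_norm(e1, 0.5) + lp_quasi_norm(e2, 0.5)    # ||e1|| + ||e2||
print(norms, lhs, rhs)
```

Here `lhs` exceeds `rhs`, so the triangle inequality does not hold for $p = 1/2$, in line with the statement above.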

Now define the disagreement functions $d_{p}$ as the $l_{p}$ quasi-norm evaluated at $x_{1} - x_{2}$, i.e.,

(2.2)

\begin{matrix} d_{p} (x_{1}, x_{2}) = \| x_{1} - x_{2} \|_{p} . \end{matrix}

In the case of scalar values, $d_{0}(x_{1}, x_{2}) = 1[x_{1} \neq x_{2}]$ is known as the nominal disagreement function. For $p = 1$, the $l_{p}$ norm equals $d_{1}(x_{1}, x_{2}) = |x_{1} - x_{2}|$, which is known as the absolute value disagreement function (and sometimes the linear disagreement function). The quadratic disagreement function is $d_{2}^{2}(x_{1}, x_{2}) = (x_{1} - x_{2})^{2}$. Vector-valued variants of $d_{p}$ and $d_{p}^{p}$ are much less common, but have been used by, e.g., Berry et al. (2008).
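For scalar ratings, the three pairwise disagreement functions above can be written in a few lines of Python. This is only a sketch; the function names `d_nominal`, `d_absolute`, and `d_quadratic` and the example ratings are hypothetical.

```python
def d_nominal(x1, x2):
    """Nominal disagreement d_0: 1 if the two ratings differ, 0 otherwise."""
    return 1.0 if x1 != x2 else 0.0

def d_absolute(x1, x2):
    """Absolute value (linear) disagreement d_1 for numeric ratings."""
    return abs(x1 - x2)

def d_quadratic(x1, x2):
    """Quadratic disagreement d_2^2 for numeric ratings."""
    return (x1 - x2) ** 2

# Two hypothetical ratings on a numeric scale.
a, b = 2, 5
values = (d_nominal(a, b), d_absolute(a, b), d_quadratic(a, b))
print(values)
```

All three functions are symmetric in their arguments and return 0 when the two ratings coincide, as required of a disagreement function.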

When the dimension of the disagreement function d is not equal to 2, we are mostly interested in the case where its dimension equals the number of raters R. In this case, the disagreement functions often measure the degree of consensus among the raters, with 0 reflecting complete consensus. The most obvious choice is the Hubert disagreement function,

(2.3)

\begin{matrix} d (x_{1}, \dots, x_{g}) = 1 - 1 [x_{1} = \dots = x_{g}] \end{matrix}

which equals 0 if and only if every rater agrees on a rating. This disagreement function is employed in Hubert’s kappa (Hubert, 1977).
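A minimal Python sketch of the Hubert disagreement function (2.3) follows. The name `d_hubert` and the example ratings are illustrative assumptions; the function accepts any number of ratings that can be compared for equality.

```python
def d_hubert(*ratings):
    """Hubert disagreement: 0 if all ratings are identical, 1 otherwise."""
    first = ratings[0]
    return 0 if all(r == first for r in ratings) else 1

# Three hypothetical raters assigning categories to the same item.
full_consensus = d_hubert("a", "a", "a")
one_dissenter = d_hubert("a", "a", "b")
print(full_consensus, one_dissenter)
```

With two arguments, this function coincides with the nominal disagreement function $d_{0}$.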

We present our results in terms of disagreement functions instead of the more popular agreement functions (i.e., positive symmetric functions bounded by 1 where 1 signifies maximal agreement, sometimes with the additional assumption that $a \geq 0$). We do this mainly for mathematical convenience. Agreement functions and disagreement functions are closely related, for if a is an agreement function, then $d = 1 - a$ is a disagreement function. Our results could have been framed in terms of agreement functions instead, though with some loss of generality. See Appendix (Sect. 6) for a short discussion.

Our results and definitions are framed in the following setup. Let R be the number of raters and n be the number of items rated. Moreover, let F be a fixed multivariate distribution function such that all rating vectors $\boldsymbol{X}_{i}$ are sampled independently from F. In symbols,

(2.4)

\begin{matrix} \boldsymbol{X}_{1}, \boldsymbol{X}_{2}, \dots, \boldsymbol{X}_{n} \overset{\text{iid}}{\sim} F . \end{matrix}

There are no restrictions on the rating vector components $\boldsymbol{X}_{ir}$. They can be, e.g., categorical, real numbers, or vectors.

Equation (2.4) implies that every item is rated by exactly the same number of raters, which we refer to as the rectangular design assumption. The assumption is common in the literature (Footnote 1), but far from universal. It can be relaxed, but it is strictly required for the limit results. We sketch how to loosen it in Appendix (Sect. 6), but we have made no attempts at an inferential theory for non-rectangular designs.

There are two important special cases covered by equation (2.4). First, in the case of fixed raters, the same set of ordered raters rate every item. Having fixed raters is common in applications of Cohen’s kappa, Conger’s kappa, and the concordance correlation coefficient (Footnote 2). Having fixed raters ensures that F does not vary across different rating vectors, but F could potentially vary with the ratings when the raters are not fixed, provided we do not make further assumptions. That leads us to the second case, that of exchangeable ratings given the item. Here, the rater identities do not affect the ratings given. The raters may be different for each item, but the distribution F will still be fixed. Exchangeable ratings occur when the ratings are identically distributed conditional on the item rated. Exchangeability of the ratings is an implicit assumption underlying most applications of Fleiss’s kappa, e.g., that of Fleiss (1971). In this case, the marginal distributions for all raters will be equal, which implies that the population value of the generalized Fleiss’s kappa equals the population value of the generalized Cohen’s kappa, both defined below. However, the sample Fleiss’s kappa is the preferred sample estimator, as it is invariant under changes of the raters’ identities.

We intend to collect the kappas of Cohen, Fleiss, Conger, Hubert, and so on, into a coherent framework of g-wise agreement coefficients. To do this, we will have to define some quantities. Let $\boldsymbol{x}_{i} = (x_{i1}, x_{i2}, \dots, x_{iR})$ be an R-dimensional vector of observed ratings, and recall that g is the dimension of our disagreement function d. The following definitions are natural population counterparts of sample definitions prevalent in the agreement literature.

(i) The disagreement at $\boldsymbol{x}_{1}$, as measured by d. The purpose of this quantity is to translate an arbitrary g-dimensional disagreement function d into a disagreement function taking an R-dimensional vector $\boldsymbol{x}_{1}$ as input. It is defined as
(2.5) $\begin{matrix} D_{d} (\boldsymbol{x}_{1}) = \binom{R}{g}^{-1} \sum_{r_{1}, \dots, r_{g}} d (\boldsymbol{x}_{1 r_{1}}, \dots, \boldsymbol{x}_{1 r_{g}}), \end{matrix}$
where the sum runs over all g-element subsets of $\{1, \dots, R\}$ with order ignored, i.e., the g-combinations of R. The expression simplifies when $g = R$, as $D_{d}(\boldsymbol{x}_{1}) = d(\boldsymbol{x}_{11}, \dots, \boldsymbol{x}_{1R})$ in this case. To gain some intuition about this quantity, suppose that $g = 2$, that $x_{1}, x_{2}$ are scalars, and consider the nominal disagreement function $d_{0}(x_{1}, x_{2}) = 1[x_{1} \neq x_{2}]$. Then $D_{d}(\boldsymbol{x}_{1}) = 2 R^{-1} (R - 1)^{-1} \sum_{r_{1} > r_{2}} 1[x_{1 r_{1}} \neq x_{1 r_{2}}]$ is the proportion of pairs of distinct raters who disagree on their rating.
(ii) The Cohen-type chance disagreement at $x_{1}, \dots, x_{g}$ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{x}_{1},\ldots ,\varvec{x}_{g}$$\end{document} , so called to differentiate it from the Fleiss-type chance disagreement. It is similar to the disagreement at $x_{1}$ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{x}_{1}$$\end{document} , but this time the raters do not necessarily rate the same item, as one rater rates the first item (from $x_{1}$ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{x}_{1}$$\end{document} ) another rater rates the second item (from $x_{2}$ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{x}_{2}$$\end{document} ), and so on. We do not allow a rater to rate the same item more than once in a pass: Hence, we need to choose g raters from a set of R raters, and the chance disagreement is
(2.6) $\begin{matrix} C_{d} (x_{1}, \dots, x_{g}) = {(\begin{matrix} R \\ g \end{matrix})}^{- 1} \sum_{r_{1}, \dots, r_{g}} d (x_{1 r_{1}}, \dots, x_{g r_{g}}), \end{matrix}$ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} C_{d}(\varvec{x}_{1},\ldots ,\varvec{x}_{g}) =\left( {\begin{array}{c}R\\ g\end{array}}\right) ^{-1}\sum _{r_{1},\ldots ,r_{g}}d(\varvec{x}_{1r_{1}}, \ldots ,\varvec{x}_{gr_{g}}), \end{aligned}$$\end{document}
where the sum runs over all g-dimensional subsets of ${1, \dots, R}$ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\{1,\ldots ,R\}$$\end{document} , i.e., the g-combinations of R. Observe that $D_{d} (x) = C_{d} (x,, \dots, x)$ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$D_{d}(\varvec{x})=C_{d}(\varvec{x},,\ldots ,\varvec{x})$$\end{document} . Since d is assumed to be symmetric, the expression is simplified to $d (x_{1 r_{1}}, \dots, x_{R r_{R}})$ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$d(\varvec{x}_{1r_{1}},\ldots ,\varvec{x}_{Rr_{R}})$$\end{document} when $g = R$ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$g=R$$\end{document} . 
When $g = 2$, $C_{d} (x_{1}, x_{2}) = R^{- 1} {(R - 1)}^{- 1} \sum_{r_{1} \neq r_{2}} d (x_{1 r_{1}}, x_{2 r_{2}})$.
(iii) The Fleiss-type chance disagreement at $x_{1}, \dots, x_{g}$ is similar, but allows the same rater to rate an item multiple times. Its definition is
(2.7) $\begin{matrix} F_{d} (x_{1}, \dots, x_{g}) = R^{- g} \sum_{r_{1}, \dots, r_{g}} d (x_{1 r_{1}}, \dots, x_{g r_{g}}), \end{matrix}$
where the sum runs over the product set $R^{g}$. The expression for $F_{d} (x_{1}, \dots, x_{g})$ is not dramatically simplified when $g = R$. When $g = 2$, $F_{d} (x_{1}, x_{2}) = R^{- 2} \sum_{r_{1}, r_{2}} d (x_{1 r_{1}}, x_{2 r_{2}})$.
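
For $g = 2$, the two chance terms differ only in whether the rater pairs with $r_{1} = r_{2}$ are included. The following is a minimal Python sketch of that difference, assuming scalar ratings and the nominal disagreement; the function names and the toy rating vectors are illustrative and do not come from the paper.

```python
from itertools import product


def nominal(a, b):
    """Nominal disagreement: 1 if the two ratings differ, else 0."""
    return float(a != b)


def cohen_chance_pair(x1, x2, d=nominal):
    """C_d(x1, x2) for g = 2: average of d over rater pairs with r1 != r2."""
    R = len(x1)
    total = sum(d(x1[r1], x2[r2])
                for r1, r2 in product(range(R), repeat=2) if r1 != r2)
    return total / (R * (R - 1))


def fleiss_chance_pair(x1, x2, d=nominal):
    """F_d(x1, x2) for g = 2: average of d over all R^2 rater pairs."""
    R = len(x1)
    total = sum(d(x1[r1], x2[r2]) for r1, r2 in product(range(R), repeat=2))
    return total / R ** 2


# Hypothetical ratings of two items by the same R = 3 raters.
x1 = [1, 2, 2]
x2 = [1, 1, 3]
c = cohen_chance_pair(x1, x2)       # off-diagonal rater pairs only
f = fleiss_chance_pair(x1, x2)      # all rater pairs
d_obs = cohen_chance_pair(x1, x1)   # D_d(x1) = C_d(x1, x1)
```

Passing the same vector twice to `cohen_chance_pair` gives the observed disagreement of a single item, mirroring the identity $D_{d} (x) = C_{d} (x, \dots, x)$.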

We will call the expected values of these quantities the mean disagreement, the mean Cohen-type chance disagreement, and the mean Fleiss-type chance disagreement. Slightly abusing notation, we denote them as

(2.8)

\begin{matrix} D_{d} = E [D_{d} (X_{1})], C_{d} = E [C_{d} (X_{1}, \dots, X_{g})], F_{d} = E [F_{d} (X_{1}, \dots, X_{g})], \end{matrix}

where $X_{1}, \dots, X_{g}$ are independently sampled from the same distribution F. Discussions of the difference between $E [C_{d} (X_{1}, \dots, X_{g})]$ and $E [F_{d} (X_{1}, \dots, X_{g})]$, and of why one should be preferred over the other, are abundant in the literature, often in the context of the so-called paradox of kappa (Cicchetti and Feinstein, 1990).

Definition 1

Let $X \sim F$ be a vector of R ratings and d be a disagreement function with dimension g. Define the population values of the generalized Cohen’s kappa $(κ_{d})$ and Fleiss’s kappa $(π_{d})$ as

(2.9)

\begin{matrix} κ_{d} = 1 - \frac{D_{d}}{C_{d}}, π_{d} = 1 - \frac{D_{d}}{F_{d}} . \end{matrix}

The generalized Fleiss’s kappa, denoted as $π_{d}$ since it generalizes Scott’s pi (Scott, 1955), is a straightforward generalization of Fleiss’s kappa (Fleiss, 1971) to hold for $2 < g \leq R$. When $g = R$ and d is the nominal disagreement, it equals Hubert’s kappa. Likewise, the generalized Cohen’s kappa is an extension of weighted Conger’s kappa to hold for $2 \leq g \leq R$. When $g = R$, it equals the Schuster–Smith coefficient (Schuster & Smith, 2005, eq. 1; see Footnote 3). It generalizes several other agreement coefficients as well.
For instance, Berry and Mielke (1988) discussed what we call $κ_{d}$ for Euclidean weights between vector-valued ratings, while Janson and Olsson (2001) extended it to squared Euclidean and nominal weights. The relationship between most of the mentioned agreement coefficients is summarized in Table 1.

Table 1 Weighted agreement coefficients.

*Lin’s concordance coefficient and the concordance correlation coefficient (CC) are defined for quadratic weights only.

3. Sample Estimates

Let $X_{1}, \dots, X_{n} \sim F$ be n iid vectors of ratings. Then there is a single natural sample estimator of $D_{d}$, namely

(2.10)

\begin{matrix} {\hat{D}}_{d} = n^{- 1} \sum_{i = 1}^{n} D_{d} (x_{i}) . \end{matrix}

There are, however, two natural estimators of the Cohen-type chance disagreement: one of them a V-statistic (Lee, 2019, Chapter 4.2) and the other a U-statistic (Lee, 2019, Chapter 1),

(2.11)

\begin{matrix} {\hat{C}}_{d} = n^{- g} \sum_{i_{1}, \dots, i_{g}} C_{d} (x_{i_{1}}, \dots, x_{i_{g}}) \quad \text{and} \quad {\hat{C}}_{d}^{u} = {\binom{n}{g}}^{- 1} \sum_{i_{1}, \dots, i_{g}} C_{d} (x_{i_{1}}, \dots, x_{i_{g}}), \end{matrix}

where the sum in the first estimator runs over all g-tuples $i_{1}, i_{2}, \dots, i_{g}$, repetitions allowed, and the sum in the second estimator runs over the unordered combinations $i_{1} < i_{2} < \dots < i_{g}$. Using the basic results of U-statistics (Lee, 2019, Chapter 1), we see that ${\hat{C}}_{d}^{u}$ is the unique minimum-variance unbiased estimator of $C_{d}$, which makes it attractive from a theoretical point of view.
However, by a well-known correspondence between U-statistics and V-statistics, the asymptotic distribution of ${\hat{C}}_{d}$ coincides with that of ${\hat{C}}_{d}^{u}$ (Lee, 2019, Chapter 4, Theorem 1), so the choice between ${\hat{C}}_{d}$ and ${\hat{C}}_{d}^{u}$ barely matters when n is sufficiently large.

Likewise, there are two natural estimators of the Fleiss-type chance disagreement,

(2.12)

\begin{matrix} {\hat{F}}_{d} = n^{- g} \sum_{i_{1}, \dots, i_{g}} F_{d} (x_{i_{1}}, \dots, x_{i_{g}}) \quad \text{and} \quad {\hat{F}}_{d}^{u} = {\binom{n}{g}}^{- 1} \sum_{i_{1}, \dots, i_{g}} F_{d} (x_{i_{1}}, \dots, x_{i_{g}}), \end{matrix}

where the index sets are described above.

Now, we can define two sample variants of Cohen’s kappa (Fleiss’s kappa), depending on which one of ${\hat{C}}_{d}$ (${\hat{F}}_{d}$) and ${\hat{C}}_{d}^{u}$ (${\hat{F}}_{d}^{u}$) we choose to use.
These are ${\hat{κ}}_{d} = 1 - {\hat{D}}_{d} / {\hat{C}}_{d}$ and ${\hat{κ}}_{d}^{u} = 1 - {\hat{D}}_{d} / {\hat{C}}_{d}^{u}$ for Cohen’s kappa and ${\hat{π}}_{d} = 1 - {\hat{D}}_{d} / {\hat{F}}_{d}$ and ${\hat{π}}_{d}^{u} = 1 - {\hat{D}}_{d} / {\hat{F}}_{d}^{u}$ for Fleiss’s kappa.
The definition of the sample Cohen’s kappa (Cohen, 1968) agrees with ${\hat{κ}}_{d}$, not with ${\hat{κ}}_{d}^{u}$. Likewise, the sample Fleiss’s kappa has a definition agreeing with ${\hat{π}}_{d}$ (Fleiss, 1971). Moreover, due to the possibility of binning data, ${\hat{π}}_{d}$ and ${\hat{κ}}_{d}$ are faster to compute when the data is not continuous.
Since the estimators are asymptotically equivalent in any case, we will stick to the V-statistics ${\hat{κ}}_{d}$ and ${\hat{π}}_{d}$ for estimation, but use the U-statistic form when deriving limit distributions. We note that, since we need to compute strictly fewer combinations, ${\hat{κ}}_{d}^{u}$ and ${\hat{π}}_{d}^{u}$ are faster to compute when the data is continuous, which may be useful in some settings.
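
The four sample coefficients can be sketched in a few lines for the pairwise case $g = 2$. The code below is an illustration under stated assumptions rather than a reference implementation: it assumes scalar ratings, the nominal disagreement, and brute-force enumeration of index pairs, and all function names and the toy data are hypothetical.

```python
from itertools import combinations, product


def nominal(a, b):
    """Nominal disagreement: 1 if the ratings differ, else 0."""
    return float(a != b)


def rater_pair_mean(x1, x2, d, allow_same_rater):
    """Average d(x1[r1], x2[r2]) over rater pairs.

    allow_same_rater=False gives the Cohen-type C_d (r1 != r2),
    allow_same_rater=True gives the Fleiss-type F_d (all R^2 pairs).
    """
    R = len(x1)
    pairs = [(r1, r2) for r1, r2 in product(range(R), repeat=2)
             if allow_same_rater or r1 != r2]
    return sum(d(x1[r1], x2[r2]) for r1, r2 in pairs) / len(pairs)


def item_pair_mean(X, d, allow_same_rater, u_statistic):
    """Average the chance term over item pairs (V- or U-statistic)."""
    n = len(X)
    if u_statistic:
        idx = list(combinations(range(n), 2))    # i1 < i2 only
    else:
        idx = list(product(range(n), repeat=2))  # all n^2 pairs, incl. i1 == i2
    return sum(rater_pair_mean(X[i], X[j], d, allow_same_rater)
               for i, j in idx) / len(idx)


def sample_kappas(X, d=nominal):
    """Return (kappa_hat, kappa_hat_u, pi_hat, pi_hat_u) for g = 2."""
    D = sum(rater_pair_mean(x, x, d, False) for x in X) / len(X)
    C_v = item_pair_mean(X, d, False, False)
    C_u = item_pair_mean(X, d, False, True)
    F_v = item_pair_mean(X, d, True, False)
    F_u = item_pair_mean(X, d, True, True)
    return 1 - D / C_v, 1 - D / C_u, 1 - D / F_v, 1 - D / F_u


# Hypothetical data: n = 3 items, each rated by the same R = 2 raters.
X = [[1, 1], [2, 2], [1, 2]]
kappa_hat, kappa_hat_u, pi_hat, pi_hat_u = sample_kappas(X)
```

On these toy data the V-statistic versions coincide with the classical two-rater Cohen’s kappa and Scott’s pi computed from the 2 × 2 contingency table, while the U-statistic versions differ, as expected for so small an n.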

3. Fréchet Variances for g-Wise Agreement Coefficients

The most popular measures of agreement are defined only for $g = 2$. It is easy to find reasonable disagreement measures in this case, as one can draw on the extensive literature on norms and distances. The $l_{p}$ distances are the obvious choices, but there are many unexplored options, such as the Huber loss (Huber, 1964) and the LINEX loss (Varian, 1975).

In the setting of Hubert’s kappa and the Schuster–Smith coefficient, we have $g = R$, and it is no longer easy to find reasonable disagreement functions. The disagreement function used in Hubert’s kappa, $d (x_{1}, \dots, x_{R}) = 1 - 1 [x_{1} = \dots = x_{R}]$, penalizes any number of discordant ratings equally, yielding the often undesirable outcome that most sets of ratings are in complete disagreement. But there are less sensitive ways to count nominal disagreements. Consider 10 raters using a three-point ordinal scale (1–3), with 7 raters giving rating 1, 2 giving rating 2, and 1 giving rating 3. Then Hubert’s disagreement is 1, as the rating vector is not constant, and the pairwise disagreement is 46/100. But it sounds reasonable to pick the modal rating (in this case 1) and then report the number of raters that disagree with it, divided by the number of raters. In this case, the number of raters disagreeing with the modal rating is 3, and the “modal” disagreement equals 3/10.
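
The three nominal disagreement values in this example can be reproduced with a short Python sketch. The variable names are illustrative. The pairwise figure is computed over all $R^{2}$ ordered rater pairs, self-pairs included, which is the reading under which the value 46/100 is obtained.

```python
from collections import Counter
from itertools import product

# 10 raters: seven give rating 1, two give rating 2, one gives rating 3.
ratings = [1] * 7 + [2] * 2 + [3]
R = len(ratings)

# Hubert-type disagreement: 0 only if every rater gives the same rating.
hubert = 1.0 - float(len(set(ratings)) == 1)

# Pairwise nominal disagreement over all R^2 ordered rater pairs.
pairwise = sum(a != b for a, b in product(ratings, repeat=2)) / R ** 2

# "Modal" disagreement: share of raters who differ from the modal rating.
mode, mode_count = Counter(ratings).most_common(1)[0]
modal = (R - mode_count) / R
```

Excluding the self-pairs would instead divide the same 46 disagreeing pairs by $R (R - 1) = 90$.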

Sometimes we wish to aggregate numerical ratings instead of categorical ratings. Consider the above case again but with the median (which is 1) instead of the mode. It is well known that the median of a vector x is equal to $\operatorname{argmin}_{μ} \frac{1}{R} \sum_{r = 1}^{R} | x_{r} - μ |$, so $\min_{μ} \frac{1}{R} \sum_{r = 1}^{R} | x_{r} - μ |$ (mean absolute deviation from the median) appears to be a reasonable measure of the mean disagreement when we use the median as the aggregation method. The resulting mean disagreement of the previous example is $\min_{μ} \frac{1}{R} \sum_{r = 1}^{R} | x_{r} - μ | = \frac{1}{10} \sum_{r = 1}^{10} | x_{r} - 1 | = 4 / 10$.

The “modal” and “median” disagreement measures are instances of an intuitive generalization of the variance called the Fréchet variance (Dubey and Müller, 2019). Let l be a distance function satisfying $l (x, y) \geq 0$ and $l (x, x) = 0$, and let $A = \{x_{1}, x_{2}, \dots, x_{R}\}$ be a set of points. The sample Fréchet mean of A is defined as the (not necessarily unique) point $μ_{l}$ that minimizes the sum of distances to all points in A, that is (see Footnote 4),

(3.1)

\begin{matrix} μ_{l} [A] = \operatorname{argmin}_{μ} \sum_{r = 1}^{R} l (μ, x_{r}) . \end{matrix}

Similarly, the sample Fréchet variance on A with distance function l is

(3.2)

\begin{matrix} V (l) [A] = \min_{μ} \sum_{r = 1}^{R} \frac{1}{R} l (μ, x_{r}) = \sum_{r = 1}^{R} \frac{1}{R} l (μ_{l} [A], x_{r}) . \end{matrix}

The Fréchet mean (Fréchet, 1948) is a generalization of centroids to arbitrary distance functions l; likewise, the Fréchet variance is a generalization of dispersion to any such distance function. They are best understood through a decision-theoretic lens: The Fréchet mean of A represents your best guess of the true classification or value of an item according to the distance l; the Fréchet variance V(l) is the decision-theoretic risk associated with the choice. See Cooil and Rust (1994) for an investigation of a closely related idea in the context of agreement measures.

Define the g-dimensional disagreement based on l as

(3.3)

\begin{matrix} d (x_{1}, \dots, x_{g}) = V (l) [\{x_{1}, \dots, x_{g}\}] . \end{matrix}

The most important distance functions are:

(i) $d_{0} (x, y) = 1 [x \neq y]$. Generalizes the nominal distance. If the data are categorical, the Fréchet mean $μ_{d}$ equals the mode, and the Fréchet variance equals the proportion of observations different from the mode. If we are dealing with vector-valued data with I elements each, it might be preferable to use $I^{- 1} \sum_{i = 1}^{I} 1 [x_{i} \neq y_{i}]$ instead, as it counts each dimension of the nominal data separately.
(ii) $d_{1} (x, y) = \| x - y \|_{1}$. For scalar ratings, the Fréchet mean is equal to the sample median. The Fréchet variance equals the sample mean absolute deviation from the median, i.e., $\frac{1}{R} \sum_{r = 1}^{R} | x_{r} - μ_{d} |$, where $μ_{d}$ is the sample median.
(iii) $d_{2}^{2} (x, y) = \| x - y \|_{2}^{2}$. For scalar ratings, the Fréchet mean is equal to the sample mean $μ_{d} = \frac{1}{R} \sum_{r = 1}^{R} x_{r}$, and the Fréchet variance is equal to the biased sample variance of $\{x_{1}, x_{2}, \dots, x_{R}\}$, that is, $\frac{1}{R} \sum_{r = 1}^{R} {(x_{r} - μ_{d})}^{2}$.
(iv) $d_{2} (x, y) = \| x - y \|_{2}$. For vector-valued data, the Fréchet mean has no simple formula, but is known as the geometric median. If the data is scalar, $d_{2} = d_{1}$, which implies that the Fréchet mean equals the median, hence the name. There is an extensive literature on the geometric median; see, e.g., Drezner et al. (2002) for an overview and Cohen et al. (2016) for how to compute it. When the ratings are vector-valued, the geometric median is far more computationally expensive than the Fréchet mean based on $\| x - y \|_{2}^{2}$.
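
For scalar ratings, the closed forms in items (i) to (iii) translate directly into code. The sketch below is a minimal Python illustration with hypothetical function names; it assumes scalar ratings and, for $d_{1}$, picks the lower median when the median is not unique (any median attains the same minimum).

```python
from collections import Counter
from statistics import mean, median_low


def frechet_var_nominal(x):
    """V(d_0): share of ratings that differ from the mode."""
    return 1 - Counter(x).most_common(1)[0][1] / len(x)


def frechet_var_abs(x):
    """V(d_1): mean absolute deviation from a sample median."""
    m = median_low(x)  # any median minimizes the sum; take the lower one
    return sum(abs(v - m) for v in x) / len(x)


def frechet_var_sq(x):
    """V(d_2^2): biased sample variance (divisor R, not R - 1)."""
    m = mean(x)
    return sum((v - m) ** 2 for v in x) / len(x)


# The 10-rater example: seven 1s, two 2s, one 3.
ratings = [1] * 7 + [2] * 2 + [3]
v_nominal = frechet_var_nominal(ratings)  # "modal" disagreement
v_abs = frechet_var_abs(ratings)          # "median" disagreement
v_sq = frechet_var_sq(ratings)            # quadratic disagreement
```

The first two values reproduce the “modal” and “median” disagreements 3/10 and 4/10 of the 10-rater example.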

For any $p \in [0, \infty]$ and pair of vectors $x_{1}, x_{2}$, we have the following (proved in the Appendix, Sect. 6):

(3.4)

\begin{matrix} V (d_{p}) [x_{1}, x_{2}] = \frac{1}{2} d_{p} (x_{1}, x_{2}), V (d_{p}^{p}) [x_{1}, x_{2}] = \frac{1}{2^{p}} d_{p}^{p} (x_{1}, x_{2}) . \end{matrix}

It follows that $κ_{d_{p}} = κ_{V (d_{p})}$ and $κ_{d_{p}^{p}} = κ_{V (d_{p}^{p})}$ when we are dealing with pairwise agreement, since a constant factor in the disagreement function cancels in the ratio that defines kappa. Thus, the Fréchet variances generalize the pairwise agreement for these distances to g-wise coefficients. But be aware that the particular case of $V (d_{2}^{2})$ constitutes a trivial generalization, as it can be shown that the kappas do not vary with g when using the quadratic Fréchet variance $V (d_{2}^{2})$. It follows that $κ_{V (d_{2}^{2})}$ equals the concordance coefficient for every g.
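
Identity (3.4) can be checked numerically for two scalar ratings. The sketch below approximates the minimization over μ by a grid search, which is a stand-in chosen for illustration and not the argument used in the Appendix; the function name, the grid, and the two ratings are assumptions.

```python
def frechet_var_grid(points, dist, grid):
    """Approximate min over mu of the average distance, using a finite grid."""
    return min(sum(dist(mu, x) for x in points) / len(points) for mu in grid)


x1, x2 = 1.0, 4.0
grid = [i / 1000 for i in range(0, 5001)]  # candidate centres in [0, 5]

# p = 1: V(d_1)[x1, x2] should equal d_1(x1, x2) / 2.
v_abs = frechet_var_grid([x1, x2], lambda a, b: abs(a - b), grid)
rhs_abs = abs(x1 - x2) / 2

# p = 2: V(d_2^2)[x1, x2] should equal d_2^2(x1, x2) / 2^2.
v_sq = frechet_var_grid([x1, x2], lambda a, b: (a - b) ** 2, grid)
rhs_sq = (x1 - x2) ** 2 / 2 ** 2
```

For $p = 1$ every μ between the two ratings attains the minimum, whereas for $p = 2$ the minimizer is the midpoint; in both cases the Fréchet variance is a constant multiple of the pairwise distance, which is why the kappas are unchanged.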

Example 1

Suppose you have $R = 5$ raters and 4 items, with ratings (1, 1, 2, 1, 1), (1, 2, 3, 2, 2), (2, 1, 1, 1, 1), (2, 3, 4, 4, 5). The Fréchet means using the distance $| x - y |$ equal the sample medians 1, 2, 1, 4. The Fréchet variances are $V (d_{1}) = (0.2, 0.4, 0.2, 0.8)$.
To calculate the sample Cohen’s kappa with $d = V (d_{1})$, we first find the mean disagreement $\overline{V (d_{1})} = 0.4$ (2.10), then the mean Cohen disagreement, which is $\approx 0.73$ (2.11). Thus, Cohen’s kappa is $1 - 0.4 / 0.73 \approx 0.45$.
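The first half of Example 1 can be reproduced with a few lines of Python. This is a minimal sketch, assuming (as the numbers in the example suggest) that the Fréchet variance under $| x - y |$ is the mean absolute deviation from the median; the function name `frechet_variance_d1` is ours, not the paper's. The Cohen chance disagreement (2.11) is not computed here, since its definition lies outside this passage.

```python
from statistics import median

# Ratings from Example 1: one row per item, one column per rater (R = 5).
ratings = [
    (1, 1, 2, 1, 1),
    (1, 2, 3, 2, 2),
    (2, 1, 1, 1, 1),
    (2, 3, 4, 4, 5),
]


def frechet_variance_d1(row):
    """Mean absolute deviation from the median, taken here as V(d_1)."""
    center = median(row)  # a Fréchet mean under |x - y|
    return sum(abs(x - center) for x in row) / len(row)


variances = [frechet_variance_d1(row) for row in ratings]
mean_disagreement = sum(variances) / len(variances)

print(variances)          # per-item Fréchet variances
print(mean_disagreement)  # observed mean disagreement, about 0.4
```

Dividing this observed disagreement by the chance disagreement and subtracting the ratio from 1 then gives the kappa reported in the example.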

We believe the most useful distance measures will typically be $d_{0}$ for categorical data and $d_{1}$ for ordinal data, both using $g = R$. The quadratic distance $d_{2}^{2}$ could be used for ordinal data as well, but is harder to justify, as it is usually not obvious why we would be interested in the squared distance between two observations rather than just the distance itself.
The distances $d_{p}, p \in (1, \infty]$, with $d_{2}$ included, are even harder to recommend, as they do not work in a coordinatewise manner for vector data. In any case, it seems most reasonable to go with the R-wise variants of these distance measures, as they make use of all the available information, whereas the g-wise agreement coefficients ($g < R$) do not.

Example 2

In the paper introducing what is now called Fleiss’s kappa, Fleiss (1971) discussed an example involving 5 different types of diagnoses, $n = 30$ patients, and $R = 6$ psychiatrists. The data were originally from Sandifer et al. (1968), but Fleiss removed some ratings to make the design rectangular. We use this data to illustrate the difference between Hubert’s kappa and the Fréchet variances when applied to nominal data with $g = R$.

Hubert’s kappa is $π = 0.166$ while Fleiss’s kappa using $V (d_{0})$ is $π = 0.486$. The substantial difference suggests that a sizeable number of rating vectors contain at least one rating that disagrees with the others. Table 2 summarizes the relevant aspects of the data. The maximal agreement row could potentially go from 1 to 6, but the smallest number of raters agreeing on the classification of an item in this data set is 3. The count row gives the number of items with the corresponding maximal agreements and distances. According to the Hubert distance, the raters disagree a lot, since only 5 items have a disagreement of 0 and the rest a disagreement of 1. On the other hand, $V (d_{0})$ results in a much smaller overall disagreement, with all disagreements smaller than the maximum of 1.

Table 2 Maximal agreement for the data of Fleiss (1971).

*The largest number of raters that agree on the classification of an item. Both $V (d_{0})$ and Hubert’s distance depend only on this when $g = R$.
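The contrast between the two nominal disagreement functions can be sketched in code. The sketch assumes that, for $g = R$, the Fréchet variance under $d_{0}$ is the share of raters who deviate from the modal category (the same normalization as in Example 1), and that Hubert's distance is 0 only when every rater agrees. The function names and the three rating vectors are hypothetical and are not taken from the Fleiss data.

```python
from collections import Counter


def maximal_agreement(row):
    """Largest number of raters choosing the same category for one item."""
    return max(Counter(row).values())


def hubert_distance(row):
    """0 if every rater agrees, 1 otherwise."""
    return 0 if maximal_agreement(row) == len(row) else 1


def frechet_variance_d0(row):
    """Share of raters deviating from the modal category (assumed V(d_0))."""
    return 1 - maximal_agreement(row) / len(row)


# Hypothetical rating vectors with R = 6 raters.
examples = [
    ("A", "A", "A", "A", "A", "A"),
    ("A", "A", "A", "A", "A", "B"),
    ("A", "A", "A", "B", "B", "C"),
]
for row in examples:
    print(maximal_agreement(row), hubert_distance(row), frechet_variance_d0(row))
```

Both functions depend on a rating vector only through its maximal agreement, but Hubert's distance jumps to 1 as soon as a single rater deviates, while the Fréchet variance grows gradually.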

4. Inference

4.1. Limit Theory Using U-Statistics

Let $X_{1}, \dots, X_{n}$ be independently and identically distributed and $ψ (x_{1}, \dots, x_{k})$ be a symmetric function. A U-statistic of order k with kernel $ψ$ is

(4.1)

\begin{matrix} U_{n} = \binom{n}{k}^{- 1} \sum_{i_{1}, \dots, i_{k}} ψ (X_{i_{1}}, \dots, X_{i_{k}}), \end{matrix}

where the sum extends over all k-dimensional tuples satisfying $1 \leq i_{1} < i_{2} < \dots < i_{k} \leq n$.
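Definition (4.1) translates directly into code. The following is a minimal sketch rather than an efficient implementation; the helper `u_statistic` and the data are hypothetical. As a sanity check it uses a textbook kernel that is not one of the paper's: the order-2 kernel $(x - y)^{2} / 2$, whose U-statistic is the unbiased sample variance.

```python
from itertools import combinations
from math import comb
from statistics import variance


def u_statistic(kernel, xs, k):
    """Average a symmetric kernel over all index sets i_1 < ... < i_k."""
    total = sum(
        kernel(*(xs[i] for i in idx))
        for idx in combinations(range(len(xs)), k)
    )
    return total / comb(len(xs), k)


def half_squared_difference(x, y):
    """Order-2 kernel whose U-statistic is the unbiased sample variance."""
    return (x - y) ** 2 / 2


data = [2.0, 3.5, 1.0, 4.0, 2.5]  # hypothetical observations
print(u_statistic(half_squared_difference, data, 2))
print(variance(data))  # agrees up to floating-point error
```

The cost grows like $\binom{n}{k}$ kernel evaluations, which is why the order of the kernel matters for the computations discussed later.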

The theory of U-statistics was established by Hoeffding (1992); for an introduction, see, e.g., Chapter 6.1 of Lehmann (2004), Chapter 5 of Serfling (1980), or the textbook of Lee (2019). These references handle U-statistics where the $X_{i}$s are real-valued, but their results, including the simple results below, hold for vector-valued $X_{i}$s as well (Korolyuk and Borovskich, 2013).

The weighted chance agreement of Fleiss-type (Cohen-type) is a U-statistic with kernel $F_{d}$ ($C_{d}$), of order g. The disagreement is a U-statistic with kernel $D_{d}$, which has order 1. To find the asymptotic variance of the kappas, we will use formulas for the asymptotic covariance of U-statistics.
Let $U_{1 n}$ and $U_{2 n}$ be two U-statistics of n observations with symmetric kernel functions $ψ_{1}$, $ψ_{2}$ of orders $k_{1}$ and $k_{2}$. Define

\begin{matrix} σ_{1}^{2} = & Var (E [ψ_{1} (X_{1}, \dots, X_{k_{1}}) ∣ X_{1}]), \\ σ_{12} = & Cov (E [ψ_{1} (X_{1}, \dots, X_{k_{1}}) ∣ X_{1}], E [ψ_{2} (X_{1}, \dots, X_{k_{2}}) ∣ X_{1}]) . \end{matrix}

Then we have $n Cov (U_{1 n}, U_{2 n}) \to k_{1} k_{2} σ_{12}$ and $n Var (U_{1 n}) \to k_{1}^{2} σ_{1}^{2}$ (Lee, 2019, Theorem 2, p. 76). It is also possible to calculate the exact covariances, which could potentially make the asymptotic variances for the kappas perform better. See Appendix, Sect. 6 for the formula for the exact covariance (Lee, 2019, Theorem 2, p. 17).

Lemma 1

Define the parameter vectors $p = (D_{d}, C_{d}, F_{d})$ and $\hat{p} = ({\hat{D}}_{d}, {\hat{C}}_{d}, {\hat{F}}_{d})$. Then $\sqrt{n} (\hat{p} - p) \overset{d}{\to} N (0, Σ)$, where $Σ$ is the covariance matrix with elements

\begin{matrix} σ_{11} = σ_{D}^{2} = Var D_{d} (X_{1}), & σ_{12} = σ_{CD} = g Cov (μ_{dC} (X_{1}), D_{d} (X_{1})), \\ σ_{22} = σ_{C}^{2} = g^{2} Var μ_{dC} (X_{1}), & σ_{13} = σ_{FD} = g Cov (μ_{dF} (X_{1}), D_{d} (X_{1})), \\ σ_{33} = σ_{F}^{2} = g^{2} Var μ_{dF} (X_{1}), & σ_{23} = σ_{CF} = g^{2} Cov (μ_{dC} (X_{1}), μ_{dF} (X_{1})) . \end{matrix}

Here the variables $μ_{dC} (X_{1})$ and $μ_{dF} (X_{1})$ are defined as

\begin{matrix} μ_{dC} (X_{1}) = E [C_{d} (X_{1}, \dots, X_{g}) ∣ X_{1}], \quad μ_{dF} (X_{1}) = E [F_{d} (X_{1}, \dots, X_{g}) ∣ X_{1}] . \end{matrix}

The form of the covariance matrix follows from the remarks preceding the lemma. Asymptotic normality follows from a general theorem about asymptotic normality of U-statistics; see, e.g., Theorem 2 of Lee (2019, p. 76).

We want to use Lemma 1 to find the limit distribution of the generalized Cohen’s kappa and Fleiss’s kappa. To this end, recall the multivariate delta method (see, e.g., Lehmann, 2004, Theorem 5.2.3). Let $f : \mathbb{R}^{k} \to \mathbb{R}$ be continuously differentiable at $θ$ and suppose that $\sqrt{n} (\hat{θ} - θ) \overset{d}{\to} N (0, Σ)$. Then

(4.2)

\begin{matrix} \sqrt{n} [f (\hat{θ}) - f (θ)] \overset{d}{\to} N (0, \nabla f {(θ)}^{T} Σ \nabla f (θ)), \end{matrix}

where $\nabla f (θ)$ denotes the gradient of f at $θ$.

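In the case of Cohen’s kappa, $κ_{d} = 1 - D_{d} / C_{d}$, and Fleiss’s kappa, $π_{d} = 1 - D_{d} / F_{d}$, the gradients with respect to $(D_{d}, C_{d})$ and $(D_{d}, F_{d})$ are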

(4.3)

\begin{matrix} \nabla κ_{d} = \frac{1}{C_{d}} (- 1, \frac{D_{d}}{C_{d}}), \quad \nabla π_{d} = \frac{1}{F_{d}} (- 1, \frac{D_{d}}{F_{d}}) . \end{matrix}

Using some algebra, the expressions for the asymptotic variances follow from Lemma 1 and the above gradients.

Proposition 1

Cohen’s kappa ${\hat{κ}}_{d}$ and Fleiss’s kappa ${\hat{π}}_{d}$ are asymptotically normal, and their asymptotic variances are

(4.4)

\begin{matrix} σ_{κ}^{2} = & σ_{D}^{2} \frac{1}{C_{d}^{2}} - 2 σ_{CD} \frac{D_{d}}{C_{d}^{3}} + σ_{C}^{2} \frac{D_{d}^{2}}{C_{d}^{4}}, \\ σ_{π}^{2} = & σ_{D}^{2} \frac{1}{F_{d}^{2}} - 2 σ_{FD} \frac{D_{d}}{F_{d}^{3}} + σ_{F}^{2} \frac{D_{d}^{2}}{F_{d}^{4}} . \end{matrix}

These results are also valid for ${\hat{κ}}_{d}^{u}$ and ${\hat{π}}_{d}^{u}$. Since the sample Krippendorff’s alpha (see note below) is equal to ${\hat{α}}_{d} = {\hat{π}}_{d} + \frac{1}{2 R n} (1 - {\hat{π}}_{d})$, it is also asymptotically normal with asymptotic variance $σ_{π}^{2}$.
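To make the algebra behind (4.4) concrete, the sketch below evaluates the closed form for $σ_{κ}^{2}$ and compares it with the quadratic form $\nabla κ_{d}^{T} Σ \nabla κ_{d}$ from (4.2) and (4.3), restricted to the $(D_{d}, C_{d})$ block of $Σ$. The function names and all numerical inputs are hypothetical and serve only to exercise the formulas; replacing the Cohen-type quantities by their Fleiss-type counterparts gives $σ_{π}^{2}$ in the same way.

```python
def kappa_asymptotic_variance(D, C, var_D, cov_CD, var_C):
    """Closed form (4.4) for the asymptotic variance of kappa = 1 - D / C."""
    return var_D / C**2 - 2 * cov_CD * D / C**3 + var_C * D**2 / C**4


def delta_method_variance(D, C, var_D, cov_CD, var_C):
    """The same quantity written as the quadratic form grad^T Sigma grad."""
    grad = (-1 / C, D / C**2)  # gradient of 1 - D / C with respect to (D, C)
    sigma = ((var_D, cov_CD), (cov_CD, var_C))
    return sum(
        grad[a] * sigma[a][b] * grad[b] for a in range(2) for b in range(2)
    )


# Hypothetical plug-in values, chosen only to exercise the formulas.
args = dict(D=0.4, C=0.73, var_D=0.08, cov_CD=0.05, var_C=0.12)
print(kappa_asymptotic_variance(**args))
print(delta_method_variance(**args))  # matches the closed form
```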

With $g = 2$ and a finite number of categories, Schouten (1980) derived $σ_{π}^{2}$, while Schouten (1982) and O’Connell and Dobson (1984) derived $σ_{κ}^{2}$. The result for Krippendorff’s alpha is, to our knowledge, new.

A brief aside on Krippendorff’s alpha. Krippendorff’s alpha (Krippendorff, 1970) is an agreement coefficient especially popular in content analysis (Krippendorff, 2018). It has no population definition, but its sample definition equals ${\hat{α}}_{d} = {\hat{π}}_{d} + \frac{1}{N} (1 - {\hat{π}}_{d})$ (the total sample size N equals 2Rn in the case of a rectangular design); see Proposition 3 in Appendix for a justification. For this reason, all of the results about the limit of ${\hat{π}}_{d}^{u}$ apply to Krippendorff’s alpha as well, as it is an asymptotically equivalent estimator of $π_{d}$. Note, however, that Krippendorff (2018) emphasizes the use of non-rectangular designs, and the limit results in the preceding section do not hold for such study designs.
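The asymptotic equivalence can be spelled out in one line. The following is a sketch of the standard argument, using only the sample identity stated above with $N = 2 R n$ and Slutsky's lemma; it is not quoted from the paper's appendix.

```latex
\sqrt{n}\,(\hat{\alpha}_{d} - \hat{\pi}_{d})
  = \sqrt{n}\,\frac{1}{2Rn}\,(1 - \hat{\pi}_{d})
  = \frac{1 - \hat{\pi}_{d}}{2R\sqrt{n}}
  \overset{p}{\to} 0,
\qquad\text{so}\qquad
\sqrt{n}\,(\hat{\alpha}_{d} - \pi_{d})
  = \sqrt{n}\,(\hat{\pi}_{d} - \pi_{d}) + o_{p}(1)
  \overset{d}{\to} N(0, \sigma_{\pi}^{2}).
```

The difference between the two estimators vanishes faster than the $\sqrt{n}$ rate because ${\hat{π}}_{d}$ converges in probability and is therefore bounded in probability.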

4.2. Estimating the Variances

The unknown quantities $D_{d}$, $C_{d}$, and $F_{d}$ can be estimated using their sample counterparts ${\hat{D}}_{d}$, ${\hat{C}}_{d}$, and ${\hat{F}}_{d}$. The variances and covariances can be estimated using the empirical (co)variances of the estimated $\hat{μ}$s. These have formulas

(4.5)

\begin{matrix} {\hat{μ}}_{d} (x_{i}) = & D_{d} (x_{i}), \\ {\hat{μ}}_{dC} (x_{i}) = & n^{- (g - 1)} \sum_{i_{1}, \dots, i_{g - 1}} C_{d} (x_{i}, x_{i_{1}}, \dots, x_{i_{g - 1}}), \\ {\hat{μ}}_{dF} (x_{i}) = & n^{- (g - 1)} \sum_{i_{1}, \dots, i_{g - 1}} F_{d} (x_{i}, x_{i_{1}}, \dots, x_{i_{g - 1}}), \end{matrix}

where the index sets run over all combinations with repetitions of $(i_{1}, i_{2}, \dots, i_{g - 1})$.
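The plug-in estimates in (4.5) can be sketched generically. The code below assumes that the normalization $n^{- (g - 1)}$ means the sum runs over all $n^{g - 1}$ index tuples, repetitions included. The helper `mu_hat`, the scalar `toy_kernel`, and the data are hypothetical; in particular, `toy_kernel` is only a stand-in for $C_{d}$ or $F_{d}$, whose definitions lie outside this passage.

```python
from itertools import product


def mu_hat(kernel, xs, i, g):
    """Plug-in estimate (4.5): average the kernel over all (g - 1)-tuples."""
    n = len(xs)
    total = sum(
        kernel(xs[i], *(xs[j] for j in idx))
        for idx in product(range(n), repeat=g - 1)
    )
    return total / n ** (g - 1)


def toy_kernel(x1, x2):
    """Hypothetical order-2 stand-in for C_d or F_d, not the paper's kernel."""
    return abs(x1[0] - x2[0])


items = [(1,), (2,), (4,)]  # hypothetical rating vectors, one per item
mu = [mu_hat(toy_kernel, items, i, g=2) for i in range(len(items))]
print(mu)                 # one estimate per item
print(sum(mu) / len(mu))  # average of the per-item estimates
```

Each call evaluates the kernel $n^{g - 1}$ times, so this direct approach is only practical for small n and g.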

Observe that computing ${\hat{μ}}_{dC}$ and ${\hat{μ}}_{dF}$ directly is computationally very expensive, especially without binning, and binning is not possible with continuous data. The obvious computation of all ${\hat{μ}}_{dC}$ requires a number of operations on the order of $n^{g - 1}$, which is prohibitively expensive for large n and g. However, there are few applications of agreement measures with very large n and g, so this should not be a serious problem in practice. We note that less computationally demanding procedures are possible for the quadratic Fréchet variance $V (d_{2}^{2})$, as it can be shown that its associated kappas are invariant under g.
Thus, we may use the computationally very efficient methods for the concordance coefficient outlined by, e.g., Carrasco and Jover (2003).

From the definitions of ${\hat{D}}_{d}$, ${\hat{C}}_{d}$, and ${\hat{F}}_{d}$ (4), we quickly deduce that $\overline{{\hat{μ}}_{d}} = {\hat{D}}_{d}$, $\overline{{\hat{μ}}_{dC}} = {\hat{C}}_{d}$ and $\overline{{\hat{μ}}_{dF}} = {\hat{F}}_{d}$. Using this fact, we can define the estimators

$$\hat{\sigma}_{C}^{2} = \frac{g^{2}}{n-1}\sum_{i=1}^{n}\bigl(\hat{\mu}_{dC}(\boldsymbol{x}_{i}) - \hat{C}_{d}\bigr)^{2}, \qquad \hat{\sigma}_{CD}^{2} = \frac{g}{n-1}\sum_{i=1}^{n}\bigl(\hat{\mu}_{dC}(\boldsymbol{x}_{i}) - \hat{C}_{d}\bigr)\bigl(\hat{\mu}_{d}(\boldsymbol{x}_{i}) - \hat{D}_{d}\bigr),$$

and $\hat{\sigma}_{D}^{2} = \frac{1}{n-1}\sum_{i=1}^{n}(\hat{\mu}_{d}(\boldsymbol{x}_{i}) - \hat{D}_{d})^{2}$. Moreover, we can estimate $\hat{\sigma}_{F}^{2}$ and $\hat{\sigma}_{FD}^{2}$ in the same way, substituting $\hat{\mu}_{dF}$ for $\hat{\mu}_{dC}$. Using the formulas for the theoretical variances (4.4), we find the estimators

$$\hat{\sigma}_{\kappa}^{2} = \hat{\sigma}_{D}^{2}\frac{1}{\hat{C}_{d}^{2}} - 2\hat{\sigma}_{CD}\frac{\hat{D}_{d}}{\hat{C}_{d}^{3}} + \hat{\sigma}_{C}^{2}\frac{\hat{D}_{d}^{2}}{\hat{C}_{d}^{4}}, \tag{4.6}$$

$$\hat{\sigma}_{\pi}^{2} = \hat{\sigma}_{D}^{2}\frac{1}{\hat{F}_{d}^{2}} - 2\hat{\sigma}_{FD}\frac{\hat{D}_{d}}{\hat{F}_{d}^{3}} + \hat{\sigma}_{F}^{2}\frac{\hat{D}_{d}^{2}}{\hat{F}_{d}^{4}}. \tag{4.7}$$

The variance estimator $\hat{\sigma}_{\pi}^{2}$ coincides with that of Gwet (2021, equation 4) in the case of nominal weights; see Appendix (Sect. 6) for a proof sketch.
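As a minimal sketch of how (4.6) might be evaluated, the function below takes the per-item values $\hat{\mu}_{d}(\boldsymbol{x}_{i})$ and $\hat{\mu}_{dC}(\boldsymbol{x}_{i})$ as already computed and simply plugs them into the estimators above. The function name and argument names are hypothetical. Computing the per-item values themselves is not shown.

```python
def kappa_variance(mu_d, mu_dc, g):
    """Plug-in variance estimate (4.6) from per-item values.

    mu_d  : list of per-item observed-disagreement values mu_d(x_i)
    mu_dc : list of per-item chance-disagreement values mu_dC(x_i)
    g     : number of raters entering the disagreement function
    """
    n = len(mu_d)
    d_hat = sum(mu_d) / n    # D-hat is the mean of mu_d(x_i)
    c_hat = sum(mu_dc) / n   # C-hat is the mean of mu_dC(x_i)
    dev_d = [m - d_hat for m in mu_d]
    dev_c = [m - c_hat for m in mu_dc]
    sigma_d2 = sum(a * a for a in dev_d) / (n - 1)
    sigma_c2 = g ** 2 * sum(a * a for a in dev_c) / (n - 1)
    sigma_cd = g * sum(a * b for a, b in zip(dev_c, dev_d)) / (n - 1)
    return (sigma_d2 / c_hat ** 2
            - 2.0 * sigma_cd * d_hat / c_hat ** 3
            + sigma_c2 * d_hat ** 2 / c_hat ** 4)
```

Replacing `mu_dc` with the per-item values $\hat{\mu}_{dF}(\boldsymbol{x}_{i})$ gives the corresponding sketch for (4.7).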

4.3. Improving Approximate Normality with the Arcsine and Fisher Transforms

It is well known that the Fisher transform (Fisher, 1915) improves the inference for the correlation coefficient. If $r$ is the sample correlation, $\operatorname{artanh}(r) = \frac{1}{2}\log[(1+r)/(1-r)]$ has approximately the same variance for most values of the population correlation, and its distribution is closer to normal than that of the untransformed $r$, especially when the population correlation is close to $\pm 1$. This transform makes sense outside the world of correlations; for instance, Lin (1989) used the Fisher transform to improve the normality of the quadratically weighted Cohen’s kappa.

The arcsine is another reasonable transformation of $\hat{\kappa}_{d}$ and $\hat{\pi}_{d}$. The arcsine is the inverse of the sine function and can be written as $\arcsin x = \int_{0}^{x} 1/\sqrt{1-t^{2}}\,\mathrm{d}t$. In ecology (Warton and Hui, 2011), the arcsine transformation denotes $\arcsin\sqrt{p}$, where $p$ is a probability. We do not take the square root, however, as $\hat{\kappa}_{d}$ and $\hat{\pi}_{d}$ can be negative.

Calculating the limiting variance of $\arcsin\hat{\kappa}_{d}$ and $\arcsin\hat{\pi}_{d}$ requires an additional application of the delta method (4.2). Using that $\frac{\mathrm{d}}{\mathrm{d}x}\arcsin(x) = 1/\sqrt{1-x^{2}}$ and $\frac{\mathrm{d}}{\mathrm{d}x}\operatorname{artanh}(x) = 1/(1-x^{2})$, we find

$$\sqrt{n}\,(\arcsin\hat{\kappa}_{d} - \arcsin\kappa_{d}) \to N\bigl(0, (1-\kappa_{d}^{2})^{-1}\sigma_{\kappa}^{2}\bigr), \tag{4.8}$$

$$\sqrt{n}\,(\operatorname{artanh}\hat{\kappa}_{d} - \operatorname{artanh}\kappa_{d}) \to N\bigl(0, (1-\kappa_{d}^{2})^{-2}\sigma_{\kappa}^{2}\bigr). \tag{4.9}$$

Expressions for $\hat{\pi}_{d}$ can be found by swapping $\kappa_{d}$ for $\pi_{d}$ and $\sigma_{\kappa}^{2}$ for $\sigma_{\pi}^{2}$.
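The step from $\sigma_{\kappa}^{2}$ to the variances in (4.8) and (4.9) can be written out as follows. This is a sketch that assumes the delta method (4.2) takes its usual univariate form and that $\sqrt{n}(\hat{\kappa}_{d} - \kappa_{d}) \to N(0, \sigma_{\kappa}^{2})$ with $|\kappa_{d}| < 1$. The symbol $h$ is introduced here for illustration only.

$$
\begin{aligned}
\sqrt{n}\,\bigl(h(\hat{\kappa}_{d}) - h(\kappa_{d})\bigr) &\to N\bigl(0,\; h'(\kappa_{d})^{2}\,\sigma_{\kappa}^{2}\bigr), \\
h = \arcsin: \quad h'(\kappa_{d})^{2} &= \Bigl(\frac{1}{\sqrt{1-\kappa_{d}^{2}}}\Bigr)^{2} = (1-\kappa_{d}^{2})^{-1}, \\
h = \operatorname{artanh}: \quad h'(\kappa_{d})^{2} &= \Bigl(\frac{1}{1-\kappa_{d}^{2}}\Bigr)^{2} = (1-\kappa_{d}^{2})^{-2}.
\end{aligned}
$$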

Example 3

This example illustrates that the arcsine and Fisher transforms may make the sampling distribution closer to the normal distribution. Let the number of raters be $R = 3$, the disagreement function be quadratic (with $g = 2$), and the number of items be $n = 20$. There are five categories and the true classification of an item is one of $\{1, 2, 3, 4, 5\}$ with probability 1/5 each. Every rater knows the true classification of an item with probability 0.9. If they do not know the correct classification, they will guess a classification from $\{1, 2, 3, 4, 5\}$ uniformly at random. One can show that the population value of the quadratically weighted Cohen’s kappa is $0.9^{2} = 0.81$ under these circumstances, following the arguments of Perreault and Leigh (1989). We simulate the value of $\hat{\kappa}_{d}$ a total of $N = 50{,}000$ times and transform the simulated values using the identity transform, the arcsine transform, and the Fisher transform. The results are shown in Fig. 1. The arcsine transform appears to bring the sampling distribution of $\hat{\kappa}_{d}$ closer to the normal distribution, with the Fisher transform also improving normality quite a bit.

Figure 1 Simulated sampling distribution of $\hat{\kappa}_{d}$ for quadratic weights using three transformations, $n = 20$, $R = 3$. The simulation setup is described in Example 3. The arcsine transform makes the sampling distribution closest to the normal distribution.
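The simulation in Example 3 can be sketched as follows. The code assumes a particular plug-in form of the pairwise quadratically weighted Cohen’s kappa: observed and chance disagreement are each averaged over all pairs of raters, and chance disagreement uses all $n^{2}$ item pairs. The definitions of the sample estimates given earlier in the paper take precedence if they differ. All function names are hypothetical.

```python
import random
from itertools import combinations

CATEGORIES = (1, 2, 3, 4, 5)


def simulate_ratings(n, n_raters, p_know, rng):
    """Return n rows of ratings from the know-or-guess model of Example 3."""
    rows = []
    for _ in range(n):
        truth = rng.choice(CATEGORIES)  # true class, uniform over the categories
        rows.append([truth if rng.random() < p_know else rng.choice(CATEGORIES)
                     for _ in range(n_raters)])
    return rows


def quadratic_cohen_kappa(rows):
    """Pairwise (g = 2) quadratic kappa with a Cohen-type chance term (assumed form).

    Raises ZeroDivisionError in the degenerate case where every rating is identical.
    """
    n, n_raters = len(rows), len(rows[0])
    cols = [[row[r] for row in rows] for r in range(n_raters)]
    mean1 = [sum(c) / n for c in cols]
    mean2 = [sum(v * v for v in c) / n for c in cols]
    pairs = list(combinations(range(n_raters), 2))
    # Observed disagreement: squared differences within items, averaged over rater pairs.
    d_hat = sum(sum((a - b) ** 2 for a, b in zip(cols[r], cols[s])) / n
                for r, s in pairs) / len(pairs)
    # Chance disagreement: squared differences over all n * n item pairs,
    # obtained from sample moments rather than an explicit double loop.
    c_hat = sum(mean2[r] + mean2[s] - 2.0 * mean1[r] * mean1[s]
                for r, s in pairs) / len(pairs)
    return 1.0 - d_hat / c_hat


def sampling_distribution(n_reps, n=20, n_raters=3, p_know=0.9, seed=1):
    """Simulate n_reps values of the kappa estimate under the Example 3 setup."""
    rng = random.Random(seed)
    return [quadratic_cohen_kappa(simulate_ratings(n, n_raters, p_know, rng))
            for _ in range(n_reps)]
```

Applying `math.asin` or `math.atanh` to each simulated value gives the transformed samples. With $n = 20$ an estimate of exactly 1 can occur, and `math.atanh(1.0)` raises a `ValueError`, so such draws need separate handling before the Fisher transform is applied.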

5. Confidence Intervals

Using the methodology we have developed, we can easily construct confidence intervals for the agreement coefficients.

We describe our three confidence interval constructions only for Cohen’s kappa, as the intervals using Fleiss’s kappa can be found by replacing every instance of $\hat{\kappa}_{d}$ with $\hat{\pi}_{d}$ and $\hat{\sigma}_{\kappa}^{2}$ with $\hat{\sigma}_{\pi}^{2}$. We use the two-sided t-distribution-based confidence intervals with nominal level $1 - \alpha = 0.95$. Let $c$ be the $(1 - \alpha/2)$-quantile of the t distribution with $n - 1$ degrees of freedom. The basic interval is

$$\bigl[\hat{\kappa}_{d} - c\,\hat{\sigma}_{\kappa}/\sqrt{n-1},\; \hat{\kappa}_{d} + c\,\hat{\sigma}_{\kappa}/\sqrt{n-1}\bigr], \tag{5.1}$$

where $\hat{\sigma}_{\kappa}$ is the square root of the estimated variance $\hat{\sigma}_{\kappa}^{2}$ described in equation (4.6).

The arcsine interval replaces the basic limits with

$$\sin\bigl(\arcsin\hat{\kappa}_{d} \pm c\,(1-\hat{\kappa}_{d}^{2})^{-1/2}\,\hat{\sigma}_{\kappa}/\sqrt{n-1}\bigr), \tag{5.2}$$

where $(1-\hat{\kappa}_{d}^{2})^{-1}\hat{\sigma}_{\kappa}^{2}$ is the estimated asymptotic variance of $\arcsin\hat{\kappa}_{d}$ (4.8). The Fisher interval uses the area hyperbolic tangent,

$$\tanh\bigl(\operatorname{artanh}\hat{\kappa}_{d} \pm c\,(1-\hat{\kappa}_{d}^{2})^{-1}\,\hat{\sigma}_{\kappa}/\sqrt{n-1}\bigr), \tag{5.3}$$

where $(1-\hat{\kappa}_{d}^{2})^{-2}\hat{\sigma}_{\kappa}^{2}$ is the estimated asymptotic variance of $\operatorname{artanh}\hat{\kappa}_{d}$ (4.9).
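The three constructions (5.1), (5.2), and (5.3) can be sketched in a few lines. The function below is a hypothetical helper, not the code from the supplementary material. It takes the critical value $c$ as an argument so that only the standard library is needed, and it assumes $|\hat{\kappa}_{d}| < 1$. The clamping of the arcsine limits to $[-\pi/2, \pi/2]$ is our own safeguard and is not part of (5.2).

```python
import math


def kappa_intervals(kappa_hat, sigma_hat, n, c):
    """Basic, arcsine, and Fisher intervals for an agreement coefficient.

    kappa_hat : point estimate, assumed to lie strictly between -1 and 1
    sigma_hat : square root of the variance estimate (4.6)
    n         : number of items
    c         : (1 - alpha/2)-quantile of the t distribution with n - 1 df
    """
    half = c * sigma_hat / math.sqrt(n - 1)

    # Basic interval (5.1): symmetric around the estimate, may leave [-1, 1].
    basic = (kappa_hat - half, kappa_hat + half)

    # Arcsine interval (5.2): scale by (1 - kappa^2)^(-1/2), map back with sin.
    def clamp(v):
        # Safeguard added here (not in the text): keep sin monotone.
        return max(-math.pi / 2, min(math.pi / 2, v))

    centre_a = math.asin(kappa_hat)
    half_a = half / math.sqrt(1.0 - kappa_hat ** 2)
    arcsine = (math.sin(clamp(centre_a - half_a)),
               math.sin(clamp(centre_a + half_a)))

    # Fisher interval (5.3): scale by (1 - kappa^2)^(-1), map back with tanh.
    centre_f = math.atanh(kappa_hat)
    half_f = half / (1.0 - kappa_hat ** 2)
    fisher = (math.tanh(centre_f - half_f), math.tanh(centre_f + half_f))

    return {"basic": basic, "arcsine": arcsine, "fisher": fisher}
```

For $n = 20$ and nominal level 0.95, the critical value is $c \approx 2.093$. Unlike the basic interval, the arcsine and Fisher intervals always stay inside $[-1, 1]$.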

Using the methodology just described, we can calculate confidence intervals for the Fleiss (1971) data of Example 2.

Example 4

(Ex. 2 cont.) Using the data of Fleiss (1971), we calculate arcsine confidence intervals for the $g$-wise Fleiss’s kappa. The raters are not the same for all items, but it seems plausible to assume that the ratings are exchangeable given the item. The diagnoses are essentially categorical in nature; hence, we will only consider $V(d_{0})$ and Hubert’s disagreement function. The results are shown in Table 3. We see that the agreement coefficients agree when $g = 2$, as both $V(d_{0})$ and Hubert’s disagreement function equal the nominal disagreement in this case. But the coefficients differ substantially as $g$ increases. This is to be expected, as Hubert’s disagreement function measures consensus while $V(d_{0})$ measures the number of observations different from the mode. Observe that $V(d_{0})$ is not invariant with respect to $g$; hence it is a proper alternative to the classical Fleiss’s kappa. Moreover, all confidence intervals are of comparable length.

Table 3 Confidence intervals for the data of Fleiss (1971) using the arcsine method.

*This is Hubert’s kappa when the Hubert disagreement is used.

$^{\dagger}$ Hubert disagreement equals the nominal disagreement $V(d_{0})$ when $g = 2$.

The preceding example fits best into the context of Fleiss’s kappa, as the identities of the raters are unknown. Moreover, there is no ordinal structure in the data, making the $V(d_{1})$ and $V(d_{2}^{2})$ distances unnatural to employ. Our next example concerns the Fréchet variances applied to a case of ordinal data when the identities of the raters are known.

Example 5

Zapf et al. (2016) studied bootstrap intervals for Fleiss’s kappa and Krippendorff’s alpha using simulations and a case study. Their case study concerned the histopathological assessment of breast cancer and involved ratings of $n = 50$ breast cancer biopsies by $R = 4$ senior pathologists. We apply the arcsine method to calculate confidence intervals and point estimates, displayed in Table 4. We focus on Cohen’s kappa since the same four pathologists rate each cancer biopsy, but we include a column for Fleiss’s kappa when $g = 4$ for comparison’s sake. When $g = 4$, Cohen’s kappa and Fleiss’s kappa are as good as indistinguishable. As can be verified by using the code in the supplementary material, this happens for the other values of $g$ as well. It is not generally the case that Fleiss’s kappa and Cohen’s kappa nearly coincide, but it is likely to happen if the marginal ratings are approximately the same for all raters, as is the case in this data set. There is a sizable difference between the disagreement functions, but changing $g$ typically makes little difference when the disagreement function is held fixed. It remains to be seen whether this pattern is common. The exception is Hubert’s disagreement function, which decreases quite a bit. (As in the Fleiss (1971) example, this is expected, as Hubert’s disagreement function is a consensus measure.) Observe that the kappas under the quadratic Fréchet variance $V(d_{2}^{2})$ do not change with $g$, which is always the case.

Table 4 Confidence intervals for Zapf et al. (2016) using the arcsine method.

5.1. Simulation of Confidence Sets When $g = 2$

We include a small simulation study on the performance of confidence sets using two models: a Perreault–Leigh model for discrete rating data and a normal model for continuous rating data. For both models, we investigate the following parameters:

(i) Number of raters $R$. We use $R = 2, 5, 20$, corresponding to a small, medium, and large selection of raters.
(ii) Sample sizes $n$. We use $n = 10, 40, 100$, corresponding to small, medium, and large agreement studies.
(iii) Disagreement functions. Nominal disagreement $1[x \neq y]$, quadratic disagreement $(x-y)^{2}$, and absolute value disagreement $|x-y|$.
(iv) Methods. A basic interval without transformations, an arcsine-transformed interval, and a Fisher-transformed interval.
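The design above can be laid out as a full factorial grid. The following is a minimal sketch with hypothetical names. It only enumerates the configurations and defines the three pairwise disagreement functions, and it does not run the simulations.

```python
from itertools import product

# Pairwise disagreement functions from item (iii).
DISAGREEMENTS = {
    "nominal": lambda x, y: float(x != y),
    "quadratic": lambda x, y: float((x - y) ** 2),
    "absolute": lambda x, y: float(abs(x - y)),
}
N_RATERS = (2, 5, 20)                      # item (i)
SAMPLE_SIZES = (10, 40, 100)               # item (ii)
METHODS = ("basic", "arcsine", "fisher")   # item (iv)


def design_grid():
    """Return one dictionary per combination of R, n, disagreement, and method."""
    return [
        {"R": r, "n": n, "disagreement": name, "method": method}
        for r, n, name, method in product(N_RATERS, SAMPLE_SIZES,
                                          DISAGREEMENTS, METHODS)
    ]
```

With all four factors crossed, the grid has $3 \times 3 \times 3 \times 3 = 81$ cells.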

5.1.1. A Perreault–Leigh Model

Perreault and Leigh (1989) discussed a particular model for ratings in which each rater either knows the correct answer or guesses uniformly at random. Similar models have been used by Gwet (2008) and Maxwell (1977), among others; see Moss (2023) for a thorough discussion of such models. We assume there are five categories encoded as $C = \{-2, -1, 0, 1, 2\}$, and the distribution of the true classification is uniform. For each item rated, the $r$th rater knows the correct classification with probability $\sqrt{0.8}$. If not, the rater guesses, picking a number from $C$ uniformly at random. Then $\kappa_{d} = \pi_{d} = 0.8$ for all weights and any number of raters, as can be verified by following the arguments of Perreault and Leigh (1989). We run each simulation $N = 10{,}000$ times.
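For the pairwise case treated in this section, the value 0.8 can be checked with a short argument. The sketch below is our own summary rather than a reproduction of Perreault and Leigh’s derivation. It assumes that raters know or guess independently of each other and of the true class, that $d(x, x) = 0$, and that $\operatorname{E} d(U, U') > 0$, where $U$ and $U'$ are independent uniform draws from $C$. We write $p$ for the probability of knowing the correct classification and $D_{d}$, $C_{d}$, $F_{d}$ for the population counterparts of $\hat{D}_{d}$, $\hat{C}_{d}$, $\hat{F}_{d}$. Two raters both know the true class with probability $p^{2}$, in which case they agree. Otherwise at least one of them guesses, and their two ratings are independent and uniform on $C$. Every rater’s marginal distribution is uniform, so the Cohen-type and Fleiss-type chance disagreements coincide.

$$
\begin{aligned}
D_{d} &= p^{2} \cdot 0 + (1 - p^{2})\,\operatorname{E} d(U, U'), \\
C_{d} &= F_{d} = \operatorname{E} d(U, U'), \\
\kappa_{d} &= \pi_{d} = 1 - \frac{D_{d}}{C_{d}} = p^{2} = \bigl(\sqrt{0.8}\bigr)^{2} = 0.8 .
\end{aligned}
$$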

The simulated lengths and coverages for Cohen’s kappa are given in Table 5. Two features stand out in Table 5. First, the confidence intervals have almost indistinguishable lengths and coverages when either R or n is large. Second, the basic interval has worse coverage than the arcsine and Fisher intervals when n is small, with the Fisher interval having coverage slightly closer to nominal than the arcsine interval. However, the better nominal coverage comes at the expense of greater lengths. In particular, for the absolute value weight, the coverage of the arcsine interval is greater than the coverage of the Fisher interval, but its length is shorter! The table for Fleiss’s kappa is similar and can be found in Appendix, Table 8.

Table 5 Coverage (first entry) and lengths (second entry) of confidence intervals: Perreault–Leigh model, Cohen’s kappa.

Coverages greater than 0.95 are in bold.

5.1.2. Normal Model

In this study, the rating data is distributed according to the multivariate normal $N (0, Σ)$ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$N(0,\Sigma )$$\end{document} , where $Σ$ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\Sigma $$\end{document} is the $R \times R$ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$R\times R$$\end{document} correlation matrix with off-diagonal elements $Σ_{r_{i} r_{j}} = ρ$ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\Sigma _{r_{i}r_{j}}=\rho $$\end{document} . Since the data is continuous, we study the absolute value disagreement $d_{1}$ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$d_{1}$$\end{document} and the quadratic disagreement $d_{2}^{2}$ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$d_{2}^{2}$$\end{document} only. 
The true values are $\kappa_{d_{2}^{2}} = \pi_{d_{2}^{2}} = \rho$ and $\kappa_{d_{1}} = \pi_{d_{1}} = 1 - \sqrt{1 - \rho}$. See Appendix (Sect. 6) for details on the computation of these true values. We use $\rho = 0.7$, and hence, $\kappa_{d_{2}^{2}} = 0.7$ and $\kappa_{d_{1}} \approx 0.45$.
We run each simulation $N = 1{,}000$ times (see Footnote 5). We note that agreement coefficients are often called concordance coefficients when dealing with continuous data, especially when the quadratic distance is used. Lin’s concordance coefficient (Lin, 1989, 1992) is a prominent example.

The simulated lengths and coverages for Cohen’s kappa are given in Table 6. There is barely any difference between the three confidence interval constructions. Taken together with the results for the Perreault–Leigh model, where the basic interval performs worse than the other two, we would recommend the usage of either the arcsine or Fisher interval. Again, the table for Fleiss’s kappa is very similar and can be found in Appendix (Table 9).

Table 6 Coverage (first entry) and lengths (second entry) of confidence intervals: normal model, Cohen’s kappa.

Coverages greater than 0.95 are in bold.

5.2. Simulation of Confidence Sets when $g \neq 2$

Table 7 contains simulations from the Perreault–Leigh model (Sect. 5.1.1) with $N = 1000$ repetitions and $R = 5$ raters using the Fréchet variances $V(d_{0})$, $V(d_{1})$, and Hubert’s disagreement function. We drop $V(d_{2}^{2})$ since it does not vary with $g$. To save space, we drop the basic confidence interval in the simulation. As before, we show the results only for the Cohen-type disagreement, with the Fleiss-type disagreement relegated to Appendix (Table 10). All coverages are decent, and the coverages and lengths are similar across the board.

Table 7 Coverage (first entry) and lengths (second entry) of confidence intervals for g-wise coefficients: Perreault–Leigh model, Cohen’s kappa.

Coverages greater than 0.95 are in bold.

6. Concluding Remarks

When choosing an agreement coefficient one has to carefully think through exactly what one wishes to measure. The Fréchet variances are attractive because of their interpretation. You measure how much the raters disagree with the generalized mean rater, and then adjust for chance. In the case of nominal data, we measure the disagreement with the modal rater. When dealing with numerical data, we may measure disagreement with the median rater (using the absolute value distance), or the mean rater (using the quadratic distance), or use any other Fréchet variance defined on numeric data.
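The three interpretations above can be made concrete with a small Python sketch. It assumes the Fréchet variance of $g$ ratings is the minimum over candidate centres of the average distance to that centre. This is consistent with the two-rater formula used in the Appendix, but the function names are hypothetical and chosen only for illustration.

```python
from collections import Counter
from statistics import mean, median


def frechet_variance_nominal(ratings):
    """Average 0/1 distance to the modal rating: 1 - (largest count) / g."""
    g = len(ratings)
    return 1 - max(Counter(ratings).values()) / g


def frechet_variance_absolute(ratings):
    """Average absolute distance to a median of the ratings."""
    m = median(ratings)
    return mean(abs(x - m) for x in ratings)


def frechet_variance_quadratic(ratings):
    """Average squared distance to the mean (the population variance)."""
    m = mean(ratings)
    return mean((x - m) ** 2 for x in ratings)
```

For two ratings $x_{1}, x_{2}$ the last two functions reduce to $\frac{1}{2}|x_{1} - x_{2}|$ and $\frac{1}{4}|x_{1} - x_{2}|^{2}$, matching the two-rater case.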

When dealing with nominal data, we believe that using the Fréchet variance, which measures the distance from the mode, is a reasonable choice. But other options are certainly possible, even when dealing with g-wise agreement measures. For example, one could use the entropy instead, with distance measure $d(x_{1}, x_{2}, \ldots, x_{g}) = -\sum_{i = 1}^{g} \frac{\# i}{g} \log \frac{\# i}{g}$, where $\# i$ counts the number of elements in $(x_{1}, x_{2}, \ldots, x_{g})$ classified as $i$, which could be useful when the number of raters is finite but large. The topic of how to choose reasonable distance measures for g-wise agreement studies has not been thoroughly studied, and there might be options preferable to the Fréchet variances that have not yet been found.
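A minimal Python sketch of the entropy-based disagreement may help fix ideas. The function name is hypothetical, the natural logarithm is an assumption, and categories that do not occur in the tuple are skipped, following the usual convention $0 \log 0 = 0$.

```python
from collections import Counter
from math import log


def entropy_disagreement(ratings):
    """Entropy of the empirical category distribution of g ratings.

    Uses the natural logarithm and sums over observed categories only,
    so unobserved categories contribute nothing (0 * log 0 = 0).
    """
    g = len(ratings)
    # p * log(1 / p) is the same as -p * log(p) and avoids returning -0.0.
    return sum((c / g) * log(g / c) for c in Counter(ratings).values())
```

The disagreement is zero when all raters choose the same category and grows as the ratings spread evenly over more categories.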

We have only covered rectangular designs, where every item is rated by the same number of raters. It is quite easy to generalize the definitions of $\kappa_{d}$ and $\pi_{d}$ to non-rectangular designs, as we have done in Appendix, Sect. 6. But inference appears to be quite difficult, probably requiring additional assumptions for the case of non-exchangeable ratings.

In Sect. 4, we introduced the U-statistic-based estimators of $C_{d}$ and $F_{d}$, but only used them for theoretical purposes. The U-statistic-based estimators may plausibly outperform the classical V-statistic-based estimators since they are minimum variance unbiased estimators. It would be interesting to see whether the U-statistic-based estimators could outperform the traditional V-statistic-based estimators when $n$ is small, for example in terms of mean squared error or confidence interval coverage.

The confidence intervals based on the arcsine and Fisher transforms perform better than the basic, untransformed interval. It is unclear which one of these intervals to prefer, but it barely matters when the sample size is sufficiently large. It might be possible to improve all of these intervals. Small-sample corrections to the variance appear feasible, with potential openings in the application of the delta rule and in the calculation of $\Sigma$ of Lemma 1. We have used the arcsine and Fisher transforms to improve approximate normality of $\hat{\kappa}_{d}$ and $\hat{\pi}_{d}$, but this choice is semi-arbitrary.
Better variance-stabilizing transformations might be found by inspecting the formula for the variances of $\hat{\kappa}_{d}$ and $\hat{\pi}_{d}$ in Proposition 1. The confidence intervals used in the simulation are only known to be first-order accurate. To make second-order accurate confidence intervals, it would be possible to use the explicit formula for the variances to construct studentized confidence intervals, i.e., bootstrap-t intervals (Efron, 1987), which are second-order accurate.
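As a rough illustration of the three interval types compared in the simulations, the following Python sketch builds basic, Fisher, and arcsine intervals by the usual delta-method recipe: transform the estimate, rescale the standard error by the derivative of the transform, and map the endpoints back. The function name and arguments are hypothetical. The standard error `se` is assumed to be supplied by the caller, and the exact constructions used in the paper may differ in detail.

```python
import math
from statistics import NormalDist


def transformed_interval(kappa_hat, se, transform="fisher", level=0.95):
    """Delta-method confidence interval for an agreement coefficient.

    Assumes -1 < kappa_hat < 1 and that `se` is the standard error of
    kappa_hat on the original scale.
    """
    z = NormalDist().inv_cdf(0.5 + level / 2)
    if transform == "basic":
        return kappa_hat - z * se, kappa_hat + z * se
    if transform == "fisher":
        # d/dk atanh(k) = 1 / (1 - k^2)
        center = math.atanh(kappa_hat)
        half = z * se / (1 - kappa_hat ** 2)
        return math.tanh(center - half), math.tanh(center + half)
    if transform == "arcsine":
        # d/dk asin(k) = 1 / sqrt(1 - k^2)
        center = math.asin(kappa_hat)
        half = z * se / math.sqrt(1 - kappa_hat ** 2)
        # Clip to the range where sin is monotone before mapping back.
        lower = max(center - half, -math.pi / 2)
        upper = min(center + half, math.pi / 2)
        return math.sin(lower), math.sin(upper)
    raise ValueError(f"unknown transform: {transform}")
```

Unlike the basic interval, the two transformed intervals cannot extend beyond $[-1, 1]$, which is one plausible reason for their better behaviour when the coefficient is close to 1.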

None of these approaches is guaranteed to help when $n$ is small, especially when dealing with categorical data, as the sampling distributions of $\hat{\kappa}_{d}$ and $\hat{\pi}_{d}$ are discrete and highly irregular. For example, consider the sample distribution of the Perreault–Leigh model (Sect. 5.1) when $n = 20$ and $R = 20$, displayed in Fig. 2. (We omit a dominating spike at 1.)
As there are $C = 5 < \infty$ categories, there is a finite number of possible values for $\hat{\kappa}_{d}$ to take, which is strongly reflected in the plots, especially for the nominal weight.

Figure 2 Sample distribution of $\hat{\kappa}_{d}$ for nominal (left) and absolute value (right) weights. Both plots omit a dominating spike at 1. Here $n = 20$ and $j = 5$, and we use the Perreault–Leigh model (same parameters as in Sect. 5.1) to simulate the data. There were 2573 unique values for the nominal weight and 8790 unique values for the absolute value weight after $N = 200{,}000$ simulations.

The superior performance of methods such as the bootstrap-t depends on the quantity $\frac{\hat{\theta} - \theta}{\mathrm{se}}$ being approximately pivotal, that is, approximately the same for all parameters, possibly after applying a transformation. Judging from the plots in Fig. 2, there is no such transformation.

Funding

Open access funding provided by Norwegian Business School.

Appendix

Agreement Versus Disagreement

Agreement weighting functions are frequently standardized to guarantee that $w(x_{1}, x_{2}) \geq 0$, e.g., $w(x_{1}, x_{2}) = 1 - |x_{1} - x_{2}| / \max(|x_{1} - x_{2}|)$ for the absolute value weights. Standardization is not necessary: it does not change the values of $\kappa_{d}$ and $\pi_{d}$ when it is possible (i.e., when $\max(|x_{1} - x_{2}|) < \infty$), and it is not defined otherwise.
We choose not to use this operation, as it does not change the value of the agreement coefficients in this paper and is impossible to do when the range of $x_{1}, x_{2}$ is unbounded.
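A short worked equation shows why the standardizing constant cancels. The symbols are introduced here only for illustration: $c = \max(|x_{1} - x_{2}|) < \infty$ is the standardizing constant, $D$ is the expected observed disagreement, and $D'$ is the expected chance disagreement. With $w = 1 - d / c$, the expected observed agreement is $1 - D/c$ and the expected chance agreement is $1 - D'/c$, so that

$$\frac{(1 - D/c) - (1 - D'/c)}{1 - (1 - D'/c)} = \frac{(D' - D)/c}{D'/c} = 1 - \frac{D}{D'},$$

which does not depend on $c$.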

Proof of Equivalence Between $V(d_{p})(x_{1}, x_{2})$ and $\|x_{1} - x_{2}\|$

Proof

We will show that

$$V(d_{p})[x_{1}, x_{2}] = \frac{1}{2} \|x_{1} - x_{2}\|_{p}, \qquad V(d_{p}^{p})[x_{1}, x_{2}] = \frac{1}{2^{p}} \|x_{1} - x_{2}\|_{p}^{p}.$$

First, consider the case when $p \geq 1$. Using translation invariance and homogeneity of the norm,

$$\begin{aligned}
&\|x_{1} - \mu\|_{p} + \|x_{2} - \mu\|_{p} \\
&\quad = \left\|x_{1} - \frac{x_{1} + x_{2}}{2} - \mu + \frac{x_{1} + x_{2}}{2}\right\|_{p} + \left\|x_{2} - \frac{x_{1} + x_{2}}{2} - \mu + \frac{x_{1} + x_{2}}{2}\right\|_{p} \\
&\quad = \left\|\frac{x_{1} - x_{2}}{2} - \nu\right\|_{p} + \left\|-\frac{x_{1} - x_{2}}{2} - \nu\right\|_{p} \\
&\quad = \|a - \nu\|_{p} + \|a + \nu\|_{p},
\end{aligned}$$

where $a = \frac{x_{1} - x_{2}}{2}$ and $\nu = \mu - \frac{x_{1} + x_{2}}{2}$.

Observe that

$$\operatorname{argmin}_{\nu} \left( \|a + \nu\|_{p} + \|a - \nu\|_{p} \right) = 0 \quad \text{for all } a$$

implies $\mu = \frac{x_{1} + x_{2}}{2}$.

By the Minkowski inequality,

$$2^{p} \|a\|^{p} = \|a + \nu + a - \nu\|^{p} \leq \left( \|a - \nu\| + \|a + \nu\| \right)^{p}.$$

Equality holds when $\nu = 0$, since then $\|a - \nu\| = \|a + \nu\| = \|a\|$ and the right-hand side equals $(\|a\| + \|a\|)^{p} = 2^{p} \|a\|^{p}$. Now it is easy to verify that $V(d_{p})$ and $V(d_{p}^{p})$ have the claimed form; just substitute the value $\mu = \frac{x_{1} + x_{2}}{2}$ into the formula for the Fréchet variance, $\frac{1}{2} (\|x_{1} - \mu\|_{p} + \|x_{2} - \mu\|_{p})$.

When $0 < p < 1$, the function $\mu \mapsto \|x_{1} - \mu\|_{p} + \|x_{2} - \mu\|_{p}$ is piecewise concave on $(-\infty, x_{1}]$, $[x_{1}, x_{2}]$, and $[x_{2}, \infty)$; hence, its minimum is attained at $x_{1}$, $x_{2}$, or both. It is clear that both $x_{1}$ and $x_{2}$ map to $\|x_{1} - x_{2}\|_{p}$; hence, both are Fréchet means. The case $p = 0$ is obvious and omitted. $\square$
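The case $p \geq 1$ can be checked numerically. The Python sketch below is an illustration rather than part of the proof: it draws two random vectors, evaluates the Fréchet objective at the midpoint, and confirms that randomly drawn candidate centres never do better. The helper names, the dimension, and the random candidates are assumptions made for this example.

```python
import numpy as np


def pnorm(v, p):
    """The p-norm of a vector, for p >= 1."""
    return np.sum(np.abs(v) ** p) ** (1 / p)


def objective(x1, x2, mu, p):
    """Average p-norm distance from x1 and x2 to the candidate centre mu."""
    return 0.5 * (pnorm(x1 - mu, p) + pnorm(x2 - mu, p))


def check_midpoint(p, dim=3, n_candidates=1000, seed=0):
    """Return (claimed value, value at midpoint, best value among random centres)."""
    rng = np.random.default_rng(seed)
    x1, x2 = rng.normal(size=dim), rng.normal(size=dim)
    target = 0.5 * pnorm(x1 - x2, p)
    at_mid = objective(x1, x2, (x1 + x2) / 2, p)
    candidates = rng.normal(size=(n_candidates, dim))
    best_random = min(objective(x1, x2, mu, p) for mu in candidates)
    return target, at_mid, best_random


for p in (1, 2, 3):
    target, at_mid, best_random = check_midpoint(p)
    print(p, target, at_mid, best_random)
```

For each $p$, the value at the midpoint should agree with $\frac{1}{2} \|x_{1} - x_{2}\|_{p}$ up to rounding, and no random candidate should fall below it.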

True Values in the Normal Simulation

We give a brief explanation of why the true values of $\kappa_{d}$ and $\pi_{d}$ are $\rho$ for the quadratic weights and $1 - \sqrt{1 - \rho}$ for the absolute value weights, that is, 0.7 and $1 - \sqrt{0.3} \approx 0.45$ for the value $\rho = 0.7$ used in Sect. 5.1.2.

First notice that, since the marginals of $X_{r_{1}}$ and $X_{r_{2}}$ are equal for all $r_{1}, r_{2}$, we have that $\kappa_{d} = \pi_{d}$. Moreover, we can ignore the number of raters, since the pairwise distributions do not depend on it. Then, from standard theory about the multivariate and folded normal, we find that

$$E(|X_{r_{1}} - X_{r_{2}}|) = 2 \sqrt{\frac{1 - \rho}{\pi}}, \qquad E(|X_{r_{1}} - X_{r_{2}}|^{2}) = 2 (1 - \rho).$$

Let $X'_{r_{1}}$ be a copy of $X_{r_{1}}$ that is independent of $X_{r_{2}}$. Then $E(|X'_{r_{1}} - X_{r_{2}}|) = 2 / \sqrt{\pi}$ and $E(|X'_{r_{1}} - X_{r_{2}}|^{2}) = 2$. Now rewrite the kappas using disagreement instead of agreement. Use the fact that $(p_{wa} - p_{fa}) / (1 - p_{fa}) = 1 - d_{wa} / d_{fa}$, where $d_{wa} = 1 - E(w(X_{r_{1}}, X_{r_{2}}))$ and $d_{fa} = 1 - E(w(X'_{r_{1}}, X_{r_{2}}))$.

Thus, $κ_{d} = π_{d} = 1 - E (| X_{r_{1}} - X_{r_{2}} |) / E (| X'_{r_{1}} - X_{r_{2}} |) = 1 - \sqrt{1 - ρ}$ for the absolute value weights and $κ_{d} = π_{d} = 1 - E (| X_{r_{1}} - X_{r_{2}} |^{2}) / E (| X'_{r_{1}} - X_{r_{2}} |^{2}) = ρ$ for the quadratic weights.
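The two closed forms can be checked numerically. The following is a minimal sketch, assuming standard normal ratings with pairwise correlation ρ (an assumption chosen to match the moments quoted above); the function names `closed_form_kappas` and `monte_carlo_kappas` are illustrative and do not come from the paper.

```python
import numpy as np


def closed_form_kappas(rho):
    """Closed-form values for absolute value and quadratic disagreement."""
    return 1.0 - np.sqrt(1.0 - rho), rho


def monte_carlo_kappas(rho, n_draws=200_000, seed=0):
    """Monte Carlo estimates of 1 - d_wa / d_fa for both disagreement functions.

    Assumes standard normal ratings with correlation rho (an assumption of
    this sketch, not a statement about the paper's simulation code).
    """
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    x = rng.multivariate_normal(np.zeros(2), cov, size=n_draws)
    x1, x2 = x[:, 0], x[:, 1]
    # Independent copy of X_{r_1}: same marginal, independent of X_{r_2}.
    x1_copy = rng.standard_normal(n_draws)

    observed = np.abs(x1 - x2)      # disagreement between the two raters
    chance = np.abs(x1_copy - x2)   # disagreement under independence
    kappa_abs = 1.0 - observed.mean() / chance.mean()
    kappa_quad = 1.0 - (observed**2).mean() / (chance**2).mean()
    return kappa_abs, kappa_quad
```

Calling both functions with the same ρ should give values that agree up to Monte Carlo error.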

Variance of U-Statistics

Let $U_{1}^{n}$ and $U_{2}^{n}$ be two U-statistics of n observations with symmetric kernels $ψ_{1}$ and $ψ_{2}$ of dimensions $k_{1}$ and $k_{2}$. Define

(6.1)

\begin{matrix} σ_{cc}^{2} = Cov (E [ψ_{1} (X_{1}, \dots, X_{k_{1}}) ∣ X_{1}, \dots, X_{c}], E [ψ_{2} (X_{1}, \dots, X_{k_{2}}) ∣ X_{1}, \dots, X_{c}]) . \end{matrix}

Proposition 2

The exact covariance of $U_{1}^{n}$ and $U_{2}^{n}$ is

\begin{matrix} Cov (U_{1}^{n}, U_{2}^{n}) = {(\begin{matrix} n \\ k_{1} \end{matrix})}^{- 1} \sum_{c = 1}^{k_{1}} (\begin{matrix} k_{2} \\ c \end{matrix}) (\begin{matrix} n - k_{2} \\ k_{1} - c \end{matrix}) σ_{cc}^{2} . \end{matrix}

If $k_{1}$ and $k_{2}$ are fixed, the covariance satisfies $n Cov (U_{1}^{n}, U_{2}^{n}) \to k_{1} k_{2} σ_{11}^{2}$ as $n \to \infty$, since only the $c = 1$ term of the sum is of order $n^{- 1}$.

Proof

See (Lee, 2019, Theorem 2, p. 17) and (Lee, 2019, Theorem 2, p. 76). $□$
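The covariance formula in Proposition 2 is straightforward to evaluate. The sketch below is illustrative only: the function name `u_statistic_covariance` and the convention that `sigma_sq[c - 1]` stores the value of (6.1) for $c = 1, \dots, k_{1}$ are assumptions of this example. As a sanity check, the sample variance of standard normal data is a U-statistic with kernel $(x - y)^{2} / 2$, for which (6.1) gives $1 / 2$ at $c = 1$ and $2$ at $c = 2$, and whose exact variance is known to be $2 / (n - 1)$.

```python
from math import comb


def u_statistic_covariance(n, k1, k2, sigma_sq):
    """Exact covariance of two U-statistics, following Proposition 2.

    sigma_sq[c - 1] holds the value of (6.1) for c = 1, ..., k1
    (a storage convention assumed for this sketch).
    """
    if len(sigma_sq) != k1:
        raise ValueError("need one value of (6.1) for each c = 1, ..., k1")
    total = sum(
        comb(k2, c) * comb(n - k2, k1 - c) * sigma_sq[c - 1]
        for c in range(1, k1 + 1)
    )
    return total / comb(n, k1)
```

With `k1 = k2 = 2` and `sigma_sq = [0.5, 2.0]`, the function reproduces $2 / (n - 1)$, and multiplying by n for large n gives a value close to $k_{1} k_{2}$ times the $c = 1$ term, here $2$.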

Expanding the Definitions

Here is a sketch of how we could expand the definitions in Sect. 2 to encompass more complicated scenarios. We restrict ourselves to $g = 2$, but the analysis can be expanded to arbitrary g. Suppose that any finite number of raters R is possible, that the raters are not exchangeable, and that not every item is rated by every rater.

Let X denote a rating, R the rater who gave it, and I the item rated. Suppose we sample two triples $(X_{1}, R_{1}, I_{1}), (X_{2}, R_{2}, I_{2})$ independently from the same distribution F. Then we may define

(6.2)

\begin{matrix} D_{d} = & E [d (X_{1}, X_{2}) ∣ I_{1} = I_{2}, R_{1} \neq R_{2}], \\ C_{d} = & E [d (X_{1}, X_{2}) ∣ R_{1} \neq R_{2}], \\ F_{d} = & E [d (X_{1}, X_{2})] . \end{matrix}

These quantities have natural sample analogues; e.g.,

\begin{matrix} {\hat{D}}_{d} = N^{- 1} \sum_{i = 1}^{n} \sum_{r_{1} \neq r_{2}} d (x_{i r_{1}}, x_{i r_{2}}), \end{matrix}

where N is the total number of paired observations and the rater indices run over the raters who rated the ith item. Population and sample definitions of Cohen’s kappa and Fleiss’s kappa follow as laid out in the main text, e.g., $κ_{d} = 1 - D_{d} / C_{d}$.
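The sample analogues can be sketched in a few lines for small data sets. This is a hedged illustration, not the paper's implementation: the function name `disagreement_estimates`, the use of `np.nan` for a missing rating, and the choice to average the analogues of $C_{d}$ and $F_{d}$ over all ordered pairs of observed ratings (including, for $F_{d}$, a rating paired with itself) are assumptions of this example. Only the form of ${\hat{D}}_{d}$ is given explicitly above.

```python
import itertools

import numpy as np


def disagreement_estimates(x, d):
    """Sample analogues of D_d, C_d and F_d for g = 2 with missing ratings.

    x is an (items, raters) array with np.nan marking a missing rating.
    All averages run over ordered pairs of observed ratings.
    """
    # Observed (rating, rater, item) triples.
    obs = [
        (x[i, r], r, i)
        for i in range(x.shape[0])
        for r in range(x.shape[1])
        if not np.isnan(x[i, r])
    ]
    same_item, diff_rater, all_pairs = [], [], []
    for (x1, r1, i1), (x2, r2, i2) in itertools.product(obs, repeat=2):
        value = d(x1, x2)
        all_pairs.append(value)          # F_d: no conditioning
        if r1 != r2:
            diff_rater.append(value)     # C_d: different raters
            if i1 == i2:
                same_item.append(value)  # D_d: same item, different raters
    return np.mean(same_item), np.mean(diff_rater), np.mean(all_pairs)
```

Given the three returned values, the Cohen-type coefficient is one minus the first divided by the second, and the Fleiss-type coefficient is one minus the first divided by the third. The double loop is quadratic in the number of observed ratings, so this form is only suitable for small examples.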

Table 8 Coverage (first entry) and lengths (second entry) of confidence intervals: Perreault–Leigh model, Fleiss’s kappa.

Table 9 Coverage (first entry) and lengths (second entry) of confidence intervals: Normal model, Fleiss’s kappa.

Table 10 Coverage (first entry) and lengths (second entry) of confidence intervals: Perreault–Leigh model, Fleiss’s kappa ($R = 5$).

Krippendorff’s Alpha

Now suppose that the ratings can take on only a finite number C of distinct values. Define $o_{ck}$ as the number of times a pair of raters has classified an item into c and k, i.e.,

\begin{matrix} o_{ck} = \sum_{i = 1}^{n} \sum_{r_{1} \neq r_{2}} 1 [x_{i r_{1}} = c, x_{i r_{2}} = k] . \end{matrix}

Then $N = \sum_{c, k} o_{ck}$ and ${\hat{D}}_{d} = N^{- 1} \sum_{c, k} o_{ck} d (c, k)$. Moreover, define $n_{c}$ as the number of items classified as c. Then $n_{c} = \sum_{k} o_{ck}$, $\sum_{c} n_{c} = N$, and $\sum_{c, k} n_{c} n_{k} d (c, k) = N^{2} {\hat{F}}_{d}$.

Proposition 3

Using the above definitions, ${\hat{α}}_{d} = {\hat{π}}_{d} + \frac{1}{N} (1 - {\hat{π}}_{d})$. Since there are $N = 2 R n$ rating pairs in the rectangular setup used in Sect. 2, ${\hat{α}}_{d} = {\hat{π}}_{d} + \frac{1}{2 R n} (1 - {\hat{π}}_{d})$ in that case.

Proof

The definition of ${\hat{α}}_{d}$ can be found in Krippendorff (2018, p. 235):

\begin{matrix} {\hat{α}}_{d} = 1 - (N - 1) \frac{\sum_{c \neq k} o_{ck} d (c, k)}{\sum_{c \neq k} n_{c} n_{k} d (c, k)} . \end{matrix}

From the above definitions, and the fact that $d (c, k) = 0$ when $c = k$, we find that

\begin{matrix} \sum_{c \neq k} o_{ck} d (c, k) = \sum_{c, k} o_{ck} d (c, k) = N {\hat{D}}_{d} . \end{matrix}

In the same way,

\begin{matrix} \sum_{c \neq k} n_{c} n_{k} d (c, k) = \sum_{c, k} n_{c} n_{k} d (c, k) = N^{2} {\hat{F}}_{d} . \end{matrix}

Thus,

\begin{matrix} {\hat{α}}_{d} = 1 - \frac{(N - 1)}{N} \frac{{\hat{D}}_{d}}{{\hat{F}}_{d}} = 1 - \frac{{\hat{D}}_{d}}{{\hat{F}}_{d}} + \frac{1}{N} \frac{{\hat{D}}_{d}}{{\hat{F}}_{d}}, \end{matrix}

and using that ${\hat{π}}_{d} = 1 - \frac{{\hat{D}}_{d}}{{\hat{F}}_{d}}$, we are done. $□$
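The identity in Proposition 3 is purely algebraic, so it can be confirmed on any small data set. The following sketch assumes complete integer-coded ratings and builds $o_{ck}$, $n_{c}$, and N exactly as defined above; the function name `alpha_and_pi` and its argument layout are illustrative choices for this example.

```python
import numpy as np


def alpha_and_pi(x, d, n_categories):
    """Krippendorff's alpha and the Fleiss-type coefficient from pair counts.

    x is an (items, raters) integer array with category labels
    0, ..., n_categories - 1 and no missing ratings (assumed here).
    """
    o = np.zeros((n_categories, n_categories))
    n_raters = x.shape[1]
    for row in x:
        for r1 in range(n_raters):
            for r2 in range(n_raters):
                if r1 != r2:
                    o[row[r1], row[r2]] += 1  # ordered pair (c, k)
    big_n = o.sum()      # N = sum over c, k of o_ck
    n_c = o.sum(axis=1)  # n_c = sum over k of o_ck
    dist = np.array(
        [[d(c, k) for k in range(n_categories)] for c in range(n_categories)],
        dtype=float,
    )
    observed = (o * dist).sum()                   # equals N * D-hat
    expected = (np.outer(n_c, n_c) * dist).sum()  # equals N^2 * F-hat
    alpha = 1.0 - (big_n - 1.0) * observed / expected
    pi = 1.0 - (observed / big_n) / (expected / big_n**2)
    return alpha, pi, big_n
```

For any disagreement function with $d (c, c) = 0$, the returned `alpha` should equal `pi + (1 - pi) / big_n` up to floating-point error.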

Proof of Correspondence with Gwet (2021)

Using the nominal disagreement function, Gwet (2021) uses the following estimator for the asymptotic variance of the pairwise Fleiss’s kappa:

\begin{matrix} {\hat{σ}}^{2} = \frac{1}{n - 1} \sum_{i = 1}^{n} {(κ_{i}^{⋆} - \hat{κ})}^{2} . \end{matrix}

Translating into our notation (dropping the dependence on the disagreement d), we have that $\hat{κ} = 1 - \hat{D} / \hat{F}$. Moreover, one can verify that $κ_{i}^{⋆}$ equals

\begin{matrix} κ_{i}^{⋆} = 1 - \frac{\hat{μ} (x_{i})}{\hat{F}} - 2 \frac{\hat{D}}{\hat{F}} (1 - \frac{{\hat{μ}}_{F} (x_{i})}{\hat{F}}), \end{matrix}

where $\hat{μ} (x_{i})$ and ${\hat{μ}}_{F} (x_{i})$ were defined in Sect. 4.

Following a small reorganization of the terms, we find that

\begin{matrix} \frac{1}{n - 1} \sum_{i = 1}^{n} {(κ_{i}^{⋆} - \hat{κ})}^{2} = \frac{1}{{\hat{F}}^{2}} \frac{1}{n - 1} \sum_{i = 1}^{n} {(2 \frac{\hat{D}}{\hat{F}} ({\hat{μ}}_{F} (x_{i}) - \hat{F}) - [\hat{μ} (x_{i}) - \hat{D}])}^{2} . \end{matrix}

Using the definitions of ${\hat{σ}}_{D}^{2}, {\hat{σ}}_{FD}$ and ${\hat{σ}}_{F}^{2}$ (cf. Sect. 4.2), one can verify using simple algebraic manipulations that

\begin{matrix} \frac{1}{n - 1} \sum_{i = 1}^{n} {(κ_{i}^{⋆} - \hat{κ})}^{2} = \frac{1}{{\hat{F}}^{2}} ({\hat{σ}}_{D}^{2} - 2 {\hat{σ}}_{FD} \frac{\hat{D}}{\hat{F}} + {\hat{σ}}_{F}^{2} \frac{{\hat{D}}^{2}}{{\hat{F}}^{2}}); \end{matrix}

hence, the estimator of Gwet (2021) is a special case of the proposed estimator in Sect. 4.2.

Simulation of Fleiss’s Kappa

Tables 8, 9, and 10 report the results of the simulation study in Sect. 5.1 for Fleiss’s kappa.

Footnotes

Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/s11336-023-09945-2.

1 For instance, Fleiss (1971), in his paper introducing Fleiss’s kappa, removed several ratings from this data to make sure the total number of ratings was 6 for each item.

2 Note that the concordance correlation coefficient is an intraclass correlation coefficient; see Carrasco & Jover (2003, p. 850).

3 The Schuster–Smith coefficient also encompasses the case of $2 < g < R$ provided their weight function v(s) is appropriately defined; see the discussion on dispersion weights in Schuster and Smith (2005).

4 The Fréchet mean and variances are usually defined slightly differently, using $l^{2} (x, x_{k})$ instead of $l (x, x_{k})$, with l being a metric. Our definition of the Fréchet mean is sometimes called the generalized Fréchet mean or the $α$-Fréchet mean.

5 We use fewer simulations (1,000) than in the previous simulation (10,000) since estimation is far more computationally expensive when dealing with continuous data, as it does not allow for binning.

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

Berry, K. J., Johnston, J. E., & Mielke, P. W., Jr. (2008). Weighted kappa for multiple raters. Perceptual and Motor Skills, 107(3), 837–848.

Berry, K. J., & Mielke, P. W. (1988). A generalization of Cohen’s kappa agreement measure to interval measurement and multiple raters. Educational and Psychological Measurement, 48(4), 921–933.

Carrasco, J. L., & Jover, L. (2003). Estimating the generalized concordance correlation coefficient through variance components. Biometrics, 59(4), 849–858.

Cicchetti, D. V., & Feinstein, A. R. (1990). High agreement but low kappa: II. Resolving the paradoxes. Journal of Clinical Epidemiology, 43(6), 551–558.

Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37–46.

Cohen, J. (1968). Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin, 70(4), 213–220.

Cohen, M. B., Lee, Y. T., Miller, G., Pachocki, J., & Sidford, A. (2016). Geometric median in nearly linear time. In Proceedings of the forty-eighth annual ACM symposium on theory of computing (pp. 9–21). Association for Computing Machinery.

Conger, A. J. (1980). Integration and generalization of kappas for multiple raters. Psychological Bulletin, 88(2), 322–328.

Cooil, B., & Rust, R. T. (1994). Reliability and expected loss: A unifying principle. Psychometrika, 59(2), 203–216.

Drezner, Z., Klamroth, K., Schöbel, A., & Wesolowsky, G. O. (2002). The Weber problem. In Z. Drezner & H. Horst (Eds.), Facility location: Applications and theory (pp. 1–36). Springer.

Dubey, P., & Müller, H. G. (2019). Fréchet analysis of variance for random objects. Biometrika, 106(4), 803–821.

Efron, B. (1987). Better bootstrap confidence intervals. Journal of the American Statistical Association, 82(397), 171–185.

Fisher, R. A. (1915). Frequency distribution of the values of the correlation coefficient in samples from an indefinitely large population. Biometrika, 10(4), 507–521.

Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5), 378–382.

Fréchet, M. (1948). Les éléments aléatoires de nature quelconque dans un espace distancié. Annales de l’institut Henri Poincaré, 10(4), 215–230.

Gwet, K. L. (2008). Computing inter-rater reliability and its variance in the presence of high agreement. The British Journal of Mathematical and Statistical Psychology, 61, 29–48.

Gwet, K. L. (2014). Handbook of inter-rater reliability. Advanced Analytics, LLC.

Gwet, K. L. (2021). Large-sample variance of Fleiss generalized kappa. Educational and Psychological Measurement.

Hoeffding, W. (1992). A class of statistics with asymptotically normal distribution. In S. Kotz & N. L. Johnson (Eds.), Breakthroughs in statistics: Foundations and basic theory (pp. 308–334). Springer.

Huber, P. J. (1964). Robust estimation of a location parameter. Annals of Mathematical Statistics, 35(1), 73–101.

Hubert, L. (1977). Kappa revisited. Psychological Bulletin, 84(2), 289–297.

Janson, H., & Olsson, U. (2001). A measure of agreement for interval or nominal multivariate observations. Educational and Psychological Measurement, 61(2), 277–289.

Korolyuk, V. S., & Borovskich, Y. V. (2013). Theory of U-statistics (1994th ed.). Springer.

Krippendorff, K. (1970). Bivariate agreement coefficients for reliability of data. Sociological Methodology, 2, 139–150.

Krippendorff, K. (2018). Content analysis: An introduction to its methodology.

Lee, A. J. (2019). U-statistics: Theory and practice. Routledge.

Lehmann, E. L. (2004). Elements of large-sample theory. Springer.

Light, R. J. (1971). Measures of response agreement for qualitative data: Some generalizations and alternatives. Psychological Bulletin, 76(5), 365–377.

Lin, L. I. (1989). A concordance correlation coefficient to evaluate reproducibility. Biometrics, 45(1), 255–268.

Lin, L. I. (1992). Assay validation using the concordance correlation coefficient. Biometrics, 48(2), 599–604.

Martín Andrés, A., & Álvarez Hernández, M. (2020). Hubert’s multi-rater kappa revisited. The British Journal of Mathematical and Statistical Psychology, 73(1), 1–22.

Maxwell, A. E. (1977). Coefficients of agreement between observers and their interpretation. The British Journal of Psychiatry, 130, 79–83.

Moss, J. (2023). Measuring agreement using guessing models and knowledge coefficients. Psychometrika.

O’Connell, D. L., & Dobson, A. J. (1984). General observer-agreement measures on individual subjects and groups of subjects. Biometrics, 40(4), 973–983.

Perreault, W. D., & Leigh, L. E. (1989). Reliability of nominal data based on qualitative judgments. Journal of Marketing Research, 26(2), 135–148.

Sandifer, M. G., Hordern, A., Timbury, G. C., & Green, L. M. (1968). Psychiatric diagnosis: A comparative study in North Carolina, London and Glasgow. The British Journal of Psychiatry, 114(506), 1–9.

Schouten, H. J. A. (1980). Measuring pairwise agreement among many observers. Biometrical Journal, 22(6), 497–504.

Schouten, H. J. A. (1982). Measuring pairwise agreement among many observers. II. Some improvements and additions. Biometrical Journal, 24(5), 431–435.

Schuster, C., & Smith, D. A. (2005). Dispersion-weighted kappa: An integrative framework for metric and nominal scale agreement coefficients. Psychometrika.

Scott, W. A. (1955). Reliability of content analysis: The case of nominal scale coding. Public Opinion Quarterly, 19(3), 321–325.

Serfling, R. J. (1980). Approximation theorems of mathematical statistics. Wiley.

van Oest, R. (2019). A new coefficient of interrater agreement: The challenge of highly unequal category proportions. Psychological Methods, 24(4), 439–451.

Varian, H. R. (1975). A Bayesian approach to real estate assessment. In A. Z. Stephen & E. Fienberg (Eds.), Studies in Bayesian econometric and statistics in honor of Leonard J. Savage (pp. 195–208). North Holland.

Warrens, M. J. (2012). Equivalences of weighted kappas for multiple raters. Statistical Methodology, 9(3), 407–422.

Warton, D. I., & Hui, F. K. C. (2011). The arcsine is asinine: The analysis of proportions in ecology. Ecology, 92(1), 3–10.

Zapf, A., Castell, S., Morawietz, L., & Karch, A. (2016). Measuring inter-rater reliability for nominal data – which coefficients and confidence intervals are appropriate? BMC Medical Research Methodology, 16, 93.

Table 1 Weighted agreement coefficients.

Table 2 Maximal agreement for the data of Fleiss (1971).

Figure 1 Simulated sampling distribution of ${\hat{κ}}_{d}$ for quadratic weights using three transformations, $n = 20$, $R = 3$. The simulation setup is described in Example 3. The arcsine transform makes the sampling distribution closest to the normal distribution.

Table 3 Confidence intervals for the data of Fleiss (1971) using the arcsine method.

Table 4 Confidence intervals for Zapf et al. (2016) using the arcsine method.

Table 5 Coverage (first entry) and lengths (second entry) of confidence intervals: Perreault–Leigh model, Cohen’s kappa.

Table 6 Coverage (first entry) and lengths (second entry) of confidence intervals: normal model, Cohen’s kappa.

Table 7 Coverage (first entry) and lengths (second entry) of confidence intervals for g-wise coefficients: Perreault–Leigh model, Cohen’s kappa.

Figure 2 Sample distribution of ${\hat{κ}}_{d}$ for nominal (left) and absolute value (right) weights. Both plots omit a dominating spike at 1. Here $n = 20$ and $j = 5$, and we use the Perreault–Leigh model (same parameters as in Sect. 5.1) to simulate the data. There were 2573 unique values for the nominal weight and 8790 unique values for the absolute value weight after $N = 200{,}000$ simulations.

