Quantifying rater drift on the HAM-D in a sample of standardized rater training events: Implications for reliability and sample size calculations

B. Rothman; C. Yavorsky; A. De Fries; J. Gordon; M. Opler

doi:10.1016/S0924-9338(11)72389-6

Published online by Cambridge University Press: 16 April 2020

B. Rothman: Affiliation:
ProPhase LLC, New York, NY, USA
C. Yavorsky: Affiliation:
ProPhase LLC, New York, NY, USA
A. De Fries: Affiliation:
ProPhase LLC, New York, NY, USA
J. Gordon: Affiliation:
ProPhase LLC, New York, NY, USA
M. Opler: Affiliation:
ProPhase LLC, New York, NY, USA; New York University, New York, NY, USA

Abstract

Introduction/objectives/aims

Although rater drift in clinical trials has long been understood to negatively affect trial results, few studies have systematically quantified it. We examined training data for the HAM-D (Hamilton Depression Scale, 17-item version) at two time points to measure the extent of drift.

Methods

Raters participating in a standardized training scored the HAM-D based on two videotaped interviews of depressed patients. To assess drift, data from an initial session held after online training were compared with data obtained 12 months later. Intra-class correlation coefficients (Shrout & Fleiss, 1979) and concordance with expert ratings were compared.
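The abstract does not say which of the Shrout & Fleiss ICC forms was used. As a minimal sketch, assuming the two-way random-effects, absolute-agreement, single-rater form ICC(2,1), the coefficient can be computed from a targets-by-raters score matrix as below. The function name `icc_2_1` is illustrative, and the example matrix is the worked example from Shrout & Fleiss (1979), not data from this study.

```python
import numpy as np


def icc_2_1(scores):
    """ICC(2,1) of Shrout & Fleiss (1979) for an n_targets x k_raters matrix.

    Assumption: two-way random effects, absolute agreement, single rater.
    The abstract does not state which ICC form the authors used.
    """
    x = np.asarray(scores, dtype=float)
    n, k = x.shape
    grand = x.mean()

    # Sums of squares from the two-way ANOVA decomposition.
    ss_targets = k * ((x.mean(axis=1) - grand) ** 2).sum()
    ss_raters = n * ((x.mean(axis=0) - grand) ** 2).sum()
    ss_error = ((x - grand) ** 2).sum() - ss_targets - ss_raters

    # Mean squares: between targets, between raters (judges), residual.
    bms = ss_targets / (n - 1)
    jms = ss_raters / (k - 1)
    ems = ss_error / ((n - 1) * (k - 1))

    return (bms - ems) / (bms + (k - 1) * ems + k * (jms - ems) / n)


# Worked example from Shrout & Fleiss (1979): 6 targets rated by 4 judges.
shrout_fleiss = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
print(round(icc_2_1(shrout_fleiss), 2))  # about 0.29, as reported in that paper
```

Comparing this coefficient between the initial session and the 12-month session is one way to express drift as a change in reliability.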

Results

Intra-class correlation coefficients (ICCs) for raters (n = 167) following initial training were good to excellent for individual raters (.695–.976, p < .0001) and good for the overall cohort (.752, p < .0001). Concordance with expert ratings was excellent at 99.3%. The overall ICC fell to .730 at the second assessment, and although the upper bound of individual performance remained in the good to excellent range, the frequency of scores in the poor to fair range (< .65) increased. Concordance also fell slightly to 87%.

Conclusions

Rater drift occurred over 12 months, as gauged by the metrics of reliability and concordance. Drift was apparent in a limited portion of the cohort but resulted in a lower overall ICC at the second time point. Because studies are generally powered on the assumption that the ICC remains stable, this has implications for both the power calculation and the required sample size.
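The abstract does not describe how reliability enters the power calculation. One common way to illustrate the link, assumed here rather than taken from the authors, is the classical attenuation model: measurement error shrinks the observed standardized effect to the true effect multiplied by the square root of the reliability, so the required sample size grows roughly in proportion to 1/ICC. The sketch below applies this to a two-arm comparison using the normal approximation. The function name `n_per_group` and the true effect size of 0.5 are hypothetical; only the two ICC values come from the abstract.

```python
from math import ceil, sqrt
from statistics import NormalDist


def n_per_group(true_d, icc, alpha=0.05, power=0.80):
    """Approximate per-arm sample size for a two-sided, two-sample comparison.

    Assumption: the observed effect size is attenuated to true_d * sqrt(icc)
    (classical test theory). This is not the authors' stated method.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    observed_d = true_d * sqrt(icc)
    return ceil(2 * (z_alpha + z_beta) ** 2 / observed_d ** 2)


# Hypothetical true effect size; ICC values are those reported in the abstract.
hypothetical_d = 0.5
for label, icc in [("initial training", 0.752), ("12 months later", 0.730)]:
    print(label, icc, n_per_group(hypothetical_d, icc))
```

Under this model, a trial powered with the initial ICC would be somewhat underpowered if the ICC in force during the trial were the lower 12-month value.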

Type: P02-88
Information: European Psychiatry, Volume 26, Issue S2: Abstracts of the 19th European Congress of Psychiatry, March 2011, p. 683

DOI: https://doi.org/10.1016/S0924-9338(11)72389-6
