
Repeatability, Reproducibility, and Diagnostic Accuracy of a Commercial Large Language Model (ChatGPT) to Perform Disaster Triage Using the Simple Triage and Rapid Treatment (START) Protocol

Published online by Cambridge University Press:  31 October 2024

Jeffrey Michael Franc
Affiliation:
University of Alberta, Edmonton, AB, Canada; Università del Piemonte Orientale, Novara, NO, Italy
Atilla Hertelendy
Affiliation:
Beth Israel Deaconess Medical Center, Boston, MA, USA
Lenard Cheng
Affiliation:
Beth Israel Deaconess Medical Center, Boston, MA, USA
Ryan Hata
Affiliation:
Beth Israel Deaconess Medical Center, Boston, MA, USA
Manuela Verde
Affiliation:
Università del Piemonte Orientale, Novara, NO, Italy

Abstract

Objective

The release of ChatGPT in November 2022 drastically lowered the barrier to artificial intelligence with an intuitive web-based interface to a large language model. This study addressed the research problem: “Can ChatGPT adequately triage simulated disaster patients using the Simple Triage and Rapid Treatment (START) tool?”

Methods

Five trained disaster medicine physicians developed nine prompts. A Python script queried ChatGPT Version 4 with each prompt combined with each of 391 validated patient vignettes. Ten repetitions of each combination were performed, yielding 35,190 simulated triages.
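The query design described above can be sketched in Python. This is an illustrative reconstruction, not the authors' script: the function name `query_chatgpt` and its stubbed return value are hypothetical stand-ins for the actual ChatGPT API call, included only so the enumeration logic is runnable.

```python
from itertools import product

def query_chatgpt(prompt: str, vignette: str) -> str:
    # Hypothetical stand-in for the actual ChatGPT Version 4 API call
    # used in the study; returns a fixed START category so the loop runs.
    return "Yellow"

def run_triage_study(prompts, vignettes, repetitions=10):
    """Query every prompt/vignette combination `repetitions` times."""
    results = []
    for prompt, vignette in product(prompts, vignettes):
        for _ in range(repetitions):
            results.append((prompt, vignette, query_chatgpt(prompt, vignette)))
    return results

# 9 prompts x 391 vignettes x 10 repetitions = 35,190 simulated triages
prompts = [f"prompt-{i}" for i in range(1, 10)]
vignettes = [f"vignette-{j}" for j in range(1, 392)]
results = run_triage_study(prompts, vignettes)
print(len(results))  # 35190
```

The full factorial design (every prompt paired with every vignette, each repeated ten times) is what allows the later variance decomposition into repeatability and reproducibility components.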

Results

A valid START score was returned in 35,102 of 35,190 queries (99.7%). There was considerable variability in the results: repeatability (repeated use of the same prompt) accounted for 14.0% of overall variation, and reproducibility (use of different prompts) accounted for 4.1%. Accuracy of ChatGPT for START was 61.4%, with a 5.0% under-triage rate and a 33.6% over-triage rate. Accuracy varied by prompt between 45.8% and 68.6%.
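The accuracy, under-triage, and over-triage rates reported above can be computed from pairs of assigned and correct START categories. The sketch below is an assumption about the metric definitions, not the authors' code: in particular, ranking Black (expectant) as the highest acuity level is an illustrative choice, and the example data are toy values.

```python
# START categories ordered by acuity; treating Black (expectant) as
# highest acuity is an assumption made for this illustration only.
ACUITY = {"Green": 0, "Yellow": 1, "Red": 2, "Black": 3}

def triage_metrics(pairs):
    """pairs: list of (assigned, correct) START categories.

    Returns (accuracy, under-triage rate, over-triage rate), where
    under-triage assigns a lower acuity than correct and over-triage
    assigns a higher one.
    """
    n = len(pairs)
    correct = sum(1 for a, c in pairs if a == c)
    under = sum(1 for a, c in pairs if ACUITY[a] < ACUITY[c])
    over = sum(1 for a, c in pairs if ACUITY[a] > ACUITY[c])
    return correct / n, under / n, over / n

# Toy example: 4 triages -> 2 correct, 1 under-triaged, 1 over-triaged
pairs = [("Red", "Red"), ("Green", "Yellow"),
         ("Red", "Yellow"), ("Green", "Green")]
acc, under, over = triage_metrics(pairs)
print(acc, under, over)  # 0.5 0.25 0.25
```

Because every triage is either correct, under-triaged, or over-triaged, the three rates sum to 1 over the valid responses, consistent with the reported 61.4% + 5.0% + 33.6%.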

Conclusions

This study suggests that the current ChatGPT large language model is not sufficient for triage of simulated patients using START due to poor repeatability and accuracy. Medical practitioners should be aware that while ChatGPT can be a valuable tool, it may lack consistency and may provide false information.

Type
Abstract
Copyright
© The Author(s), 2024. Published by Cambridge University Press on behalf of Society for Disaster Medicine and Public Health, Inc.
Supplementary material: Franc et al. supplementary material (File, 289.4 KB)