Repeatability, Reproducibility, and Diagnostic Accuracy of a Commercial Large Language Model (ChatGPT) to Perform Disaster Triage Using the Simple Triage and Rapid Treatment (START) Protocol
Published online by Cambridge University Press: 31 October 2024
Abstract
The release of ChatGPT in November 2022 drastically lowered the barrier to using artificial intelligence by providing an intuitive web-based interface to a large language model. This study addressed the research question: “Can ChatGPT adequately triage simulated disaster patients using the Simple Triage and Rapid Treatment (START) tool?”
Five trained disaster medicine physicians developed nine prompts. A Python script queried ChatGPT Version 4 with each prompt combined with each of 391 validated patient vignettes. Each prompt-vignette combination was repeated ten times, for a total of 35,190 simulated triages.
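The size of this design follows directly from the three factors (9 prompts × 391 vignettes × 10 repetitions). The sketch below is only an illustration of that enumeration, not the authors' script: the function name `build_trials` and the placeholder prompt and vignette strings are hypothetical, and no call to ChatGPT is made.

```python
from itertools import product


def build_trials(prompts, vignettes, repetitions):
    """Return every (prompt index, vignette index, repetition) combination.

    Hypothetical helper: the study's actual script is not shown in the
    abstract, so this only reproduces the full-factorial layout it describes.
    """
    return list(product(range(len(prompts)), range(len(vignettes)), range(repetitions)))


# Placeholder inputs standing in for the real prompts and vignettes.
prompts = [f"prompt {i}" for i in range(9)]
vignettes = [f"vignette {i}" for i in range(391)]

trials = build_trials(prompts, vignettes, repetitions=10)
print(len(trials))  # 9 * 391 * 10 = 35190
```

Enumerating the grid up front, rather than nesting loops around the query call, makes it straightforward to check that no combination was skipped and to resume an interrupted run.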
A valid START score was returned in 35,102 queries (99.7%). There was considerable variability in the results. Repeatability (use of the same prompt repeatedly) was responsible for 14.0% of overall variation. Reproducibility (use of different prompts) was responsible for 4.1% of overall variation. Accuracy of ChatGPT for START was 61.4% with a 5.0% under-triage rate and a 33.6% over-triage rate. Accuracy varied by prompt between 45.8% and 68.6%.
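Accuracy, over-triage, and under-triage partition the scored responses, which is why the three reported rates sum to 100%. A minimal sketch of how such rates can be computed is given below. It assumes that over-triage means assigning a higher acuity than the reference category and under-triage a lower one; the function `triage_rates`, the `rank` mapping, and the five example labels are hypothetical and are not study data. The black category is left out of the example because where it sits in the ordering is a convention the abstract does not state.

```python
def triage_rates(predicted, reference, rank):
    """Return accuracy, over-triage, and under-triage as fractions of all cases.

    `rank` maps each category to an ordinal acuity (higher = more urgent).
    This ordering is an assumption made for illustration.
    """
    if len(predicted) != len(reference) or not reference:
        raise ValueError("predicted and reference must be non-empty and equal in length")

    correct = over = under = 0
    for p, r in zip(predicted, reference):
        if rank[p] == rank[r]:
            correct += 1
        elif rank[p] > rank[r]:
            over += 1   # model assigned a more urgent category than the reference
        else:
            under += 1  # model assigned a less urgent category than the reference

    n = len(reference)
    return {"accuracy": correct / n, "over_triage": over / n, "under_triage": under / n}


# Illustrative labels only.
rank = {"green": 0, "yellow": 1, "red": 2}
reference = ["green", "yellow", "red", "red", "yellow"]
predicted = ["yellow", "yellow", "red", "yellow", "red"]

rates = triage_rates(predicted, reference, rank)
print(rates)
```

Passing the acuity ordering in as an argument keeps the convention explicit, so the same function can be reused if a study ranks the categories differently.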
This study suggests that the current ChatGPT large language model is not sufficient for triage of simulated patients using START due to poor repeatability and accuracy. Medical practitioners should be aware that while ChatGPT can be a valuable tool, it may lack consistency and may provide false information.
- Type: Abstract
- Copyright: © The Author(s), 2024. Published by Cambridge University Press on behalf of Society for Disaster Medicine and Public Health, Inc.