Introduction
Large language models, or chatbots, such as Chat Generative Pre-trained Transformer (ChatGPT) use language processing to generate conversational responses to written inputs.1 ChatGPT, a free online tool trained on millions of pages of data from across the internet with data current to September 2021, has made substantial inroads into the field of medicine, even demonstrating the ability to pass the US Medical Licensing Examination.2 It is the fastest growing consumer application to date, having reached over 100 million users by January 2023.3
Approximately 80 per cent of internet users search online for health information.4 ChatGPT seems to have the potential to upend medical care by providing patients with more data than a simple online search, in language that a layperson can understand. However, it has limitations: it can give users a different answer depending on input phrasing, it may write plausible-sounding but incorrect or nonsensical answers, and it has been noted to perpetuate disparities and biases in race, sex and culture.5–7 As medical providers, we should be up to date with the online tools available to our patients and be able to provide our opinion on the information available, and we anticipate that patients may seek information from ChatGPT.
ChatGPT's responses to clinical vignettes have also been compared with those of physicians in terms of diagnostic accuracy and treatment plans, specifically within the otolaryngology field. Physicians tended to agree strongly with the differential diagnoses and treatment plans generated by ChatGPT.8
With the further integration of artificial intelligence (AI) into our society, especially via large language models such as ChatGPT, this study aimed to determine the utility of ChatGPT as a patient resource. The patient information and resources available to families can be overwhelming and daunting. A language model designed to respond in a conversational tone to any question has great potential to simplify the patient experience. Before this resource is endorsed to patients and families, it is necessary to confirm that its recommendations and advice align with those from accredited sources.
Materials and methods
This study aimed to investigate the utility of ChatGPT as a patient resource. Four common paediatric otolaryngology conditions were studied: snoring, sleep apnoea, treatment of sleep apnoea and ear wax (cerumen) impaction. Two questions for each condition were entered into ChatGPT version 3.5, with the questions for each condition varied slightly to test for consistency. The responses generated by ChatGPT were then compared with the top internet page recommended by an online search engine in the following domains: readability (Flesch Reading Ease score and word count), expediency (time taken to generate response), validity (comparison of recommendations to American Academy of Otolaryngology Head and Neck Surgery (AAO-HNS) guidelines) and consistency (changes in recommendations based on alterations in the question).
On 19 May 2023, inputs in the form of questions were entered into ChatGPT and an online search engine (Google) as shown in Table 1. Two independent otolaryngologists tested the validity of the responses against the AAO-HNS recommendations, and inter-rater reliability was assessed using Cohen's kappa. Descriptive statistics were summarised with means and standard deviations for continuous variables. The Mann–Whitney U test was used to assess whether the distribution of each continuous variable (Flesch Reading Ease score, word count, time taken (seconds) to generate a response, and validity score of 0–3 based on comparison with recommendations from the AAO-HNS) differed significantly between the responses generated by ChatGPT and the top web pages recommended by the search engine. A p value of less than 0.05 was considered statistically significant.
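The group comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' analysis code: it assumes a two-sided test using the normal approximation with a tie correction (the text does not say whether an exact or approximate p value was used), the function names are hypothetical, and the example scores are invented for demonstration.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected).

    Returns (U, p), where U is the smaller of the two U statistics.
    """
    n1, n2 = len(x), len(y)
    combined = list(x) + list(y)
    ranks = average_ranks(combined)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)

    # Variance of U under the null hypothesis, reduced for tied values.
    n = n1 + n2
    tie_counts = {}
    for v in combined:
        tie_counts[v] = tie_counts.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in tie_counts.values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:
        return u, 1.0  # every value is identical, so there is no evidence of a difference

    z = (u - n1 * n2 / 2) / math.sqrt(variance)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return u, p


# Hypothetical Flesch Reading Ease scores, NOT the study data.
chatgpt_scores = [41.0, 52.3, 38.9, 47.5, 55.1, 36.2, 49.8, 43.0]
web_scores = [58.4, 61.0, 44.2, 70.3, 49.0, 66.7, 53.5, 62.1]
u_stat, p_value = mann_whitney_u(chatgpt_scores, web_scores)
print(f"U = {u_stat}, p = {p_value:.3f}, significant = {p_value < 0.05}")
```

With only eight observations per group, an exact p value may differ somewhat from this approximation, so the sketch is best read as an explanation of the method rather than a way to reproduce the reported figures.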
ChatGPT = Chat Generative Pre-trained Transformer
This study was exempt from review by the Connecticut Children's Medical Center Institutional Review Board because it does not constitute human subject research.
Results and analysis
Outputs from ChatGPT and the top web page recommended by the search engine were obtained on 19 May 2023.
Readability was characterised by two measures: the Flesch Reading Ease score and word count. The mean Flesch Reading Ease score for ChatGPT was 44.9 (college level), with a standard deviation of 8.05. The mean Flesch Reading Ease score for the internet-generated sources was 57.55 (10th- to 12th-grade or high school level), with a standard deviation of 10.46. ChatGPT had a significantly lower Flesch Reading Ease score, indicating more difficult text, than the internet sources (Table 2). ChatGPT also generated significantly fewer words (Table 2).
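For readers unfamiliar with the measure, the Flesch Reading Ease score is computed as 206.835 − 1.015 × (words per sentence) − 84.6 × (syllables per word), so lower scores indicate harder text. The sketch below shows the calculation alongside a word count. It is illustrative only: the text does not state which tool was used to score the responses, and the syllable counter here is a rough vowel-group heuristic (an assumption), so its output will not exactly match dedicated readability software.

```python
import re


def count_syllables(word):
    """Rough heuristic: count vowel groups, dropping a likely silent final 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def split_words(text):
    """Return alphabetic words, keeping internal apostrophes (e.g. child's)."""
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)


def flesch_reading_ease(text):
    """Flesch Reading Ease: higher scores mean easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = split_words(text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


# Hypothetical sample text, not a response from the study.
sample = "Snoring is common in children. See a doctor if your child stops breathing at night."
print(len(split_words(sample)), "words; score", round(flesch_reading_ease(sample), 1))
```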
Expediency was measured as the time taken for ChatGPT to generate a response and the time taken to reach the top webpage recommended by the internet search engine. ChatGPT was more expeditious in generating a response (Table 3).
Validity was measured by comparison of responses to guidelines from the AAO-HNS (Table 1). The guidelines were analysed for key components. Three key components were determined for each condition, with one point assigned for each component, such that a score of 3 suggested complete validity. Two independent otolaryngologists generated responses from ChatGPT for each condition and assigned validity scores for both the ChatGPT responses and the search engine recommended web pages.
For the topics ‘snoring in children’ and ‘sleep apnoea in children’, the following components were considered necessary for full validity: (1) an accurate definition of obstructive sleep apnoea, (2) an accurate list of symptoms and causes for concern, and (3) validated treatments and a recommendation to see a provider. For ‘treatment of sleep apnoea’, the following three components were deemed necessary for full validity: (1) an accurate explanation of surgical treatments, (2) an accurate explanation of medical treatments, and (3) a recommendation to see a provider. For ‘ear wax in children’, the following three components were considered necessary for full validity: (1) an accurate definition of ear wax and/or cerumen, (2) a recommendation to see a provider, and (3) a warning against home remedies.
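The scoring scheme above amounts to a three-item checklist per topic. As a minimal sketch, it could be represented as follows. The component wording is transcribed from the text, but the data structure and the `validity_score` function name are hypothetical, and the judgement of whether a component is present remains with the human rater; the code only tallies points.

```python
# Components shared by the 'snoring' and 'sleep apnoea' topics, as listed in the text.
OSA_COMPONENTS = (
    "accurate definition of obstructive sleep apnoea",
    "accurate list of symptoms and causes for concern",
    "validated treatments and a recommendation to see a provider",
)

# Three components per topic, one point each (maximum score of 3).
RUBRIC = {
    "snoring in children": OSA_COMPONENTS,
    "sleep apnoea in children": OSA_COMPONENTS,
    "treatment of sleep apnoea": (
        "accurate explanation of surgical treatments",
        "accurate explanation of medical treatments",
        "recommendation to see a provider",
    ),
    "ear wax in children": (
        "accurate definition of ear wax and/or cerumen",
        "recommendation to see a provider",
        "warning against home remedies",
    ),
}


def validity_score(topic, components_present):
    """Count the rubric components a rater judged to be present (0 to 3)."""
    rubric = RUBRIC[topic]
    unknown = set(components_present) - set(rubric)
    if unknown:
        raise ValueError(f"not in rubric for {topic!r}: {sorted(unknown)}")
    return sum(1 for component in rubric if component in components_present)


# Hypothetical rating: the response defined ear wax and advised seeing a
# provider but gave no warning against home remedies.
rating = {
    "accurate definition of ear wax and/or cerumen",
    "recommendation to see a provider",
}
print(validity_score("ear wax in children", rating))
```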
The mean validity score for ChatGPT was 2.75, with a standard deviation of 0.45, and the mean validity score for the internet sources was 3, with a standard deviation of 0. There was no statistically significant difference between the validity of the responses (p = 0.234); both sources scored close to the maximum of 3. Inter-rater reliability was measured using Cohen's kappa. The two raters agreed on 95.83 per cent of scores, with moderate agreement by kappa (Cohen's kappa = 0.48), meaning that there was general agreement between the raters on the validity of the ChatGPT responses and the web pages.
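A high percentage agreement alongside a more modest kappa is expected when almost all scores fall in a single category, because kappa discounts the agreement that would occur by chance. The sketch below illustrates this using the standard definition, kappa = (observed agreement − expected agreement) / (1 − expected agreement). The ratings are hypothetical and are not intended to reproduce the study's value of 0.48; the function name is likewise an assumption.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters scoring the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of items")
    n = len(rater_a)

    # Proportion of items on which the raters gave the same score.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Agreement expected by chance, from each rater's own score frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1:
        return 1.0  # both raters used one identical score throughout
    return (observed - expected) / (1 - expected)


# Hypothetical validity scores for 24 items, NOT the study data.
# The raters disagree on a single item (index 22).
rater_a = [3] * 23 + [2]
rater_b = [3] * 22 + [2, 2]
agreement = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)
print(f"agreement = {agreement:.2%}, kappa = {cohens_kappa(rater_a, rater_b):.2f}")
```

In this invented example the raters agree on 23 of 24 items, yet kappa is well below 1 because both raters assigned the top score almost every time.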
To assess the consistency between the responses, the input into ChatGPT was varied slightly for each topic, as shown in Table 1. The validity scores were then compared between the initial question and the varied question. The mean validity score for the initial question was 2.875, with a standard deviation of 0.354, and the mean validity score for the varied question was 2.625, with a standard deviation of 0.518. There was no significant difference in the validity of responses generated for the slightly varied questions, indicating consistency in responses (p = 0.430).
Discussion
This study examined the utility of ChatGPT as a resource for patients and their families. In comparison with recommendations from the AAO-HNS, ChatGPT responses demonstrated validity on a par with the top recommended webpages on the internet. The integration of large language models, such as ChatGPT, has elevated the role of AI in disseminating healthcare information. These findings support the notion that ChatGPT can serve as a reliable patient resource.
Not only did the ChatGPT responses compare favourably to internet material, but they also consistently aligned with accredited recommendations from the AAO-HNS. The use of ChatGPT as a patient resource is substantiated by existing literature, but it is not without limitations. While ChatGPT's post-operative instructions for specific procedures have been found to be equivalent to institutional recommendations, they were found to be less understandable and actionable.9 Hence, it is crucial to emphasise that ChatGPT should not be used as a replacement for a physician's guidance.
ChatGPT holds promise as a source of information for patients, provided it is used judiciously. It has demonstrated its ability to exercise clinical judgment and offer medical diagnoses and treatment plans when presented with clinical vignettes incorporating medical jargon, relevant history, physical examinations and diagnostic findings.8 These capabilities have been observed to yield highly accurate differential diagnoses and reasonable treatment plans.8
In terms of accessibility and timeliness, this study affirms that ChatGPT is an accessible and user-friendly platform. Its ability to generate concise yet valid responses is advantageous for patients and families. However, ChatGPT's responses were written at a significantly more difficult reading level than the top recommended internet materials, potentially limiting their accessibility for individuals who have not pursued higher education. This is a notable limitation.
Although large language models such as ChatGPT have shown promise as a patient resource, there are limitations. ChatGPT generates responses based on patterns learned from extensive datasets, which are only up to date as of September 2021 at this time. Consequently, there is a risk of ChatGPT providing outdated information or not reflecting the latest recommendations. It is important to emphasise that during this study, every response from ChatGPT recommended consulting a healthcare professional. Similar practices have been observed in other healthcare studies involving ChatGPT.10
It is evident from current literature that ChatGPT serves as a valuable tool in medicine but that it should not replace the expertise and clinical judgment of medical professionals. For example, the quality of responses from ChatGPT was inferior to that of a second-year resident in terms of both accuracy and completeness when responding to clinical questions and scenarios in the subspecialty of head and neck surgery.10
Ethical considerations also come into play. ChatGPT has demonstrated bias in previous studies, potentially perpetuating stereotypes and misinformation, and should be used with caution.5–7 User privacy is also a concern, especially as the model incorporates prior questions into future responses and can process sensitive healthcare information, including personal details and medical records, if entered by patients into the chatbox.11
As large language models and AI continue to evolve, particularly in the field of medicine, it becomes imperative to establish guidelines and quality control measures for AI-driven healthcare. One of the medicolegal implications that requires attention is accountability in cases where incorrect information or recommendations lead to patient harm.12
This study affirms that ChatGPT is a valid resource for patients, demonstrating comparability with the top internet-recommended sources and AAO-HNS guidelines in the ENT areas of snoring in children, sleep apnoea in children, treatment of sleep apnoea and ear wax impaction. However, there are several limitations of this study. The investigation focused on only four highly specific topics within the field of paediatric otolaryngology, limiting the generalisability even within the field of ENT. Moreover, the questions posed to ChatGPT were straightforward, mirroring the types of questions patients and families are likely to ask. Future research should explore more complex, high-level inquiries to better assess validity, but for the purpose of this study, basic questions were chosen to test the utility of ChatGPT as a patient resource. Only 16 outputs were analysed, considerably limiting the power of this study. Further studies that analyse a larger collection of responses are needed to validate this resource.
• Chat Generative Pre-trained Transformer (ChatGPT) has been shown to be an effective patient resource
• ChatGPT delivers concise, quick and valid responses to commonly asked patient questions in paediatric ENT, but responses are generated at a higher reading level than that found in online resources
• ChatGPT is an accessible and user-friendly platform that can provide tailored responses to simple questions posed by patients and families
Conclusion
This study represents one of the first efforts to assess the validity of ChatGPT as a resource for patient information in otolaryngology. It highlights the potential of AI integration in healthcare to streamline information delivery and provide tailored, prompt responses to patients and families. While AI, such as ChatGPT, has yet to fully replicate the clinical expertise, judgment and skill of trained physicians, it is making significant strides in the field of medicine. This progress invites critical examination of ethical, medicolegal and scientific aspects of this resource.
Competing interests
None declared