Introduction
The recent public release of novel conversational bots powered by artificial intelligence (AI) algorithms has resulted in rapid and continued growth of academic interest and has ignited wide debate concerning the possible impact of these tools on society and research. 1,2 These cutting-edge chatbots are built on an AI technology called large language models (LLMs). LLMs are trained on massive amounts of text data to produce new, fluent, human-like text in response to user input by repeatedly predicting the next word in a sentence based on the preceding words. 3 Through the LLM, the chatbots offer unprecedented opportunities to handle a wide range of natural language processing tasks, including text writing, content summarization, and question answering.
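For readers unfamiliar with this generation mechanism, the following toy sketch illustrates the repeated next-word prediction loop in a vastly simplified form; it is not drawn from any of the chatbots studied here, and the bigram table, words, and probabilities are invented purely for illustration.

```python
# Illustrative only: a toy bigram "language model" that repeatedly picks the
# most probable next word given the previous word, mimicking in a vastly
# simplified way the generation loop described above (real LLMs condition on
# the whole preceding context, not just one word).
bigram_probs = {
    "check": {"breathing": 0.7, "pulse": 0.3},
    "breathing": {"and": 0.6, "now": 0.4},
    "and": {"call": 1.0},
    "call": {"help": 1.0},
}

def generate(first_word: str, max_words: int = 5) -> str:
    words = [first_word]
    while len(words) < max_words:
        options = bigram_probs.get(words[-1])
        if not options:
            break
        # Greedy decoding: always take the highest-probability next word.
        words.append(max(options, key=options.get))
    return " ".join(words)

print(generate("check"))  # "check breathing and call help"
```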
Except for several exploratory studies, 4–9 the LLM-based chatbots have not yet been evaluated with regard to their potential application in emergency medicine. In resuscitation research and practice, where implementation of contemporary digital technologies is encouraged, 10,11 it appears important and timely to examine the practicability of using LLM-powered chatbots in two directions: (1) to generate guideline-consistent advice on helping a person in cardiac arrest (for public resuscitation education or for just-in-time informational support of untrained lay rescuers in a real-life emergency), and thereby contribute to promoting community response to out-of-hospital cardiac arrest; and (2) to evaluate the quality of information on resuscitation available online (which is known to be generally low 12–14) and to suggest how the content could be improved. The latter could help establish systematic quality surveillance and assurance for publicly available resources on resuscitation and reduce potential harm from misinformation.
Accordingly, this study was undertaken to assess the quality of advice on how to help a non-breathing victim generated by two prominent LLM-powered chatbots, as well as to test the chatbots’ ability to rate their own advice and to improve the quality of the content.
Methods
Study Design
This was a cross-sectional, analytical study based on data from openly accessible online services. The study design was informed by previous related research. 6,15 The chatbots were interrogated in English using the Microsoft Edge web browser (Microsoft Corporation; Redmond, Washington USA) for the new Bing and the Google Chrome web browser (Google LLC; Mountain View, California USA) for Bard, on a personal computer running Apple macOS Big Sur (Apple Inc.; Cupertino, California USA). In the chatbots’ settings, the search region was set to the United Kingdom (UK), and a Virtual Private Network (VPN) with the location set to London was used to simulate searching from this country. To avoid any impact of previous user activity on the chatbots’ responses, all browsing history, download history, search history, cache, and cookies were cleared from the browsers and from the Microsoft and Google accounts before each search query. For Bing, searches were made using the “More Precise” conversation style.
In May 2023, each chatbot was queried 20 times with the following sequence of prompts: (1) “What to do if someone is not breathing?”; (2) to rate the content of the chatbot’s own response to the first query for compliance with the Resuscitation Council UK (London, England) Guidelines on a 10-point scale (one being very low compliance, ten being very high compliance); (3) to indicate whether the response contains any guideline-noncompliant instructions; and (4) to correct the response to make it fully compliant with the guidelines (Appendix Table A shows the literal prompts; available online only). Original and self-corrected chatbot responses containing instructions on helping a non-breathing victim were tabulated and independently assessed manually by the authors for compliance with the 2021 Resuscitation Council UK Guidelines on adult Basic Life Support 16 using an author-developed checklist (Dataset 17). For each checklist item, congruence of the chatbot-generated instructions with the guidelines was rated as True (the checklist item wording was satisfied completely), Partially True (the wording was satisfied in part), or Not True (the corresponding instruction was missing from the chatbot response). The ratings provided by the two authors were compared, and discrepancies were resolved by consensus. When a chatbot provided links to source web articles, the articles’ content was evaluated using the same methodology. The authors also independently rated the original chatbot responses for compliance with the guidelines on the 10-point scale, and the median expert rating was calculated.
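For illustration, a minimal sketch of how the checklist-based compliance summary and the median expert rating could be computed is given below; the checklist items, ratings, and variable names are hypothetical and are not taken from the study dataset.

```python
from statistics import median

# Hypothetical ratings: one dict per chatbot response, mapping checklist items
# to "True", "Partially True", or "Not True" (items and values invented here).
responses = [
    {"calls_ems": "True", "starts_compressions": "Partially True", "requests_aed": "Not True"},
    {"calls_ems": "True", "starts_compressions": "Not True", "requests_aed": "Not True"},
]

# Percentage of checklist items rated "True" (complete compliance) per response,
# then averaged across responses.
per_response = [100 * sum(v == "True" for v in r.values()) / len(r) for r in responses]
mean_complete = sum(per_response) / len(per_response)

# Hypothetical 10-point expert ratings for the same responses.
expert_ratings = [4, 3]

print(f"Mean % of checklist items fully satisfied: {mean_complete:.1f}")
print(f"Median expert rating: {median(expert_ratings)}")
```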
Additionally, original and self-corrected chatbot responses were evaluated for length (number of sentences) and checked for readability based on the Flesch-Kincaid Grade Level (FKGL) 18 metric using an open online readability analyzer, Datayze. 19 The FKGL formula uses the average number of syllables per word and the average number of words per sentence to estimate how easy a passage of English text is to read and understand. 18 The FKGL value corresponds to a United States grade level of education; lower values indicate greater readability.
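As an illustration of the metric, the following sketch computes the FKGL from its published formula; the syllable counter here is a crude vowel-group heuristic rather than the dictionary-based counting a dedicated analyzer such as Datayze presumably applies, so the result should be treated as approximate.

```python
import re

def count_syllables(word: str) -> int:
    # Crude heuristic: count groups of consecutive vowels (at least one per word).
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fkgl(text: str) -> float:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    # Flesch-Kincaid Grade Level formula (Kincaid et al., 1975).
    return 0.39 * (len(words) / len(sentences)) + 11.8 * (syllables / len(words)) - 15.59

sample = "Call for help. Push hard and fast in the centre of the chest."
print(round(fkgl(sample), 1))
```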
The New Bing
The new Bing is an AI-powered web search engine released to the public by Microsoft Corporation in February 2023. The chatbot functionality of the new Bing allows users to perform web searches in a conversational way. It searches for relevant content across the web and consolidates what it finds to generate a summarized answer using an LLM from OpenAI (San Francisco, California USA) known as Generative Pre-Trained Transformer 4 (GPT-4). 20 Bing centers its response to a user’s query on high-ranking content from the web, ranking the content by weighing a set of features including relevance, quality and credibility, and freshness. 21 To determine the quality and credibility of a website, it evaluates the clarity of the site’s purpose, its usability, presentation, and authoritativeness; the latter includes factors such as the author’s or site’s reputation, the completeness of the content, and the transparency of authorship. Websites containing citations and references to data sources are considered to be of higher quality. Bing accompanies its responses with links to the search results that were used to ground the response.
Bard
Bard is an AI chatbot launched by Google LLC in March 2023. Similar to the new Bing, it retrieves information from the internet to respond to user inquiries. To produce its responses, Bard utilizes Google’s conversational AI language model, the Language Model for Dialogue Applications (LaMDA). 22 The mechanism by which Bard ranks its web search results to generate answers is undisclosed. Unlike the new Bing, Bard does not routinely cite the sources of information for its responses. 23
The study results were analyzed descriptively. The Mann-Whitney U test and the Wilcoxon signed-rank test were used to test for differences.
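A minimal sketch of how such comparisons could be run (using SciPy and invented example data; the pairing of each test with a particular outcome variable here is illustrative only) is:

```python
from scipy.stats import mannwhitneyu, wilcoxon

# Invented example data for illustration only.
bing_lengths = [5, 6, 5, 7, 6]      # independent samples (Bing vs Bard)
bard_lengths = [12, 14, 11, 13, 15]
self_ratings = [7, 7, 7, 8, 7]      # paired samples (self-rating vs expert rating)
expert_ratings = [4, 3, 4, 2, 5]

# Mann-Whitney U test for comparisons between the two chatbots.
u_stat, u_p = mannwhitneyu(bing_lengths, bard_lengths)

# Wilcoxon signed-rank test for paired comparisons within a chatbot.
w_stat, w_p = wilcoxon(self_ratings, expert_ratings)

print(f"Mann-Whitney U: p = {u_p:.3f}; Wilcoxon signed-rank: p = {w_p:.3f}")
```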
All data that support the findings of this study are openly available in the Mendeley Data repository. 17
Because the study did not involve human participants, it did not require ethical approval.
Results
Both chatbots comprehended all user queries and provided context-consistent textual responses.
Bing’s responses were considerably shorter than Bard’s (Table 1). Readability was higher for Bard’s responses, which required approximately a sixth-grade level of education to understand, compared with a seventh- to eighth-grade level for Bing.
The original chatbot responses showed poor coverage of the guideline-consistent instructions on helping a non-breathing victim (Table 2). Essential elements of bystander action, including ensuring safety, requesting and using an automated external defibrillator (AED), and early initiation and uninterrupted performance of chest compressions following the recommended technique, were for the most part omitted. The mean percentage of the chatbots’ responses completely satisfying the checklist criteria was 9.5% for Bing and 11.4% for Bard (P >.050).
The chatbots over-estimated the quality of their responses in terms of compliance with the resuscitation guidelines. The median (interquartile range) self-rating of the original responses amounted to 7.0 (7.0–7.0) points for Bing and 9.0 (9.0–9.0) points for Bard, whereas the expert ratings were significantly lower (P <.001), at 4.0 (2.0–4.5) and 3.0 (2.6–4.0) points, respectively.
Bing’s original responses were more accurate with respect to suggesting the search-region-specific Emergency Medical Services (EMS) telephone number. Bing recommended calling the UK national emergency number 9-9-9 in 95.0% (n = 19) of cases, whereas Bard always advised calling the United States national emergency number 9-1-1 or a local (unspecified) emergency number.
When asked whether their responses contained any guideline-inconsistent instructions, both chatbots denied this on all occasions. However, the manual assessment revealed that all Bing and Bard responses included superfluous instructions that either were inappropriate for an untrained lay rescuer or contradicted current resuscitation guidelines (Table 3). Whereas for Bing the excessive instructions were limited to an unnecessary breathing check and a suggestion to give rescue breaths, Bard in 55.0% of responses (n = 11) presented one or more seemingly plausible but factually incorrect and often potentially harmful statements, representing the phenomenon of “artificial hallucination.” 24
As for the sources of information cited in the chatbots’ responses, Bing on all occasions cited the same two web articles, which demonstrated incomplete adherence to the resuscitation guidelines and omitted important aspects of the life-saving approach (the percentages of checklist items completely or partially satisfied by the content of these web articles were 36.4% and 72.7%; Dataset 17). Bard did not cite any sources for its responses.
In reply to the request to correct the original responses to ensure full compliance with the guidelines and applicability of the cardiopulmonary resuscitation (CPR) instructions for untrained rescuers only, both chatbots made adjustments to their responses. Despite some enhancement, the quality of the responses did not improve significantly (Table 4). The mean percentage of the chatbots’ responses having complete compliance with the checklist criteria remained low (14.5% for Bing and 24.1% for Bard; P >.050), and superfluous guideline-inconsistent instructions on many occasions remained in place (Table 3). Bard improved the accuracy of its suggested search-region-specific EMS number: the UK emergency number 9-9-9 was recommended in 80.0% (n = 16) of its self-corrected responses (versus 95.0%, n = 19, for Bing).
Discussion
Although the innovative AI-powered question-answering systems appear to offer a promising opportunity to engage lay people in the provision of help and to improve health outcomes in emergencies, there are few published data on the effectiveness of such systems. Previous studies tested the capabilities of voice-based conversational digital assistants (Alexa [Amazon; Seattle, Washington USA], Cortana [Microsoft Corporation; Redmond, Washington USA], Google Assistant [Google LLC; Mountain View, California USA], and Siri [Apple Inc.; Cupertino, California USA]) 25,26 and of the Google web search engine’s question-answering system 15 in responding to inquiries related to first aid in a range of emergency conditions. The studies showed that the AI assistants frequently failed to recommend how to give help, or suggested inappropriate actions that could have resulted in harm to a victim. Such poor performance was in particular explained by limitations of the search engine’s AI algorithms, which appear to generate responses as literal quotations automatically extracted from the search-engine-indexed webpage that most closely resembles the user’s query. 15
The current research focused on evaluating the performance of two flagship LLM-powered chatbots, Bing and Bard, which take a fundamentally new approach to question answering. Instead of offering quotations, as conventional search engine question-answering systems do, the LLM chatbots search for information online, rank it, and use a neural network to generate summarized responses based on the high-ranking content. 21,22
The study found that both chatbots always correctly recognized the user inquiries and provided easily comprehensible responses containing some advice on how to help a non-breathing victim. However, the quality of the responses’ content in terms of compliance with the resuscitation guidelines was low. Both Bing and Bard omitted essential elements of the life-saving help in all responses. In fact, the mean percentage of the chatbots’ responses completely satisfying the guidelines-based checklist criteria was less than 10% for Bing and less than 12% for Bard. For instance, the chatbots never suggested requesting an AED, beginning chest compressions as early as possible, or performing compressions with minimal interruptions. Where guideline-consistent instructions were given, the chatbots usually did not provide sufficient detail on the life-saving technique. In particular, important characteristics of chest compressions, including compression depth and rate, as well as the need to release pressure on the chest after each compression, were typically missing. A lack of sufficient detail in LLM-powered chatbots’ responses to user inquiries about help in emergencies, although much less prominent than in the current study, was reported in previous related research. 6,7
In addition, the chatbots’ responses commonly included directions that were guideline-compliant but inappropriate for an untrained rescuer (eg, advice to give rescue breaths), or contained AI hallucinations: incorrect and nonsensical guidance that poses a risk of harm, since it may sound believable to an unfamiliar user. All of the hallucinations were generated by Bard. These findings contrast with the results of previous exploratory studies 6,7 which reported that LLM-based chatbots (Bing and ChatGPT [OpenAI; San Francisco, California USA]) did not instruct users to perform harmful actions in a range of health emergencies.
Further, this study showed that the chatbots substantially over-estimated the quality of their advice on help for a non-breathing victim in terms of compliance with the resuscitation guidelines. Also, when asked to enhance the responses’ content to make the advice fully guideline-concordant and applicable to an untrained rescuer, the chatbots corrected their responses, but the improvement was negligible and the quality of the instructions remained low. Potentially harmful guideline-inconsistent advice and instructions inappropriate for an untrained bystander were mostly kept in place.
Taken together, these observations indicate that currently neither Bing nor Bard should be considered a source of reliable guideline-consistent information on resuscitation, and that the chatbots cannot be used to detect quality flaws in, or to enhance the quality of, such information. Moreover, the artificial hallucinations generated by Bard may sound convincing to an inexpert user and therefore create an apparent risk of harm should the user act on the chatbot’s advice.
Although the developers of Bing and Bard disclaim responsibility by stating that the chatbots can make mistakes and provide incomplete, inaccurate, or inappropriate responses, 22,27 a large portion of users may neglect these disclaimers, while the ever-increasing popularity of LLM-powered chatbots, along with their integration into search engines and mobile devices, will probably greatly intensify public use of these tools as an everyday source of informational support, including in real-life health emergencies. This underscores the need, on the one hand, to raise laypeople’s awareness of the potential risks of relying on chatbot advice in health crises instead of seeking professional help, and, on the other hand, to develop regulatory procedures aimed at eliminating potential harm from chatbot-generated misinformation by replacing uncontrolled LLM-mediated answering of health-related questions with reliable, human expert-developed advice. Both tasks will require commitment and close collaboration between the AI chatbot developers and recognized public health organizations.
Limitations
This study has limitations. Both tested chatbots currently run as pilot versions, and their performance could change as the question-answering AI algorithms evolve. A repeated investigation carried out at a later point in time, or with different search queries, languages, or search regions, may produce different results. Reproducibility of the findings is further limited by the dynamic nature of the internet content the chatbots use as a source of information.
Conclusions
The LLM-powered chatbots readily responded to user inquiries for advice on helping a non-breathing victim, generating clearly understandable, summarized answers containing instructions on resuscitation. However, the responses consistently omitted essential details of the life-saving technique and occasionally contained deceptive, nonsensical directives that create a risk of inadequate care and harm to the victim. The chatbots over-estimated the quality of their responses and were unable to improve their advice to achieve congruence with current resuscitation guidelines. Along with further research aimed at better understanding the possible uses of LLM-based chatbots in emergency medicine, regulatory actions are required to mitigate the risks related to AI-generated misinformation.
Conflicts of interest
A.A.B. and A.G. have no conflicts of interest.
Supplementary Materials
To view supplementary material for this article, please visit https://doi.org/10.1017/S1049023X23006568