Background
Artificial intelligence (AI) is defined as the development of computer systems capable of performing actions that usually require human intelligence. AI is rapidly evolving, with applications expanding across numerous fields, including healthcare. Reference Howard, Hope and Gerada1–Reference Alowais, Alghamdi and Alsuhebany3 Infectious disease (ID) specialists, frequently consulted for guidance on appropriate antimicrobial therapy, provide critical recommendations tailored to specific clinical scenarios. However, there is a national shortage of ID specialists, leaving many institutions without access to their expertise. This results in substantial variability in ID care across healthcare settings, with rural communities being particularly negatively impacted. ID specialists are essential to lead nationally mandated antibiotic stewardship programs, yet according to recent data from the CDC, fewer than 50% of programs have an ID-trained physician leader. Reference Walensky, McQuillen, Shahbazi and Goodson4
This gap presents an opportunity for AI-based tools, such as large language models (LLMs) and machine learning algorithms, to enhance clinical decision-making. Reference Andrikyan, Sametinger and Kosfeld5 These technologies are capable of generating rapid responses to clinical queries, offering support in diagnosing conditions and suggesting treatment plans, thereby improving the efficiency of medical practice. Reference O’Leary, Neuhauser, McLees, Paek, Tappe and Srinivasan6–Reference Lo8 Machine learning (ML) holds the potential to revolutionize healthcare by analyzing large datasets to identify patterns and insights that may elude human observation. Reference Alowais, Alghamdi and Alsuhebany3 A recent study highlighted that evaluators preferred responses from ChatGPT to those provided by physicians in addressing patient queries, underscoring AI’s ability to deliver high-quality and empathetic responses. Reference Ayers, Poliak and Dredze9 Additionally, ChatGPT has shown promise by accurately answering medicine-related multiple-choice and single-choice questions in standardized assessments. Reference Hoch, Wollenberg and Lüers2
Despite these advances, concerns remain regarding the variability in performance among AI models, particularly when addressing complex clinical cases such as those that often comprise the bulk of ID physicians’ practice. This highlights the need for rigorous evaluation of AI tools to ascertain their accuracy, reliability, and feasibility for integration into routine medical practice. Reference Ayers, Poliak and Dredze9,Reference Schinkel, Boerman and Bennis10 In this study, we evaluated the performance of ML and LLMs in recommending appropriate antibiotic therapy for various infectious diseases. By comparing these recommendations to those provided by standard care practices, we sought to assess the accuracy, limitations, and clinical applicability of these tools to better understand the potential role and safety of AI in antimicrobial prescribing.
Methods
Systematic literature review and inclusion and exclusion criteria
This review was conducted according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement. Reference Moher, Liberati, Tetzlaff and Altman11 This study was registered on PROSPERO (https://www.crd.york.ac.uk/PROSPERO/) on September 26, 2024 (CRD42024594704). Institutional Review Board approval was not required for this work. The review included manuscripts published from the inception of each database through October 25, 2024, without language restrictions. Eligible studies met the following inclusion criteria: original research articles published in peer-reviewed journals, conducted in healthcare settings, and evaluating the use of AI, ML, or clinical decision support systems (CDSS) in managing infectious diseases. ML models, such as random forests and gradient-boosted decision trees, were used to predict antimicrobial resistance and optimize antibiotic selection by analyzing large datasets to uncover actionable patterns. LLMs (eg, ChatGPT) are designed to interpret and generate human-like text. Exclusion criteria included case reports, commentaries, pilot studies, and studies focusing solely on diagnosis without addressing treatment.
Search strategy
We performed a comprehensive search of the literature in PubMed, CINAHL, Scopus, Web of Science, Embase, and Google Scholar in collaboration with an experienced health sciences librarian (B.H.) (Supplementary Appendix Table 1). Reference lists of retrieved articles were reviewed using Covidence software 12 to identify additional relevant studies. Two investigators (S.M.A. and A.R.M.) independently screened titles and abstracts, applying the inclusion criteria to exclude irrelevant studies. Discrepancies were resolved by consensus. This systematic review was guided by the PICO framework Reference Eriksen and Frandsen13, focusing on patients with infectious diseases (P), interventions using AI-based management (I), comparisons with standard management provided by usual care providers (C), and primary outcomes (O) including the accuracy, efficacy, and limitations of AI in antimicrobial management.
Data abstraction and quality assessment
Of the twelve independent reviewers (A.R.M., D.F., J.I.R., M.A., M.K.H., M.A.SH., N.A.B., N.O.M., P.D., P.S.M., S.TH., T.K.), two independently abstracted data from each included study using a standardized data collection form (supplementary appendix). Extracted data included study design, publication year, calendar period, AI methodology, and comparisons with usual care providers where applicable. Reviewers also documented sensitivity and specificity of AI models, clinical impact, advantages, and limitations.
Risk of bias was assessed using a modified version of the Downs and Black scale. Reference Downs and Black14 The scale, with a maximum possible score of 28, evaluates quality across domains including reporting, internal validity, and external validity. The reviewers independently scored each study, resolving discrepancies through consensus.
Results
Characteristics of included studies
A total of 1,578 articles were retrieved. After applying exclusion criteria, 154 studies were reviewed in full, of which 23 met the inclusion criteria Reference İlhanlı, Park, Kim, Ryu, Yardımcı and Yoon15–Reference Gohil, Septimus and Kleinman37 and were included in the final analysis (Figure 1). These included eleven cohort studies (nine retrospective and two prospective), three qualitative studies, two cross-sectional studies, two quasi-experimental studies, and five randomized controlled studies (Table 1). Of these, 17 studies focused on AI applied as ML algorithms integrated into clinical decision support systems (CDSS) to enhance clinical outcomes Reference İlhanlı, Park, Kim, Ryu, Yardımcı and Yoon15–Reference Williams, Martin and Nian28,Reference Gohil, Septimus and Kleinman36,Reference Gohil, Septimus and Kleinman37, while the remaining six studies evaluated various LLMs Reference Draschl, Hauer and Fischerauer29–Reference Fisch, Kliem, Grzonka and Sutter34, including ChatGPT Reference Draschl, Hauer and Fischerauer29–Reference Maillard, Micheli and Lefevre33.
Geographically, six studies were conducted in the United States Reference MacFadden, Coburn and Shah21,Reference Ghamrawi, Kantorovich and Bauer23,Reference Ciarkowski, Timbrook and Kukhareva24,Reference Williams, Martin and Nian28,Reference Gohil, Septimus and Kleinman36,Reference Gohil, Septimus and Kleinman37, three in South Korea Reference İlhanlı, Park, Kim, Ryu, Yardımcı and Yoon15,Reference Lee, Seo and Kim18,Reference Tran-The, Heo and Lim22, and one each in Austria Reference Draschl, Hauer and Fischerauer29, Australia Reference Wu, McLeod and Blyth35, Cambodia Reference Oonsivilai, Mo and Luangasanatip20, Canada Reference MacFadden, Coburn and Shah21, China Reference Tao, Liu, Cui, Wang, Peng and Nahata32, France Reference Maillard, Micheli and Lefevre33, Germany Reference Neugebauer, Ebert and Vogelmann19, Israel Reference Lewin-Epstein, Baruch, Hadany, Stein and Obolski16, Italy Reference De Vito, Geremia and Marino30, Tanzania Reference Tan, Kavishe and Luwanda27, Turkey Reference Cakir, Caglar and Sekkeli31, the Netherlands Reference Herter, Khuc and Cinà26, Switzerland Reference Fisch, Kliem, Grzonka and Sutter34, the United Kingdom Reference Bolton, Wilson, Gilchrist, Georgiou, Holmes and Rawson25, and Vietnam. Reference Tran Quoc, Nguyen Thi Ngoc and Nguyen Hoang17 The studies were conducted between 2017 and 2024, with durations ranging from two weeks to ten years. The studies examined AI in two domains (Table 2). The first included seventeen studies exploring the integration of AI into CDSS, focusing on antimicrobial resistance prediction, the appropriateness of antibiotic prescriptions, antimicrobial stewardship, and the transition from intravenous to oral antibiotic therapy. The second domain involved six studies that evaluated the performance of LLMs in addressing infectious disease management, highlighting both successes and limitations across a range of conditions (Table 2).

Figure 1. Literature search for articles that evaluated the performance and effectiveness of artificial intelligence or machine learning in recommending appropriate antibiotics for various infectious diseases.
Table 1. Characteristics of included studies (N = 23)

Table 2. Summary of characteristics of studies included in the systematic review

AI, artificial intelligence; AMR, antimicrobial resistance; ASP, antimicrobial stewardship; AUC, area under the curve; BSI, bloodstream infection; CAP, community-acquired pneumonia; CDSS, clinical decision support systems; CPOE, computerized provider order entry; EHR, electronic health records; GBDT, gradient-boosted decision tree; IV, intravenous; LLM, large language model; ML, machine learning; PJI, periprosthetic joint infection; RCT, randomized clinical trial; UTI, urinary tract infection.
AI in clinical decision support systems: antimicrobial stewardship (ASP)
Five studies demonstrated substantial potential of AI in enhancing antimicrobial stewardship. Reference Tran-The, Heo and Lim22–Reference Ciarkowski, Timbrook and Kukhareva24,Reference Gohil, Septimus and Kleinman36,Reference Gohil, Septimus and Kleinman37 One study found that ML models identified 60% of cases for antibiotic discontinuation, compared to 19% in usual care, with a 98% success rate in transitioning to oral antibiotics. Reference Tran-The, Heo and Lim22 Another study highlighted that AI systems shortened time to antibiotic de-escalation by 24 hours. Reference Ghamrawi, Kantorovich and Bauer23 A third study showed that CDSS combined with active ASP achieved better antibiotic optimization for community-acquired pneumonia (CAP) compared to the absence of ASP. Reference Ciarkowski, Timbrook and Kukhareva24 Additionally, two INSPIRE (Intelligent Stewardship Prompts to Improve Real-Time Empiric Antibiotic Selection) randomized clinical trials evaluated the impact of a CPOE (computerized provider order entry) bundle versus routine ASP on empiric antibiotic prescribing for pneumonia and urinary tract infection (UTI) in non-critically ill adults. Reference Gohil, Septimus and Kleinman36,Reference Gohil, Septimus and Kleinman37 The CPOE bundle reduced extended-spectrum antibiotic therapy days by 28.4% for pneumonia and 17.4% for UTIs, with similar reductions for vancomycin and antipseudomonal use. No significant differences in safety outcomes or ICU transfers were observed in either trial.
Antimicrobial resistance prediction
Four studies evaluated AI’s use in predicting antimicrobial resistance. Reference İlhanlı, Park, Kim, Ryu, Yardımcı and Yoon15–Reference Lee, Seo and Kim18 One study used random forest ML models to predict antibiotic resistance in UTIs, achieving areas under the receiver operating characteristic curve (AUROCs) ranging from 0.777 for cephalosporins to 0.884 for fluoroquinolones. Reference İlhanlı, Park, Kim, Ryu, Yardımcı and Yoon15 Another study employed a CDSS integrated with ML and electronic health records, achieving AUROCs of 0.8 to 0.88 for resistance prediction when bacterial species data were included. Reference Lewin-Epstein, Baruch, Hadany, Stein and Obolski16 In the intensive care setting, a third study found that a random forest model had high specificity (0.84–0.99), while XGBoost and LightGBM demonstrated better sensitivity (up to 0.95). Reference Tran Quoc, Nguyen Thi Ngoc and Nguyen Hoang17 A final study used gradient-boosted decision trees to predict ciprofloxacin resistance and extended-spectrum beta-lactamase (ESBL) production in patients with UTI, achieving high sensitivity (∼93%) but lower specificity (31–45%). Reference Lee, Seo and Kim18 These findings show that machine learning performance varies with clinical context, objectives, and hospital resistance patterns.
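To make the metrics reported above concrete, the sketch below trains a random forest on entirely synthetic data and computes the AUROC, sensitivity, and specificity figures these studies cite. It is illustrative only: the feature names and coefficients are hypothetical, and this is not a reconstruction of any included study's model.

```python
# Illustrative sketch: random forest resistance prediction on SYNTHETIC data.
# Features and the label-generating process are hypothetical, chosen only to
# show how AUROC, sensitivity, and specificity are computed.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, confusion_matrix
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
# Hypothetical predictors: age, prior fluoroquinolone exposure (0/1),
# prior resistant isolate (0/1), hospital days in the past year.
X = np.column_stack([
    rng.normal(60, 15, n),
    rng.integers(0, 2, n),
    rng.integers(0, 2, n),
    rng.poisson(3, n),
])
# Synthetic resistance label: more likely with prior exposure/resistance.
logit = -2.0 + 1.5 * X[:, 1] + 2.0 * X[:, 2] + 0.05 * X[:, 3]
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

prob = model.predict_proba(X_te)[:, 1]          # predicted P(resistant)
auroc = roc_auc_score(y_te, prob)
tn, fp, fn, tp = confusion_matrix(y_te, model.predict(X_te)).ravel()
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
print(f"AUROC={auroc:.2f} sensitivity={sensitivity:.2f} specificity={specificity:.2f}")
```

The sensitivity/specificity trade-off the studies describe falls out of the classification threshold: lowering it catches more resistant isolates (higher sensitivity) at the cost of flagging more susceptible ones (lower specificity).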
Appropriateness of antibiotic prescriptions
Three studies evaluated AI-enhanced CDSS designed to improve antibiotic prescribing. Reference Neugebauer, Ebert and Vogelmann19–Reference MacFadden, Coburn and Shah21 One study demonstrated that a CDSS integrated with national guidelines improved diagnostic accuracy and reduced unnecessary antibiotic use in UTIs, increasing prescriber confidence. Reference Neugebauer, Ebert and Vogelmann19 A second study employed a random forest model within a CDSS to guide antibiotic selection for pediatric infections, particularly improving predictions of ceftriaxone resistance, with an estimated AUROC of 0.80. Reference Oonsivilai, Mo and Luangasanatip20 A third study assessed a CDSS leveraging patient data and prior cultures to improve empiric therapy for gram-negative bloodstream infections, enabling narrower-spectrum antibiotic use in 78% of cases while reducing inadequate treatment. Its impact varied with antibiotic susceptibility thresholds, and multivariable models showed good discrimination, with AUROCs of 0.68–0.89 for Gram-stain-guided and 0.75–0.98 for pathogen-guided models. Reference MacFadden, Coburn and Shah21
Accuracy of management
Three studies examined the accuracy of AI-enhanced CDSS in managing infections. Reference Herter, Khuc and Cinà26,Reference Tan, Kavishe and Luwanda27,Reference Wu, McLeod and Blyth35 One study reported a 5% improvement in treatment success for UTIs (defined as the absence of antibiotic prescriptions within 28 d) compared to usual care. Reference Herter, Khuc and Cinà26 Another study demonstrated that integrating CDSS with point-of-care testing significantly reduced antibiotic prescriptions by approximately 47% on day 0, contributing to improved antimicrobial stewardship. Reference Tan, Kavishe and Luwanda27 A third study used a Bayesian network to guide personalized therapy for osteomyelitis in children, achieving expert-rated optimal or adequate recommendations in 82%–98% of cases, despite initial underestimation of Staphylococcus aureus prevalence. Reference Wu, McLeod and Blyth35
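As a toy illustration of the probabilistic reasoning a Bayesian network embodies, the snippet below performs a single Bayes-rule update. All numbers are made up for the example; this is not the cited study's model, which conditions on many variables at once.

```python
# Toy Bayes-rule update: posterior probability that an infection is caused
# by a given pathogen (e.g. S. aureus) after a positive diagnostic finding.
# Prior, sensitivity, and specificity below are HYPOTHETICAL values.
def posterior(prior, sensitivity, specificity):
    """P(pathogen | positive finding) via Bayes' rule."""
    true_pos = sensitivity * prior                 # P(+ | pathogen) * P(pathogen)
    false_pos = (1 - specificity) * (1 - prior)    # P(+ | other) * P(other)
    return true_pos / (true_pos + false_pos)

# Hypothetical inputs: 60% prior prevalence, a finding with 90% sensitivity
# and 80% specificity.
p = posterior(prior=0.60, sensitivity=0.90, specificity=0.80)
print(f"posterior = {p:.3f}")  # → posterior = 0.871
```

Chaining many such conditional updates over a graph of clinical variables is what lets a Bayesian network refine empiric therapy as evidence accumulates, and also why a miscalibrated prior (like the underestimated S. aureus prevalence noted above) skews its recommendations.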
Transition from intravenous to oral antibiotics
Two studies focused on AI-assisted transitions from intravenous to oral antibiotics. Reference Ciarkowski, Timbrook and Kukhareva24,Reference Bolton, Wilson, Gilchrist, Georgiou, Holmes and Rawson25 One study, conducted in three phases, demonstrated that a CDSS facilitated earlier conversion to oral antibiotics, reducing the duration of intravenous therapy and achieving a 20% cost reduction without affecting length of stay. Reference Ciarkowski, Timbrook and Kukhareva24 Another study used machine learning to individualize the transition, reducing the Antimicrobial Spectrum Index by a mean of 23%, though variability in outcomes was noted. Reference Bolton, Wilson, Gilchrist, Georgiou, Holmes and Rawson25
Adherence to guidelines
One study evaluated AI-based CDSS for promoting guideline-concordant antimicrobial prescriptions in pediatric patients with community-acquired pneumonia. The system improved adherence to guidelines by 10% compared to usual care and facilitated faster initiation of antibiotics. Reference Williams, Martin and Nian28
Large language models
Six studies examined the use of AI across various LLMs. Reference Draschl, Hauer and Fischerauer29–Reference Fisch, Kliem, Grzonka and Sutter34 While LLMs performed well on straightforward queries, they struggled with complex cases that required nuanced clinical judgment. Consistent with this, three studies found reliable results for simple queries, with responses receiving high scores on scales such as the Likert and Global Quality Scales. Reference Draschl, Hauer and Fischerauer29–Reference Cakir, Caglar and Sekkeli31
Accuracy of management
Five studies assessed the accuracy of various LLMs in responses related to management. Reference Draschl, Hauer and Fischerauer29,Reference De Vito, Geremia and Marino30,Reference Tao, Liu, Cui, Wang, Peng and Nahata32–Reference Fisch, Kliem, Grzonka and Sutter34 One study found that AI performed poorly on management-related questions for prosthetic joint infections, achieving less than 45% accuracy in such scenarios. Although significant inter-rater reliability was observed for responses on diagnosis and treatment, treatment responses received the lowest scores, indicating low trustworthiness. Reference Draschl, Hauer and Fischerauer29 In another study examining various clinical scenarios, including CAP, bloodstream infections, endocarditis, and meningitis, GPT-4.0 demonstrated about 70% accuracy on true/false questions. However, its performance declined greatly on more complex questions, with accuracy dropping to approximately 37.5%. Subtopic analysis showed over 80% accuracy in endocarditis cases but less than 50% in CAP cases. Additionally, the AI often recommended overtreatment and failed to consider newer antibiotics, achieving only 10% accuracy in complex cases such as endocarditis and bacteremia. Reference De Vito, Geremia and Marino30 AI also faced challenges with infections in vulnerable populations, delivering incomplete or incorrect answers in up to 80% of pediatric cases. Error rates for treatment responses across children, pregnant individuals, adults, those with drug allergies, and patients with chronic kidney disease ranged from 11% to 44%, with the highest inaccuracies observed in pediatric-related questions. Reference Tao, Liu, Cui, Wang, Peng and Nahata32 In a study evaluating ChatGPT’s responses for managing bloodstream infections, it provided appropriate and optimal suggestions in about 35% of cases.
However, it offered inadequate or potentially harmful recommendations in 3% to 34% of cases for both definitive and broad-spectrum therapies. The remaining responses were deemed appropriate but not optimal. Reference Maillard, Micheli and Lefevre33 In one study evaluating the accuracy of seven different LLMs in managing meningitis, GPT-4 showed the most consistent performance, providing over 80% correct answers across all tasks (beyond just treatment suggestions). When asked about the correct empirical treatment, the models provided accurate suggestions in approximately 38% of cases, with Claude-2 and GPT-4 performing the best. Specifically, in recommending the addition of antivirals, only 33% of the models suggested it, with just half providing the correct dosage. Over 60% of the models opted not to provide a dosing recommendation. Reference Fisch, Kliem, Grzonka and Sutter34
Adherence to guidelines
Two studies assessed AI’s adherence to guidelines. Reference Cakir, Caglar and Sekkeli31,Reference Fisch, Kliem, Grzonka and Sutter34 For UTI, LLMs demonstrated 87% adherence to guidelines but produced incorrect responses in >10% of cases, raising safety concerns. Reference Cakir, Caglar and Sekkeli31 One study noted that adherence to guidelines across seven LLMs ranged from 53% to 85%, with lower task completion correlating with reduced response consistency. Reference Fisch, Kliem, Grzonka and Sutter34
Antimicrobial resistance
One study revealed that untrained GPT-4.0 demonstrated lower accuracy in identifying the correct resistance mechanism, with its best answers observed in the subtopic of community-acquired pneumonia. Reference De Vito, Geremia and Marino30
Quality assessment
Using the Downs and Black tool, 15 studies were rated as high quality, with scores ranging from 19 to 25 out of 28 points. Reference Tran Quoc, Nguyen Thi Ngoc and Nguyen Hoang17,Reference Lee, Seo and Kim18,Reference Oonsivilai, Mo and Luangasanatip20–Reference Tran-The, Heo and Lim22,Reference Bolton, Wilson, Gilchrist, Georgiou, Holmes and Rawson25–Reference De Vito, Geremia and Marino30,Reference Maillard, Micheli and Lefevre33,Reference Wu, McLeod and Blyth35–Reference Gohil, Septimus and Kleinman37 Six studies were rated as fair, scoring between 14 and 18 points Reference Lewin-Epstein, Baruch, Hadany, Stein and Obolski16,Reference Neugebauer, Ebert and Vogelmann19,Reference Ghamrawi, Kantorovich and Bauer23,Reference Ciarkowski, Timbrook and Kukhareva24,Reference Tao, Liu, Cui, Wang, Peng and Nahata32,Reference Fisch, Kliem, Grzonka and Sutter34, and two were classified as low quality, scoring 13 or fewer points. Reference İlhanlı, Park, Kim, Ryu, Yardımcı and Yoon15,Reference Cakir, Caglar and Sekkeli31 Detailed results are available in the Supplementary Material.
Discussion
In this systematic review, we evaluated the effectiveness of artificial intelligence (machine learning and large language models) in guiding appropriate antibiotic prescriptions for various infectious diseases. This review showed that AI, particularly when integrated into clinical decision support, can enhance clinical decision-making by improving antimicrobial resistance predictions and optimizing antibiotic selection. Seventeen studies illustrated the positive impact of AI in reducing unnecessary antibiotic use and improving treatment outcomes. However, LLMs, such as ChatGPT, performed markedly less effectively in complex management scenarios, frequently producing substantial errors that could compromise patient safety. These contrasting results emphasize the importance of context when implementing AI tools in clinical practice, reinforcing the critical need for consultation with infectious disease specialists to ensure accurate and individualized treatment strategies.
Previous studies explored the performance of ChatGPT in responding to multiple-choice and single-choice questions, demonstrating superior accuracy in single-choice formats. Reference Hoch, Wollenberg and Lüers2 However, as case complexity increased, accuracy declined significantly, a pattern consistent with our findings. While AI, including ChatGPT, holds promise in certain applications, its use in clinical settings must be approached with caution. Success in one domain does not necessarily translate across all specialties. Although AI-powered chatbots can deliver detailed drug information, many responses have been found to be inaccurate or potentially harmful, as documented in previous research. Reference Walensky, McQuillen, Shahbazi and Goodson4,Reference O’Leary, Neuhauser, McLees, Paek, Tappe and Srinivasan6 The main challenges to implementing LLMs in clinical practice are their lack of situational awareness, inference ability, and consistency, which could jeopardize patient safety. Reference Olakotan, Yusof and Puteh39 This aligns with the results of our review.
The studies included in this analysis assessed AI’s role in optimizing antibiotic prescriptions Reference Neugebauer, Ebert and Vogelmann19–Reference MacFadden, Coburn and Shah21, adherence to guidelines Reference Williams, Martin and Nian28, accuracy in infection management Reference Herter, Khuc and Cinà26,Reference Tan, Kavishe and Luwanda27,Reference Wu, McLeod and Blyth35, antimicrobial stewardship Reference Tran-The, Heo and Lim22–Reference Ciarkowski, Timbrook and Kukhareva24,Reference Gohil, Septimus and Kleinman36,Reference Gohil, Septimus and Kleinman37, and facilitating early transitions from intravenous to oral antibiotics. Reference Ghamrawi, Kantorovich and Bauer23–Reference Bolton, Wilson, Gilchrist, Georgiou, Holmes and Rawson25 A total of 23 studies were analyzed, including 18 non-randomized and five randomized studies. The AI systems varied considerably in their methodologies, with most studies rated as high methodological quality and the remainder rated fair to low. Our findings suggest that ML algorithms perform well in predicting antimicrobial resistance, with high sensitivity rates. This capability allows for earlier narrowing of antibiotic choices before culture results are available. However, the low specificity of these models necessitates cautious interpretation. Reference İlhanlı, Park, Kim, Ryu, Yardımcı and Yoon15–Reference Lee, Seo and Kim18
AI also showed promise in facilitating early transitions from intravenous to oral antibiotics, potentially reducing hospital stays and lowering healthcare costs. Reference Ciarkowski, Timbrook and Kukhareva24,Reference Bolton, Wilson, Gilchrist, Georgiou, Holmes and Rawson25 However, these systems face significant limitations. For instance, high error rates were observed in complex clinical cases (especially with LLMs), and while AI occasionally flagged contraindicated antibiotics, this function was inconsistently applied. Incorrect antibiotic recommendations in such cases could result in serious, even fatal, outcomes. Furthermore, the integration of AI into hospital systems requires substantial resources, including manpower, financial investment, and technical infrastructure, which may not be feasible in all healthcare settings. Excessive alerts generated by AI systems can lead to alert fatigue, a phenomenon well documented in prior studies Reference Ghamrawi, Kantorovich and Bauer23,Reference Gohil, Septimus and Kleinman36–Reference Kufel, Hanrahan and Seabury38, potentially diminishing trust and engagement among healthcare providers. While promising, LLM agents present risks including safety concerns, bias, and over-reliance, underscoring the need for strong regulation. Reference Olakotan, Yusof and Puteh39 Liability guidelines must evolve to address the dynamic nature of LLM-based systems, which adapt and evolve through interactions with external resources. Current static regulations fall short, requiring proactive measures to anticipate issues and failures. Reference Olakotan, Yusof and Puteh39
This review has several limitations. First, the majority of the included studies were non-randomized and non-blinded, which could affect the reliability and robustness of the findings. Second, there was a lack of focus on pediatric populations; only three studies specifically addressed pediatric groups, while one included vulnerable populations, including children, without focusing exclusively on them, limiting the generalizability of the findings to this demographic. Third, most studies did not directly compare AI-driven management strategies with recommendations from ID specialists, so the role of AI in the context of existing medical resources remains inconclusive, which may have influenced the accuracy assessments. Fourth, while four studies evaluated antimicrobial resistance, the rapidly evolving nature of resistance poses challenges to the sustained relevance of AI models, potentially limiting their long-term applicability. Furthermore, variability in expert opinion complicates the interpretation of AI-generated recommendations. Additionally, most included studies lacked detailed descriptions of the source materials used for AI tool development and inconsistently reported the incorporation of patient-specific parameters, limiting the evaluation of potential biases and the extent to which these tools fulfill the promise of personalized medicine. Finally, while clinical decision support can optimize antibiotic dosing by incorporating electronic health record data, such as renal and liver function, the high costs associated with implementing these systems may limit their accessibility, particularly in resource-constrained settings. Additionally, while we used the Downs and Black checklist for quality assessment Reference Downs and Black14, which is validated for clinical studies, the Roosan D. checklist Reference Roosan40 has not been validated for studies involving LLMs, which are included in our analysis.
In conclusion, AI holds promise, particularly in predicting antimicrobial resistance and optimizing antibiotic use. However, its current limitations highlight the essential role of infectious disease specialists in providing precise, personalized, and comprehensive care. The safe and effective integration of AI into clinical practice will depend on rigorous validation, continuous updates, and close collaboration with human expertise to ensure optimal outcomes.
Supplementary material
The supplementary material for this article can be found at https://doi.org/10.1017/ash.2025.47
Acknowledgments
We would like to acknowledge the authors of the primary studies included in this review and the departments of Infection Prevention and Antimicrobial Stewardship at Stanford Healthcare.
Financial support
This study was self-funded.
Competing interests
Authors have no commercial or financial involvement with this manuscript.