Introduction
Healthcare-associated infections (HAIs) cause tens of thousands of deaths and cost approximately $3 billion and $27 billion annually in England and the U.S., respectively. Reference Guest, Keating, Gould and Wigglesworth1 Infection prevention and control (IPC) teams are critical to mitigating HAI pathogen transmission, but typical IPC methodologies rely on the intuition of infection control professionals, are time consuming, and can either over- or under-identify outbreaks. Reference Peacock, Parkhill and Brown2 As whole genome sequencing (WGS) becomes more affordable, routine WGS has proven effective in detecting pathogen clusters not meeting typical IPC HAI transmission criteria. For example, Sundermann et al. found that over 10% of isolates were part of genetically related clusters, and only 44% of the isolates identified to be genetically related were classified as transmissions using National Healthcare Safety Network (NHSN) criteria. Reference Sundermann, Penzelik, Ayres, Snyder and Harrison3,Reference Sundermann, Chen and Kumar4 Similarly, Australian investigators used WGS for HAI transmission surveillance and found that over 30% of patients acquired multi-drug-resistant (MDR) pathogens from the hospital. Reference Sherry, Gorrie and Kwong5 Other research has demonstrated the effectiveness of WGS in distinguishing between methicillin-resistant Staphylococcus aureus (MRSA) outbreaks and pseudo-outbreaks. Reference Talbot, Jacko and Petit6,Reference Blane, Raven and Brown7 These WGS-based efforts highlight the importance of timely identification of HAI transmission in order to facilitate outbreak control by IPC teams. A major barrier to real-time HAI analysis using WGS is the reliance on highly accurate Illumina short-read sequencing, which requires extensive preparation and batch processing, particularly for institutions that do not have large sequencing facilities. Reference Maljkovic Berry, Melendrez and Bishop-Lilly8
Oxford Nanopore Technologies (ONT) provide a long-read sequencing alternative that generally requires less sample preparation, can be adapted to sequence varying numbers of strains, and facilitates complete genome assemblies including plasmids. Reference Wagner, Dabernig-Heinz and Lipp9 Whereas ONT previously had unacceptably high error rates for assessing bacterial genetic relatedness, advancements such as improved V14 chemistries, double-sensor R10 nanopores, and enhanced consensus basecalling models have significantly increased accuracy. Reference Zhang, Li and Ma10 Thus, ONT sequencing could be a viable standalone real-time WGS HAI analysis option. In this study, we aimed to establish an ONT-only sequencing pipeline capable of rapidly and accurately producing WGS data to classify the genetic relatedness among potentially transmitted MDR pathogens in a tertiary care cancer hospital.
Methods
Study population
The study took place between August 2023 and March 2024 at the University of Texas MD Anderson Cancer Center (MDACC), a 760-bed tertiary care facility in Houston, TX, USA, and was approved by the MDACC quality improvement institutional review board. To optimize the chances of identifying transmitted MDR pathogens, we studied organisms previously identified as being at high risk of HAI transmission, namely MRSA, vancomycin-resistant Enterococcus faecium (VREfm), and carbapenem-resistant forms of Enterobacterales, Acinetobacter baumannii, and Pseudomonas aeruginosa. Reference Sundermann, Chen and Kumar4,Reference Gorrie, Da Silva and Ingle11 The electronic health record was screened twice weekly to identify bacteria of interest. Final inclusion was limited to organisms isolated from patients hospitalized for ≥ 48 hours at the time of infection onset or from patients with recent contact (≤ 30 days) with the MDACC healthcare system (ie admitted to the hospital or undergoing an outpatient procedure).
Genomic DNA extraction, long-read sequencing, and data analysis
Genomic DNA (gDNA) was extracted directly from plates obtained from the MDACC clinical microbiology laboratory using the GenElute™ Bacterial Genomic DNA Kit. Long-read libraries were prepared using the ONT Rapid Barcoding Kit 96 V14 and were sequenced on the MinION device using R10.4.1 flow cells following manufacturer instructions. All pod5 reads were basecalled using Dorado v0.5.1 in super high accuracy mode (model v4.3) with a minimum quality score filter of 8 to produce FASTQ files. Dorado v0.5.1 was also used to demultiplex and remove adapters from the sequencing results (GitHub: https://github.com/nanoporetech/dorado). Long-read assemblies were generated using an in-house hybrid assembly pipeline (GitHub: https://github.com/wshropshire/flyest). AMR gene presence was assessed using AMRFinderPlus v3.11.14. Reference Feldgarden, Brover and Gonzalez-Escalona12
Accuracy assessment of stand-alone ONT sequencing
Illumina sequencing was performed on 55 (22%) samples using the NextSeq500 platform at the MDACC core sequencing facility with a target sequencing depth of ∼100×. Trimmed paired-end Illumina short reads were quality controlled using FastQC. We used these short reads as input for the variant calling pipeline Snippy v4.6.0 with default variant calling parameters (https://github.com/tseemann/snippy) to assess the accuracy of complete ONT long-read assemblies (ie self vs self). In particular, variants detected using this pipeline were considered potential ONT long-read assembly errors. Using these results, we calculated the mean identity score (ie Q score) as follows: Q score = −10 × log10(total number of variants identified by Snippy / length of the ONT-assembled genome). We utilized Rasusa v0.8.0 (GitHub: https://github.com/mbhall88/rasusa) to subsample our ONT FASTQ files to coverages of 100×, 80×, 60×, 40×, and 20× for the 35 strains with at least 100× ONT coverage. The subsampled data were used to generate assembled genomes, and the Illumina data were used to identify potential errors as a function of coverage depth using the approach described above.
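The Q score defined above can be illustrated with a short worked example. The following Python sketch is not the study's code; the function name and the example values (4 variants in a 5 Mb genome) are hypothetical, and returning infinity when no variants are detected is an assumption, since the text does not state how that case was handled.

```python
import math

def assembly_q_score(num_variants: int, genome_length: int) -> float:
    """Phred-scaled identity: Q = -10 * log10(variants / genome length)."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    if num_variants < 0:
        raise ValueError("num_variants must be non-negative")
    if num_variants == 0:
        # Assumption: with no detected variants the score is unbounded.
        return math.inf
    return -10.0 * math.log10(num_variants / genome_length)

# Hypothetical example: 1 SNP + 3 INDELs found in a 5,000,000 bp assembly.
print(round(assembly_q_score(4, 5_000_000), 1))  # prints 61.0
```

Under this definition, each tenfold reduction in the per-base variant rate raises the Q score by 10, so a score near 60 corresponds to roughly one potential error per million bases.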
Measures of genetic relatedness
Sequence types (ST) and clonal complexes (CC) were determined using PubMLST databases. Reference Jolley and Maiden13 To identify strains with sufficient genetic similarity to indicate potential transmission, we conducted a two-step screening with cutoff values for different species derived from previously published studies (Table 1). Reference Kinnevey, Kearney and Shore14–Reference Martak, Meunier and Sauget17 First, SeqSphere+ software (Ridom SeqSphere+ version 9.0.10) was used for core genome multilocus sequence typing (cgMLST). The FASTA file from the ONT assembly was utilized, requiring at least 95% of cgMLST target genes for subsequent analysis. For strains that met the cgMLST cutoff, reference-based single nucleotide polymorphism (SNP) calling was performed using MINTyper version 1.1.0 directly on the FASTQ files generated from Dorado demultiplexing. Reference Hallgren, Overballe-Petersen, Lund, Hasman and Clausen18 For strains where cgMLST schema are currently not available in SeqSphere+ (eg Enterobacter cloacae), genetic relatedness was exclusively assessed using MINTyper with cutoffs derived from previously published data. Reference Sundermann, Chen and Miller19
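The two-step screening logic described above can be sketched as a small decision function. This is an illustrative Python sketch only: the class, function, and parameter names are hypothetical, and the numeric cutoffs in the example are placeholders, not the species-specific values from Table 1. Only the 95% cgMLST target requirement is taken from the text.

```python
from dataclasses import dataclass
from typing import Optional

MIN_CGMLST_TARGET_FRACTION = 0.95  # stated in the text

@dataclass(frozen=True)
class Cutoffs:
    # None means no cgMLST scheme is available for the species.
    max_allele_diff: Optional[int]
    max_snp_diff: int

def potentially_related(cutoffs: Cutoffs, snp_diff: int,
                        allele_diff: Optional[int] = None,
                        target_fraction: Optional[float] = None) -> bool:
    """Return True if a pair of isolates passes both screening steps."""
    if cutoffs.max_allele_diff is not None:
        # Step 1: cgMLST screen, only for assemblies with enough targets.
        if target_fraction is None or target_fraction < MIN_CGMLST_TARGET_FRACTION:
            return False
        if allele_diff is None or allele_diff > cutoffs.max_allele_diff:
            return False
    # Step 2 (or the only step when no scheme exists): SNP distance.
    return snp_diff <= cutoffs.max_snp_diff

# Placeholder cutoffs for illustration; NOT the values in Table 1.
with_scheme = Cutoffs(max_allele_diff=20, max_snp_diff=15)
snp_only = Cutoffs(max_allele_diff=None, max_snp_diff=15)
print(potentially_related(with_scheme, snp_diff=4, allele_diff=3, target_fraction=0.99))
print(potentially_related(snp_only, snp_diff=40))
```

Running the cheaper allele-based comparison first means that the read-level SNP calling is only needed for pairs that are already close at the cgMLST level.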
Integration of genetic and epidemiologic data to identify potential transmission
For strains that met our potential transmission genetic cutoff values, the IPC team was notified and potential epidemiologic links were evaluated using standard procedures, such as review of physical location during each MDACC admission (including service line at the time of isolate identification) and of common procedures. Given the observational nature of the study, no IPC interventions were made based on these data. Ultimately, transmission likelihood was classified based on these previously published definitions: (1) probable transmission, assigned to patients who stayed on the same ward with at least 24 hours of overlap; (2) possible transmission, assigned to patients who stayed on the same ward within 60 days without overlap; and (3) unlikely transmission, assigned when neither of the above criteria was met. Reference Sherry, Gorrie and Kwong5 Figure 1 illustrates the study workflow.
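The three transmission categories can be expressed as a rule over ward stays. The Python sketch below is a minimal illustration, not the study's implementation; the `Stay` type and function name are hypothetical. One case is not covered by the published definitions, namely a same-ward overlap shorter than 24 hours, and this sketch assumes it is classified as possible.

```python
from datetime import datetime, timedelta
from typing import List, NamedTuple

class Stay(NamedTuple):
    ward: str
    start: datetime
    end: datetime

def classify_pair(stays_a: List[Stay], stays_b: List[Stay]) -> str:
    """Classify a patient pair as probable, possible, or unlikely transmission."""
    best = "unlikely"
    for a in stays_a:
        for b in stays_b:
            if a.ward != b.ward:
                continue
            # Positive when the stays overlap, negative when separated.
            overlap = min(a.end, b.end) - max(a.start, b.start)
            if overlap >= timedelta(hours=24):
                return "probable"
            # Assumption: overlaps under 24 h fall into the "possible" class.
            if -overlap <= timedelta(days=60):
                best = "possible"
    return best

# Hypothetical stays for illustration.
p1 = [Stay("ward 1", datetime(2024, 1, 1), datetime(2024, 1, 5))]
p2 = [Stay("ward 1", datetime(2024, 1, 3), datetime(2024, 1, 8))]
p3 = [Stay("ward 1", datetime(2024, 2, 1), datetime(2024, 2, 3))]
p4 = [Stay("ward 2", datetime(2024, 1, 2), datetime(2024, 1, 6))]
print(classify_pair(p1, p2), classify_pair(p1, p3), classify_pair(p1, p4))
```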
Study cost estimation
The sequencing was performed by a single graduate student working 100% on the project. Labor costs were based on average MDACC cost for a clinical microbiology laboratory technician with 100% effort as of October 2024. Other costs included genomic DNA extraction kits, quality control assessment, ONT library preparation, and flow cell use.
Results
Standalone ONT sequencing data accuracy
ONT sequencing was performed on 242 unique clinical isolates from 216 patients with a species breakdown of 83 MRSA (33%), 41 VREfm (17%), 37 P. aeruginosa (15%), 31 E. coli (13%), 26 K. pneumoniae (11%), and 24 other species (10%) (Table S1). Once we had performed ONT sequencing of at least ten isolates of the five top species, we performed parallel Illumina sequencing on the same genomic DNA. We mapped the Illumina reads to the genomes assembled with ONT data to determine the number of SNPs and insertion-deletions (INDELs) (ie errors) in the ONT assemblies. The median number of SNPs was 1 (interquartile range [IQR]: 0–2), and the median number of INDELs was 3 (IQR: 1–5) (Figure 2A). The average Q score of the ONT-assembled genomes was 60.7 (IQR: 56.9–64.5). We observed that 87% of SNPs were transitions, with 42% being T to C and 33% being A to G (Figure 2B). The Integrative Genomics Viewer display of an example SNP is shown in Figure 2C. We conclude that relative to the Illumina gold standard, the ONT sequencing was highly accurate, although a low level of A to G and T to C transitions remained, likely because unique methylation motifs were not properly accounted for in the Dorado basecalling models (ie super high accuracy model v4.3) we used at the time of the study. Reference Sanderson, Hopkins and Colpus20
Higher sequencing coverage and coverage depth up to 200× generally improve assembly accuracy by reducing the likelihood of misassembly or missing sequences. Reference Sereika, Kirkegaard and Karst21,Reference Wick, Judd and Holt22 However, targeting lower coverage depth per sample allows for cost-effective sequencing of more samples per flow cell. Reference Sereika, Kirkegaard and Karst21 Thus, we aimed to optimize sequencing coverage to balance cost-effectiveness and accuracy. We used Rasusa to subsample the ONT files to achieve 100×, 80×, 60×, 40×, and 20× coverage. We found statistically significant differences in the total number of variants detected across the coverage depths we tested (P < 0.05, Kruskal–Wallis test), with a significantly higher number of total variants at 20× coverage relative to 100× coverage (P < 0.001, pairwise Wilcoxon rank-sum test). However, no significant differences in total variants were observed for coverage depths ≥ 40× (Figure 2D), suggesting that 40× may be a sufficient coverage depth cutoff beyond which greater depth provides diminishing returns in assembly quality.
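To make the idea of coverage-based subsampling concrete, the following Python sketch shows one simple way to draw a random subset of reads whose total length approximates a target depth (target bases = genome size × coverage). It illustrates the general principle only and is not Rasusa's implementation; the function name, the fixed seed, and the example read lengths are assumptions.

```python
import random
from typing import List

def subsample_to_coverage(read_lengths: List[int], genome_size: int,
                          coverage: float, seed: int = 0) -> List[int]:
    """Return indices of randomly chosen reads totalling ~coverage x genome_size bases."""
    target_bases = genome_size * coverage
    order = list(range(len(read_lengths)))
    random.Random(seed).shuffle(order)  # seeded for reproducibility
    kept, total = [], 0
    for i in order:
        if total >= target_bases:
            break
        kept.append(i)
        total += read_lengths[i]
    return kept

# Hypothetical data: 1,000 reads of 1 kb from a 10 kb genome (100x in total).
lengths = [1_000] * 1_000
for depth in (100, 80, 60, 40, 20):
    print(depth, len(subsample_to_coverage(lengths, 10_000, depth)))
```

If the input holds fewer bases than the target, every read is kept, which is why the comparison in the text was restricted to strains with at least 100× ONT coverage.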
Execution of our standalone ONT sequencing workflow
Over a 26-week study period, we screened 494 positive AMR culture reports (Figure 1). Of these, 246 met our inclusion criteria. Three strains were subsequently excluded due to discrepancies between the species identification in the clinical microbiology laboratory and by sequencing analysis, and one strain was excluded due to sequencing failure, leaving a total of 242 isolates for analysis. The most common isolate source was blood (36%, 88/242), followed by tissue/wound/body fluid (24%, 58/242), urinary tract (20%, 48/242), and respiratory tract (18%, 44/242). On average, 10 strains were sequenced per week. The fastest time from initiating gDNA extraction to completing data analysis was one day, with a mean duration of two days (IQR: 2–3.25 days). Detailed timelines of the workflow and sequencing quality control data are provided in Figure S1 and Table S1. Total cost for study execution including labor was estimated at $64,400, with a material cost of $11,300, based on 250 isolates sequenced using published prices as of October 11, 2024, for an average of ∼$250/isolate including labor and ∼$45/isolate without labor.
Overview of major sequence types amongst sequenced pathogens
Phylogenetic trees for each of the five major species are shown in Figures S2-6. Among the 31 E. coli isolates, approximately half belonged to ST131 (n = 16, 52%), followed by ST167 (n = 4, 13%) and ST361 (n = 2, 7%). The 26 K. pneumoniae isolates displayed diverse STs, with ST258 being the most common (n = 4, 15%), followed by ST45 (n = 3, 12%) and ST147 (n = 3, 12%). Of the 37 P. aeruginosa strains, 19% (n = 7) were identified as ST633, while the other isolates had unique STs. ST117 (25 of 41 total isolates, 61%) dominated the VREfm population, followed by ST80 (n = 6, 15%). Most MRSA isolates (n = 83) were from clonal complex 8 (CC8) (n = 38, 46%), CC5 (n = 30, 36%), and CC30 (n = 8, 10%). A heat map of major genes mediating acquired β-lactam resistance is shown in Figure S7. Except for the large number of P. aeruginosa ST633 strains, our AMR pathogen epidemiology generally aligned with predominant STs/CCs reported globally. Reference Turner, Sharma-Kuinkel and Maskarinec23–Reference Mills, Martin and Luo27
Characterization of potential transmission using genetic relatedness assessment
Using our two-step screening process with genetic cutoff values to assess transmission shown in Table 1, we identified 21 isolates (8.7%) meeting the criteria for possible transmission, forming five genetically related clusters (Figure 3). Three clusters were ST117 VREfm, and one was ST80 VREfm, with 46% of VREfm strains (19/41) meeting the genetic relatedness cutoff for possible transmission. Another cluster involved two ST45 K. pneumoniae isolates. Table S2 details the detected clusters. To visualize potential healthcare transmissions, we used patient-ward-movement timelines to integrate genomic and epidemiological data. Figure 4 shows the largest ST117 VREfm cluster, comprising 12 unique patients. Within this cluster, seven patients shared overlapping stays in ward 1 and ward 4 for at least one day, indicating a probable transmission. Two patients were on ward 2 at different times but within 60 days, suggesting a possible transmission. However, three patients did not have overlapping admissions or locations. Potential epidemiological links were present in >70% (15/21) of isolates that met our genetically related cutoff values.
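One simple way to turn pairwise relatedness calls into clusters is single-linkage grouping, where any two isolates connected by a chain of related pairs end up in the same cluster. The text does not state how clusters were formed, so single linkage is an assumption here, and the Python sketch below (with hypothetical isolate names) is offered only as an illustration.

```python
from typing import Dict, Iterable, List, Tuple

def single_linkage_clusters(isolates: Iterable[str],
                            related_pairs: Iterable[Tuple[str, str]]) -> List[List[str]]:
    """Group isolates connected by related pairs; singletons are omitted."""
    parent: Dict[str, str] = {i: i for i in isolates}

    def find(x: str) -> str:
        # Follow parent links to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in related_pairs:
        parent[find(a)] = find(b)  # merge the two clusters

    groups: Dict[str, List[str]] = {}
    for i in parent:
        groups.setdefault(find(i), []).append(i)
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)

# Hypothetical isolates and pairs that met the genetic cutoffs.
names = ["A", "B", "C", "D", "E", "F"]
pairs = [("A", "B"), ("B", "C"), ("D", "E")]
print(single_linkage_clusters(names, pairs))  # prints [['A', 'B', 'C'], ['D', 'E']]
```

Single linkage can chain together isolates that are not all within the cutoff of each other, so stricter grouping rules may be preferable when clusters grow large.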
Use of ONT data to assess potential plasmid transmission
Given the ability to assess completely resolved plasmids using ONT long-read sequencing, Reference Wick, Judd, Wyres and Holt28 we also sought to determine whether our ONT sequencing pipeline could assess the possibility of plasmid transfer among different bacterial strains/species. We focused on plasmids containing bla NDM genes, which encode New Delhi metallo-β-lactamase enzymes conferring resistance to a broad range of β-lactam antibiotics. Reference Sakamoto, Akeda and Sugawara29 Table S3 lists eight isolates carrying bla NDM genes from seven unique patients, with two isolates originating from the same urine culture but identified as strains with slightly different AMR profiles. Given that all detected bla NDM-5 genes were carried by IncF type replicon plasmids, we analyzed pairwise SNP distances to study potential horizontal plasmid transfer (Table S4). Only plasmids from the two E. coli strains from the same sample fell below our predefined cutoff value (<15 SNPs/100 Kb). Reference Raabe, Valek and Griffith30 Thus, we found no evidence of potential bla NDM transmission occurring during our study period.
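The plasmid comparison relies on a length-normalized SNP distance (SNPs per 100 Kb) checked against the <15 SNPs/100 Kb cutoff. The Python sketch below shows that arithmetic; the function names and example values are hypothetical, and using the aligned (shared) plasmid length as the denominator is an assumption not stated in the text.

```python
def snps_per_100kb(snp_count: int, aligned_length_bp: int) -> float:
    """Normalize a pairwise SNP count to SNPs per 100,000 aligned bases."""
    if aligned_length_bp <= 0:
        raise ValueError("aligned_length_bp must be positive")
    return snp_count * 100_000 / aligned_length_bp

def below_plasmid_cutoff(snp_count: int, aligned_length_bp: int,
                         cutoff: float = 15.0) -> bool:
    """True when the normalized distance is under the cutoff (<15 SNPs/100 Kb)."""
    return snps_per_100kb(snp_count, aligned_length_bp) < cutoff

# Hypothetical plasmid pairs, each with 120,000 aligned bases.
print(snps_per_100kb(3, 120_000), below_plasmid_cutoff(3, 120_000))
print(snps_per_100kb(30, 120_000), below_plasmid_cutoff(30, 120_000))
```

Normalizing by length keeps the threshold comparable across plasmids of different sizes, which a raw SNP count would not.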
Discussion
There is increasing recognition that analysis of genetic relatedness among bacteria causing infections in healthcare settings could substantially add to IPC efforts to mitigate pathogen acquisition. Reference Guest, Keating, Gould and Wigglesworth1,Reference Sundermann, Chen and Miller19 However, widespread use of such approaches is currently limited by a host of factors including difficulty providing timely data in a cost-effective fashion. Reference Werner, Couto and Feil31 Herein, we demonstrate that a low-resource infrastructure using an ONT sequencing pipeline can produce accurate WGS information in a time frame commensurate with impactful IPC interventions.
Given its flexibility and low instrumentation requirements, ONT sequencing has long been considered as potentially impactful on infectious diseases surveillance such as its use in field-based sequencing during the 2015 Ebola outbreak. Reference Quick, Loman and Duraffour32 However, prior to the release of the R10.4.1 flow cells and Dorado basecalling algorithm, the high ONT error rate meant that it was not sufficiently accurate to determine whether bacteria were potentially part of a transmission network. Reference Wagner, Dabernig-Heinz and Lipp9,Reference Sereika, Kirkegaard and Karst21 When analyzing a diverse array of AMR pathogens, we found that stand-alone ONT sequencing generated complete genomes that generally varied from the gold standard Illumina by only 1-2 SNPs, which is far less than the SNP thresholds used to call transmission (Table 1). These data are consistent with studies emerging from other investigations using recent ONT pipelines, which generally have analyzed historical cohorts. Reference Wagner, Dabernig-Heinz and Lipp9,Reference Sereika, Kirkegaard and Karst21,Reference Bogaerts, Van den Bossche and Verhaegen33 With advancing technology, it is likely that the ONT basecalling accuracy will continue to improve. Reference Dabernig-Heinz, Lohde and Holzer34,Reference Hall, Wick and Judd35 Moreover, our data were generated prospectively and analyzed by a single operator indicating the low resource utilization of our approach. In light of the tremendous impact of HAI pathogens, modeling has indicated potential cost savings with routine WGS. Reference Fox, Saunders and Jerwood36 However, the high costs of a core sequencing facility capable of generating Illumina data within a time frame needed for effective IPC efforts may limit its application to research centers. Reference Sundermann, Chen and Kumar4,Reference Sherry, Gorrie and Kwong5 We envision that ONT sequencing could be effectively utilized by a broad variety of biomedical facilities to limit HAI transmission, effectively democratizing the integration of WGS into IPC efforts. Reference Wagner, Dabernig-Heinz and Lipp9
Previously, routine sequencing of a diverse array of HAI pathogens showed that particular species were more likely to meet genetic cut-offs for being potentially healthcare transmitted, which caused us to focus on a high-risk group of bacterial pathogens. Reference Sundermann, Chen and Kumar4 Still, our finding that 8.7% of such isolates met the criteria for potential transmission was lower than the 10.8% reported in the Pittsburgh, U.S.A. based study Reference Sundermann, Chen and Kumar4 or the 30% reported in a recent investigation from Australia. Reference Sherry, Gorrie and Kwong5 One potential explanation is that with our relatively low number of samples, we lacked the power to fully identify clusters. Additionally, we almost exclusively analyzed clinical infection samples whereas many of the clusters identified in the Australian study came from active colonization surveillance, and thus we may have missed transmission instances that did not result in clinical disease. Finally, given the highly immunocompromised nature of our patients, robust IPC efforts at our hospital, such as routine gloves and masking when entering the rooms of certain types of patients, may mitigate transmission. In spite of some differences, a strikingly consistent finding between our study and others is the very high rate of genetic relatedness among VREfm, which was 46% in our study, 36% in the Pittsburgh study, and 92% in the Australian investigation. Reference Sundermann, Chen and Kumar4,Reference Sherry, Gorrie and Kwong5 These findings are not limited to a single ST and suggest that much work remains to be done regarding limiting transmission of VREfm in healthcare settings. Conversely, unlike several recent studies, we found no clusters of closely related MRSA strains despite sequencing 80+ isolates. Reference Blane, Raven and Brown7,Reference Kinnevey, Kearney and Shore14 Thus, our data also suggest that WGS resources could be targeted using adaptive strategies dependent upon local findings rather than sequencing all drug-resistant pathogens.
In summary, we demonstrate that a low-resource, stand-alone ONT sequencing platform shows promise for real-time monitoring of healthcare-associated transmission and outbreak detection. Integration of such an approach into IPC efforts could assist with limiting HAIs in a wide variety of healthcare settings.
Supplementary material
The supplementary material for this article can be found at https://doi.org/10.1017/ice.2024.202
Data Availability Statement
ONT sequencing FASTQ files are available in the NCBI BioProject database (PRJNA1150149). Scripts for basecalling are available at github.com/wshropshire/misc_scripts. SeqSphere+ output as well as R scripts used for infection control and phylogeny analysis can be accessed at the Figshare repository: https://doi.org/10.6084/m9.figshare.26809480.v1.
Acknowledgments
We thank the members of the MDACC clinical microbiology laboratory for providing isolates in a timely fashion. The authors acknowledge the support of the high-performance computing research facility at the University of Texas MDACC for providing computational resources that have contributed to the research results reported in this paper. We acknowledge that some pictures were created with BioRender.com. The authors thank Andrew Yang for scientific writing and editorial assistance.
Financial Support
Core grant CA016672 (Advanced Technology Genomics Core—ATGC) and NIH grant 1S10OD024977-01 provided funding for the ATGC sequencing facility at MDACC. C.-T.W. was supported by a Peter and Cynthia Hu scholarship. W.C.S. was supported through the National Institute of Allergy and Infectious Diseases (NIAID) T32 AI141349 Training Program in Antimicrobial Resistance. Support for this study was also provided by NIAID grants R21AI151536 and P01AI152999 for S.A.S and T.J.T. and provided by the National Science Foundation (NSF: EF-2126387, IIS-2239114) to T.J.T.
Conflicts of Interest
The authors declare that there are no conflicts of interest.
Ethical Standards
This study received approval from the University of Texas MDACC Quality Improvement Assessment Board (protocol ID no. QIAB-1051).