
Data science and artificial intelligence in biology, health, and healthcare

Published online by Cambridge University Press: 14 February 2025

Peter L. Elkin*
Affiliation:
Department of Biomedical Informatics, Jacobs School of Medicine and Biomedical Sciences, University at Buffalo, Buffalo, NY, USA
Christopher Lindsell
Affiliation:
Duke University, Durham, NC, USA
Julio Facelli
Affiliation:
Department of Biomedical Informatics, University of Utah, Salt Lake City, UT, USA
Manisha Desai
Affiliation:
Stanford University School of Medicine, Stanford, CA, USA
Chunhua Weng
Affiliation:
Department of Biomedical Informatics, Columbia University, New York, NY, USA
Heidi Spratt
Affiliation:
Department of Biostatistics and Data Science, University of Texas Medical Branch, Galveston, TX, USA
Shari Messinger
Affiliation:
Department of Public Health Sciences, Division of Biostatistics and Bioinformatics, University of Miami, Coral Gables, FL, USA
Lemuel Russell Waitman
Affiliation:
Department of Biomedical Informatics, Biostatistics and Medical Epidemiology, Missouri University School of Medicine, Cincinnati, OH, USA
JaMor Hairston
Affiliation:
Department of Biomedical Informatics, Emory University School of Medicine, Birmingham, AL, USA
Ruth O’Hara
Affiliation:
Stanford University School of Medicine, Stanford, CA, USA
Jareen Meinzen-Derr
Affiliation:
Department of Pediatrics, Cincinnati Children’s Hospital Medical Center, University of Cincinnati College of Medicine, Cincinnati, OH, USA
*Corresponding author: P.L. Elkin; Email: [email protected]

Abstract

Type
Perspective
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2025. Published by Cambridge University Press on behalf of Association for Clinical and Translational Science

Biomedical science is at an inflection point: classical thinking about how to conduct translational research is being transformed. Newly available data sources, combined with exponential advances in AI, statistics, and computer science, enable advances at levels of complexity that we could not previously have thought possible [1]. Investigating modern biomedical questions of significance requires data scientists to serve as partners in data-driven science while continuing to lead research in data science, biomedical informatics, and biostatistics [2]. Progress is hobbled by a shortage of data scientists cross-trained in the biomedical sciences. As resources shift to centralized and connected repositories of increasingly complex data, the data science and AI community urges immediate action to ensure that we draw unbiased conclusions from the vast information available to modern researchers [3]. Specifically:

  • For science to be data-driven, qualified data scientists should be integrally engaged from project onset to limit fatal flaws in methodology, analysis, and model development.

  • Major data science initiatives should be coordinated to prevent redundancy and inefficiencies. This includes the consistent use of commonly accepted standards for codifying health data.

  • It is critical to rapidly grow a qualified workforce of data scientists who can be counted on to exhibit a common core set of competencies, as well as to provide quantitative science training in the biomedical science curricula.

  • Artificial Intelligence (AI) in biomedicine is a rapidly evolving field with significant promise and also significant risk. The elimination of bias, Ethical, Legal, and Social Issues (ELSI), the selection of data and populations on which to train models, and other issues must continue to be addressed.

  • As we learn how to safely and effectively train and use AI in healthcare and research applications, it is essential that we establish frameworks for evaluating whether and when data science methods and AI are fit for purpose.

  • The FAIR principles (Findable, Accessible, Interoperable, and Reusable) should be emphasized, along with data cleaning, storage, indexing, and exchange, and the accessibility of data fit for purpose. Publishing high-quality data resources should be rewarded similarly to publications in peer-reviewed journals.

  • Congress should increase the NIH budget specifically earmarked for data science research and education, informatics, and biostatistics to address this immediate and critical need. We suggest that good homes for the funding include NCATS, NLM, and then other interested NIH Institutes and Centers (ICs).

  1. Set expectations for the early inclusion of data scientists in all clinical biomedical research.

Data scientists provide novel and independent contributions requiring a deep knowledge base and creativity. While team science is lauded, until data scientists are recognized for their pivotal contributions as collaborating scientists on the team, there will continue to be a barrier to their focusing on biomedical questions. Funding agencies can continue to help address this barrier by ensuring that data-centered science includes data scientists in named, recognized leadership roles, such as multi-PI roles, as is increasingly being observed.

  2. Coordinate major data science and AI initiatives among federal agencies.

There are increasing demands on the biomedical research community to meet the data science needs of multiple federal agencies. Imposing multiple common data models creates competing workstreams for an already overburdened and under-supported health information technology ecosystem and for the data science workforce. Lack of consistency creates inefficiencies, and sharing of noninteroperable data can lead to errors in reporting. Mapping between data standards, data repositories, and common data models creates efficiency, but at the cost of information and quality. Deciding on a common set of interoperability standards would be the true path to semantic interoperability for our research and clinical care enterprise. Semantic interoperability requires formalisms well beyond selecting a common data model.

  3. Grow the data science and AI workforce and establish core competencies and subspecialties for data scientists in the fields of biomedical research.

There is a national shortage of data scientists to meet the needs of the future as envisioned. This deficit can be addressed both by creating new training programs and by expanding existing programs. As established and emerging data sciences, including AI and machine learning, evolve, recognition of subspecialty skills when forming scientific teams is critical: biostatisticians, informaticians, and data engineers are not interchangeable. The paradigm for training may also need to evolve: as programs sprint to keep pace with rapidly changing technology, it is critical to ensure that the core competencies needed for extracting knowledge from data are addressed alongside the range of technical proficiencies that can be achieved. Doctoral, postdoctoral, and continuing data science training and education programs should consider the sheer breadth of the activities that constitute data science and ensure a common understanding of the general and specialty-specific knowledge base and competencies needed to train a functional data scientist workforce [4]. The aim is to create a generation of lifelong learners.

Addressing these challenges will require a major coordinated response from academia, industry, and government agencies. The many existing efforts can be optimized for impact with a clearer recognition and understanding of both core competencies and subspecialty skills, by applying standards for technology stacks and data stores, and by providing a reward and value system for data scientists contributing to biomedical research. Without a coordinated focus on these issues, we expect the gaps in expertise across the research enterprise to widen. However, greater attention to the biomedical data science workforce has the potential to catalyze efficiencies and shorten the time needed to reach robust answers to biomedical questions, bringing new treatments to patients more rapidly. Strong data science is expected to result not only in high-quality reproducible research but also in a transformation of how clinical and translational research is performed. Specifically, we can use existing data to help us understand which studies are more likely to show positive results. We can judge likely toxicities ahead of expensive phase III clinical trials and, in so doing, limit the number of negative trials. We can improve recruitment to and execution of clinical trials using real-world evidence. We can conduct postmarket surveillance more effectively [5] to find rare but serious side effects and deliver the right treatments to the right patients at the right time.

Author contributions

Peter L. Elkin wrote the article, edited, and reviewed; Christopher Lindsell wrote the article, edited, and reviewed; Julio Facelli edited and reviewed; Manisha Desai edited and reviewed; Chunhua Weng edited and reviewed; Heidi Spratt edited and reviewed; Shari Messinger edited and reviewed; Russ Waitman edited and reviewed; JaMor Hairston edited and reviewed; Ruth O’Hara edited and reviewed; and Jareen Meinzen-Derr authored and reviewed.

Funding statement

This work has been supported in part by grants from NIH NLM T15LM012495, R25LM014213, NIAAA R21AA026954, R33AA0226954, and NCATS UL1TR001412 (PLE). We acknowledge resources/support from the Miami Clinical and Translational Science Institute, which is supported by the National Center for Advancing Translational Sciences, National Institutes of Health, Award Number UM1TR004556 (SM). Research reported in this publication was supported by the Washington University Institute of Clinical and Translational Sciences grant UL1TR002345 from the National Center for Advancing Translational Sciences (NCATS) of the National Institutes of Health (NIH) (RW). This publication was supported by the National Center for Advancing Translational Sciences, National Institutes of Health, through Grant Number UL1TR001873 (CW). This publication was supported by the National Center for Advancing Translational Sciences, National Institutes of Health CTSA award to the Utah Clinical and Translational Science Institute UM1TR004409 (JF). Research reported in this publication was supported by the Stanford University Clinical and Translational Sciences Award UL1TR003142 from the National Center for Advancing Translational Sciences (NCATS) of the National Institutes of Health (NIH) (MD). This publication was supported by the National Center for Advancing Translational Sciences, National Institutes of Health, through Grant Number UL1TR001425 (JMD).

Competing interests

None.

References

1. Friedman, AB, Delgado, MK, Weissman, GE. Artificial intelligence for emergency care triage—Much promise, but still much to learn. JAMA Netw Open. 2024;7(5):e248857. doi: 10.1001/jamanetworkopen.2024.8857.
2. Chung, P, Fong, CT, Walters, AM, Aghaeepour, N, Yetisgen, M, O’Reilly-Shah, VN. Large language model capabilities in perioperative risk prediction and prognostication. JAMA Surg. 2024;159:928. doi: 10.1001/jamasurg.2024.1621.
3. Zink, A, Chernew, ME, Neprash, HT. How should medicare pay for artificial intelligence? JAMA Intern Med. 2024;184:863. doi: 10.1001/jamainternmed.2024.1648.
4. Resendez, S, Franklin, G, Stephens, R, Maness, H, Chamala, S, Elkin, PL. Analyzing the efficacy of an open access biomedical informatics boot camp. Stud Health Technol Inform. 2024;316:1545-1546. doi: 10.3233/SHTI240711.
5. Elkin, PL, Mullin, S, Mardekian, J, et al. Using artificial intelligence with natural language processing to combine electronic health record’s structured and free text data to identify nonvalvular atrial fibrillation to decrease strokes and death: evaluation and case-control study. J Med Internet Res. 2021;23(11):e28946. doi: 10.2196/28946.