I. Introduction
The development of AI-based systems requires access to large amounts of high-quality data. This is particularly important for systems based on supervised and unsupervised machine learning. These solutions account for the majority of deployments in the medical AI market, which is estimated to be worth $22.45 billion in 2023Footnote 1 and is projected to grow to $164.10 billion by 2029, at a compound annual growth rate of 42.4%.Footnote 2 However, the use of medical data to train medical AI systems is subject to various restrictions, including legal,Footnote 3 ethicalFootnote 4 and organisational factors.Footnote 5 This can be illustrated by the example of avoiding bias. In practice, this problem arises, for example, when databases are created by merging patients’ data with public data, which is common practice.Footnote 6 This helps to create models with high relevance, especially for image generation or natural language processing.Footnote 7 However, it is difficult to control the content of such databases,Footnote 8 so if biased data is used for training, the model will be biased.
One of the solutions to the above-mentioned issues may be the use of synthetic data.Footnote 9 Currently, there is a discussion on the possibilities and conditions for its application in different sectorsFootnote 10 including the medical one.Footnote 11 This article contributes to this discussion by identifying the legal requirements of cybersecurity as one of the bases for risk assessment when using this data to train medical AI systems. It consists of four parts. The first discusses what synthetic data is and its prospects for use in the medical sector. The second focuses on the cybersecurity vulnerabilities of AI systems. The third presents the legal requirements for training medical AI systems from a cybersecurity perspective. The fourth concludes with assessing the feasibility of using synthetic data to train medical AI systems and making recommendations in this regard.
II. Synthetic data and its use in the medical sector
There is no legal definition of synthetic data. However, this concept is widely recognised in the technical literature, which makes it possible to establish factors distinguishing it from other types of data. Generally, two aspects have been highlighted in the studies. The first is its source: such data sets are created rather than collected. This has various consequences, among which is the conclusion that these sets will always be artificial in the sense that there are no equivalents in the “real” world. Moreover, a generating algorithm is required to create such data. It is underlined that many different methods may be employed to create synthetic data,Footnote 12 and that they may be based on different learning methodsFootnote 13 with their own advantages and disadvantages.Footnote 14
The second distinction is the relationship of the data to the real world. Synthetic data is a statistical reflection of the properties of the original set, which in most cases will be real-world data. In theory, data scientists should draw the same statistical conclusions from analysing a given set of synthetic data as they would from real data. In practice, synthetic data is statistically related to the set from which it was created. This does not mean that it accurately reflects reality. How accurately it does so depends on the database from which it is generated, and the selection of that database depends on the creator of the synthetic data. Naturally, the objection that they do not reflect the “real world” in epistemological terms can also be raised against real-world databases. However, in the case of synthetic data, explaining the accuracy may be much more complicated and lead to errors.
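The idea of a "statistical reflection" can be illustrated with a minimal sketch. It assumes the simplest possible generator, a Gaussian fitted to a single numeric column; the measurements and the function names (`fit_gaussian`, `generate_synthetic`) are hypothetical and are not taken from any method discussed in the literature cited here. Real generators are considerably more elaborate.

```python
import random
import statistics


def fit_gaussian(values):
    """Estimate the mean and standard deviation of a numeric column."""
    return statistics.mean(values), statistics.stdev(values)


def generate_synthetic(values, n, seed=0):
    """Draw n artificial values from a Gaussian fitted to the original column."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    mu, sigma = fit_gaussian(values)
    return [rng.gauss(mu, sigma) for _ in range(n)]


# Hypothetical "real" measurements (e.g. systolic blood pressure in mmHg).
real = [118.0, 125.0, 131.0, 142.0, 109.0, 137.0, 121.0, 128.0, 135.0, 115.0]

# The synthetic set is created, not collected.
synthetic = generate_synthetic(real, n=5000)

real_mean, real_sd = fit_gaussian(real)
synth_mean, synth_sd = fit_gaussian(synthetic)
print(f"real:      mean={real_mean:.1f} sd={real_sd:.1f}")
print(f"synthetic: mean={synth_mean:.1f} sd={synth_sd:.1f}")
```

The synthetic values reproduce the mean and spread of the chosen source set and nothing more. If that source set is unrepresentative, the synthetic set inherits the same distortion, which is why the choice of the original database matters.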
There are many positive aspects to the generation of synthetic data, especially when this data contains medical information. The first is a significant reduction in the cost of preparing the database. This includes cleaning, labelling and organising the raw data sets. For example, data can be extracted from the electronic medical record (EMR) used in the hospital. It contains different types of data, in particular consultations and diagnostic data, but it can be much broader and include pharmacy prescriptions, insurance records, and data from genomics-driven experiments such as genotyping or gene expression data.Footnote 15 It may also include automatically collected data from the Internet of Things (IoT).Footnote 16 In addition, healthcare professionals belong to different sectors, such as dentistry, medicine, nursing or physiotherapy, which may result in data being entered according to different methodologies. There is also a problem with data interoperability, especially when it comes from different medical facilities,Footnote 17 and the common practice has been to keep part of the documentation in the form of either handwritten notes or typed reports.Footnote 18 As a result of all these factors, transforming EMRs into high-quality databases for AI training purposes can be very resource-intensive. Synthetic data, by contrast, is generated by a specific algorithm and therefore follows a consistent, predefined structure.
Another issue is the ability to use synthetic data to easily increase the variety of the database in cases where access to patients is limited. This is a major challenge when the patient population is limited in number, as in the case of rare diseases, or when the ability to test is limited due to lack of patient consent, as in the case of pregnant people or children, or due to recruitment problems, which often occur in the case of disadvantaged groups.Footnote 19 The literature suggests that datasets in which such groups are underrepresented may be biased,Footnote 20 and that their use may lead to erroneous results and violations of fundamental rights.Footnote 21 One of the solutions to this problem is the use of synthetic data, especially in areas where it is highly relevant, such as image generation or natural language processing.Footnote 22
A frequently raised argument for the use of synthetic data in medicine is that it ensures patient privacy. This is an essential element of its use from a legal and ethical perspective, but it can pose a major challenge in practice.Footnote 23 The proponents point out that synthetic data must be considered as anonymised data and as such is not subject to data protection regulations.Footnote 24 Research shows that such a claim can be true, but only in specific cases and when additional conditions are met.Footnote 25 It will also not be possible to treat it as pseudonymised data in all cases, in particular if it shows sufficient structural equivalence to the original dataset or shares relevant properties or patterns that could lead to the attribution of information to an individual.Footnote 26 This leads to the conclusion that the current state of the art does not allow synthetic data to be regarded as synonymous with anonymised data. Therefore, it can be assumed that data protection legislation does apply to it. Even if we consider such data to be pseudonymised, it is clear from the GDPR that technical and organisational measures must be taken to protect it. Consequently, synthetic data for medical AI training requires a very careful analysis of the level of privacy offered by the collection and of the risk of violating the rights of the individuals whose data were used to create the dataset. This does not preclude the use of synthetic data to train medical AI systems, but it does limit its use due to the boundaries imposed by data protection requirements.
III. Cybersecurity vulnerabilities of AI systems
Like any IT system, AI is vulnerable to cyber threats. In the case of AI, the European Union Agency for Cybersecurity (ENISA) lists dozens of threats classified in eight main areas.Footnote 27 These can be divided into two main groups. The first comprises threats that affect all ICT systems, such as the theft of information, preventing authorised users from accessing data, or unauthorised modifications of data in the system. In this case, countermeasures are relatively well known and described.Footnote 28
The second group comprises vulnerabilities specific to artificial intelligence. The most serious are data poisoning and adversarial attacks. The former is a type of attack in which data or a model is altered to change the behaviour of an algorithm in a way that the attacker intends.Footnote 29 An example is teaching the algorithm that images of cancer represent healthy tissue, so that a similar image will be misinterpreted in the same way in the future. Such attacks can occur at most stages of the project lifecycle, but the data collection and training stages of the algorithm are particularly vulnerable.Footnote 30
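A minimal sketch of this mechanism, assuming a toy nearest-centroid classifier over a single hypothetical feature, is given below. The data, the feature values and the function names are illustrative assumptions; the attack shown is simple label flipping, one of several poisoning techniques, and is not drawn from any specific incident.

```python
def train(samples):
    """Compute one centroid (mean feature value) per label."""
    by_label = {}
    for feature, label in samples:
        by_label.setdefault(label, []).append(feature)
    return {label: sum(xs) / len(xs) for label, xs in by_label.items()}


def predict(model, feature):
    """Return the label whose centroid is closest to the feature value."""
    return min(model, key=lambda label: abs(feature - model[label]))


# Hypothetical training set: one image-derived score per sample.
clean = [
    (0.1, "healthy"), (0.2, "healthy"), (0.3, "healthy"),
    (0.7, "cancer"), (0.8, "cancer"), (0.9, "cancer"),
]

# Label-flipping attack: some cancer samples are relabelled as healthy
# before training, i.e. at the data collection stage.
poisoned = [
    (x, "healthy") if 0.65 <= x <= 0.85 else (x, y)
    for x, y in clean
]

clean_model = train(clean)
poisoned_model = train(poisoned)

query = 0.6  # a new case that resembles the cancer samples
print(predict(clean_model, query))     # cancer
print(predict(poisoned_model, query))  # healthy
```

The model code is identical in both runs. Only the training labels differ, which is why the data collection and training stages are the natural point of entry for this kind of attack.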
An adversarial attack consists of a small change to the algorithm’s input data that causes machine learning models to misclassify examples that are only slightly different from correctly classified ones. Consequently, there are significant changes in the results obtained, leading to decision errors.Footnote 31 For example, a change of one pixel in the image of a frog leads to the image being misclassified as a dog or a truck.Footnote 32 The effects of this attack usually occur during the last stage of a project’s lifecycle, i.e. during its practical application, and are therefore relatively difficult to detect. The task is also made more difficult by the fact that, in the case of images, it is essentially impossible for a human to identify images that have been deliberately altered.Footnote 33
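The effect of a small input change can be sketched on a toy linear classifier. The weights, bias, input values and function names below are hypothetical. The perturbation follows the sign of each weight, in the spirit of gradient-sign attacks, and is not the one-pixel attack mentioned above; it is used here only because it fits in a few lines.

```python
def score(weights, bias, features):
    """Linear decision score: positive means 'abnormal'."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def classify(weights, bias, features):
    return "abnormal" if score(weights, bias, features) > 0 else "normal"


def sign(value):
    return (value > 0) - (value < 0)


def perturb(weights, features, epsilon):
    """Shift each feature by at most epsilon in the direction that lowers the score."""
    return [x - epsilon * sign(w) for w, x in zip(weights, features)]


# Hypothetical trained model and a correctly classified input.
weights = [0.5, -0.3, 0.8, 0.2]
bias = -0.1
original = [0.2, 0.1, 0.1, 0.05]

# Every feature moves by no more than 0.05, yet the decision flips.
adversarial = perturb(weights, original, epsilon=0.05)

print(classify(weights, bias, original))     # abnormal
print(classify(weights, bias, adversarial))  # normal
```

Because the model itself is untouched, the manipulation only becomes visible at the point of use, which is consistent with the detection difficulty described above.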
Both types of attack have been reported by AI researchers for many years.Footnote 34 The literature describes their various taxonomies, methods of use and countermeasures.Footnote 35 Systems using synthetic data seem particularly vulnerable, as these attacks can take place at any stage of a system’s lifecycle.Footnote 36 Moreover, data poisoning and adversarial attacks are characterised by a high level of effectiveness.Footnote 37 From a medical AI perspective, they are considered particularly dangerous because a successful attack can result in a life-threatening or fatal outcome for the patient.Footnote 38 This risk is exacerbated by the difficulty of finding effective defences against this type of attack, in part because training the algorithm on inconsistent training data fails to reduce the risk of attack and because there is no correlation between the explainability of the algorithm and the effectiveness of the attack.Footnote 39
IV. The legal requirements for cybersecure training of medical AI systems
As stated above, there is no legal regulation that specifically addresses synthetic data, and therefore legal requirements will need to be reconstructed from regulations governing the cybersecurity of medical AI. The crucial issue in this regard is the distinction between AI that is a medical device and AI that does not fall into this category. In the latter case, although the impact of such solutions on the market may be significant, they cannot in principle be used by healthcare professionals.Footnote 40 For this reason, I will exclude them from further consideration.
In the case of medical devices, we can look for solutions in sectoral and horizontal cybersecurity regulations and data protection legislation. Among the sectoral regulations, the Medical Device Regulation (MDR) and the In Vitro Diagnostic Medical Devices Regulation (IVDR) will play an important role. These regulations are based on the idea that a device can only be placed on the market or put into service if it complies with the regulations, in particular with the general safety and performance requirements set out in Annex I. In addition, depending on the class to which the device is assigned, the legislation may impose additional requirements, such as the implementation and maintenance of a risk management system, the conduct of a clinical evaluation of the device, including post-market surveillance, or the preparation and updating of technical documentation.
AI solutions can be classified as medical devices, both as a stand-alone algorithm and as a part necessary for the functioning of the device.Footnote 41 In the EU, the MDR/IVDR do not specifically mention cybersecurity or AI, which follows from the assumptions that underpin the approach to medical device regulation in the EU.Footnote 42 This means that AI solutions are subject to the same rules as other medical devices. However, the regulations contain provisions for electronic programmable systems, understood as devices that incorporate electronic programmable systems and software that is a device in its own right. According to Section 17 of Annex I, such solutions must be designed and manufactured in accordance with the state of the art, taking into account the principles of development life cycle, risk management (including information security), verification and validation, and must ensure repeatability, reliability and performance in line with their intended use. In a similar vein, general guidelines are provided by the Medical Device Coordination Group. They highlight the need for security by design, security verification and validation testing, and security update management, but do not address the specific requirements or risks of artificial intelligence technologies.
Another piece of legislation that may be relevant to the regulation of cybersecurity requirements for medical AI is the AI Act. The proposalFootnote 43 included a solution to consider medical devices as high-risk systems, but such a qualification would not mean that the algorithm would be considered high-risk under the MDR/IVDR. At the same time, under Article 47, medical devices would not be subject to an additional conformity assessment procedure, and notification of serious incidents or malfunctions would be limited to those that constitute a breach of obligations under European Union law intended to protect fundamental rights. The Council’s General ApproachFootnote 44 reinforces the concept of imposing the requirements of the AI Act on medical AI. Recital 54a states that the AI Act “should apply without prejudice to more specific provisions laid down in certain sectoral legislation of the New Legislative Framework with which this Regulation should apply jointly.” In addition, Article 6(1) states that an AI system which is itself a product covered by European Union harmonisation legislation shall be considered as high risk if it is subject to third party conformity assessment for the placing on the market or putting into service of that product in accordance with that legislation. This does not mean that an AI system will automatically be considered “high risk” under the MDR/IVDR, but that a medical system will need to meet the additional requirements that the AI Act provides.
As regards horizontal legislation, the most important are the NIS and NIS2 Directives, which aim to create a legal framework for the development of national cybersecurity systems and networks for information exchange and cooperation between EU countries. The former, which is currently in force, imposes obligations on operators of essential services and digital service providers. Therefore, the mere fact of the implementation of any type of medical AI does not automatically bring an entity within the scope of the Directive. It is possible, however, that a decision by a Member State may confer on such an entity a status that becomes the source of its obligations.Footnote 45 Such an arrangement has led to differences in interpretation and thus in implementation in the Member States.Footnote 46 This led to regulatory work culminating in the adoption of the NIS2 Directive, which is due to be implemented in Member States by October 2024. Under this Directive, entities manufacturing medical devices as defined in Article 2(1) of Regulation (EU) 2017/745 are qualified as “important entities,” and consequently all the associated obligations apply to them. Furthermore, entities manufacturing medical devices that are considered critical during a public health emergency as defined in Article 22 of Regulation (EU) 2022/123 will be considered as “essential entities.” It should be noted that these entities will be subject to additional obligations under the CER Directive,Footnote 47 which explicitly mentions its application in point 5 of the Annex.
Another horizontal piece of legislation is the Cybersecurity Act. The European Cybersecurity Certification Framework, based on Article 46 of this Act, may refer directly to medical devices. In the case of medical AI, there are currently no such schemes, but it is possible that they will emerge in the future. For the time being, however, this possibility remains theoretical.
The final group of regulations that may impose requirements on the training of medical artificial intelligence are data regulations. These fall into two groups. The first is legislation aimed at implementing the European Data Strategy adopted in 2020.Footnote 48 These include legislation on the creation of data spacesFootnote 49 or the harmonisation of data access rules.Footnote 50 Studies on their impact on cybersecurity are available in the literature,Footnote 51 but as they are at the stage of legislative initiatives, I will not discuss them further due to possible major changes in the final version.
With regard to data protection legislation, the most important role is played by the GDPR, in particular Articles 5(1)(f) and 32. Given that personal data is a concept that is interpreted very broadly,Footnote 52 and that when personal and non-personal datasets are linked, the GDPR rules must be applied to all data,Footnote 53 it is reasonable to assume that the requirements of this legislation will apply to the vast majority of datasets used to train medical AI.
V. Conclusions
Synthetic data has many advantages that make it potentially useful for training artificial intelligence systems. At the same time, using it for this purpose introduces an additional layer that needs to be considered when analysing cybersecurity risks. It should be noted that the threat of an attack on medical AI is real. Although no such incident has been reported to date, given that medical infrastructure is one of the main targets of cyber attacks, it is reasonable to assume that such an attack will occur at some point. One way to mitigate the risk is to comply with legal requirements. At the EU level, there is a mosaic of regulatory requirements that have their sources in different pieces of legislation. It includes regulations governing the placing of medical devices on the market, norms on artificial intelligence, and horizontal regulations on cybersecurity. Several conclusions can be drawn from this.
The first conclusion is that there is no general prohibition on the use of synthetic data to train medical AI systems. However, it should be pointed out that this results mainly from the fact that the issue is relatively new and its legal implications are only now being analysed. Nevertheless, according to the principle “quod lege non prohibitum, licitum est” (what is not forbidden by law is allowed), producers of medical AI can use this data. At the same time, they will have to take full responsibility for the cybersecurity of the systems they produce and place on the market.
A further argument in favour of the possibility of using synthetic data in training medical AI algorithms is the assumption of technological neutrality of legal acts concerning the regulation of digital technologies, including cybersecurity.Footnote 54 According to this assumption,Footnote 55 the deployment of technological solutions unknown at the time of the adoption of the legislation is covered by the obligations that follow from it. In practice, this involves building a risk-based approach to such solutions into the legal text and formulating general obligations to design and manufacture them according to the state of the art. It should be emphasised that the legal acts analysed above contain such clauses. Consequently, they can be used as a basis for setting the boundaries for cybersecure training of medical AI algorithms based on synthetic data.
To conclude, using synthetic data to train medical AI requires a very detailed recognition and assessment of the risks involved. Meeting the legal requirements that are imposed by the regulations governing the cybersecurity of medical devices gives an indication of the risks that need to be taken into account. However, it appears that meeting legal requirements may not be sufficient to effectively prevent attacks. Thus, manufacturers of medical AI should also take into account areas that for various reasons are not regulated by law but are driven by industry standards. These include, for example, good practices in data collection and management or the creation of project documentation.Footnote 56 A comprehensive approach reduces risks of a different nature, which will positively influence the level of cybersecurity of the AI solution that synthetic data has been used to train.