The role of synthetic data in healthcare

Trust is key to unlocking the full potential of synthetic data in healthcare and to ensuring its effective, secure and fair use.

Healthcare systems around the world face increasing patient volumes and a dramatic rise in health challenges. 

As healthcare costs continue to rise, governments, regulators, and providers are increasingly challenged to innovate and transform healthcare delivery models. The integration of AI technologies is becoming an ever more attractive solution. 

The rapid adoption of AI across society puts pressure on the health sector to follow suit in order to improve patient care and reduce costs.

This creates a dual pressure: accelerating technology integration while managing compliance, which adds complexity to the implementation process.

Synthetic data is artificially generated to resemble real-world data. It can be structured (quantitative, tabular) or unstructured (images, text, video), and it has emerged as a powerful way to address data access challenges in the healthcare sector.

 

How can synthetic data improve AI in healthcare?

The use of AI-generated synthetic data supports use cases where data is scarce or costly to access. It can potentially enhance healthcare across various domains, such as clinical trials and drug development, patient care and diagnostics, personalized medicine, medical imaging, public health planning, disease outbreak prediction, resource allocation, policy impact assessment, virtual reality simulations, and academic research.

 

Potential benefits but huge quality assurance gaps 

Despite the numerous potential benefits of synthetic data, ensuring its effective, secure, and fair use requires several critical steps, as outlined below.

One of the primary challenges in implementing AI in the health sector is the availability of high-quality data. Effective AI systems require vast amounts of accurate, complete, and labelled data, which is often not available due to high costs, privacy concerns, and regulatory constraints.

Synthetic datasets can be generated to represent diverse patient populations and to enrich real-world data with previously unobserved cases at a lower privacy risk, thus addressing the challenge of limited access to high-quality data for validation purposes.
 
Synthetic data can augment machine learning algorithms and improve public health models by providing a rich source of information that reflects real-world scenarios without compromising patient confidentiality, as it preserves the statistical properties of the real data it replicates. Deep generative models can learn the underlying statistical patterns of a dataset and thereby maintain the interrelationships between data points and capture underlying medical information. Synthetic data can simulate electronic health records in various formats, creating a complete patient journey that helps improve the quality of care and inform the development of new therapeutics.

Using synthetic data rather than real-world data can protect patient confidentiality, as it offers a low-value target, minimizing the risk and impact of breaches. This is particularly useful when sharing data with third parties. It can in turn improve data availability by alleviating concerns about leaking original data, making healthcare providers and practitioners more willing and able to share their data for research and other purposes.

Creating new datasets that address gaps or biases in existing data can help AI models produce more accurate and reliable analyses.

 

Using synthetic data with confidence and DNV’s role 

Despite its potential, synthetic data in healthcare is not yet mature due to several challenges. The lack of standardized quality assessment metrics, potential for bias amplification, and concerns about data privacy versus realism and accuracy limit its widespread adoption and reliability for critical healthcare applications. 

This is the datascape DNV invests in through several research initiatives, including participation in the EU research project consortium SYNTHIA and a Ph.D. project on synthetic data. The research focuses on how synthetic data can support the safe uptake of AI in the healthcare sector through quality assurance and regulatory processes.

 

DNV’s research on quality assurance is centered around five dimensions that assess fitness for use of a dataset:

  1. Similarity

Synthetic data must closely resemble real patient data in terms of patterns and characteristics. This similarity ensures that AI systems trained on synthetic data can make accurate predictions and decisions when applied in real healthcare settings. While synthetic data should mimic the properties of the original data, it should not become so similar that it poses a re-identification risk.
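One simple way to quantify similarity for a single categorical column is the total variation distance between the real and synthetic value frequencies. The sketch below is illustrative only and is not taken from DNV's research; the function name and the blood-type samples are made up for the example, and a real assessment would also look at numerical columns and relationships between columns.

```python
from collections import Counter


def total_variation_distance(real, synthetic):
    """Total variation distance between two categorical samples.

    Returns 0.0 when the category frequencies match exactly and 1.0 when
    the two samples share no categories at all.
    """
    real_counts = Counter(real)
    synth_counts = Counter(synthetic)
    categories = set(real_counts) | set(synth_counts)
    return 0.5 * sum(
        abs(real_counts[c] / len(real) - synth_counts[c] / len(synthetic))
        for c in categories
    )


# Hypothetical samples of one column (blood type) from each dataset.
real_blood_types = ["A", "A", "B", "O", "O", "O", "AB", "A"]
synthetic_blood_types = ["A", "B", "B", "O", "O", "O", "A", "A"]

tvd = total_variation_distance(real_blood_types, synthetic_blood_types)
print(f"Total variation distance: {tvd:.3f}")
```

A value close to zero suggests the synthetic column reproduces the real frequencies well; what counts as close enough depends on the use case.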

  2. Utility

Utility assesses whether synthetic data can replace real data for specific analytical tasks. This dimension focuses on task-specific performance metrics, such as: 

  • Predictive Accuracy: Comparing the performance of machine learning models trained on synthetic versus real data. 
  • Downstream Applications: Evaluating how well synthetic data supports tasks like risk prediction, diagnosis, or treatment planning.

Utility is critical for determining the practical applicability of synthetic data, especially in high-stakes healthcare scenarios where data-driven decisions can impact patient care. 
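The predictive accuracy comparison described above is often run as "train on synthetic, test on real": one model is trained on real data, another on synthetic data, and both are scored on the same held-out real records. The sketch below assumes a deliberately tiny nearest-centroid classifier and invented records (age, systolic blood pressure, binary risk label) so that it runs with the standard library alone; a real evaluation would use the actual model and task of interest.

```python
from statistics import mean


def fit_centroids(rows, labels):
    """Fit a nearest-centroid model: one mean feature vector per class."""
    centroids = {}
    for label in set(labels):
        members = [row for row, y in zip(rows, labels) if y == label]
        centroids[label] = [mean(column) for column in zip(*members)]
    return centroids


def predict(centroids, row):
    """Return the label whose centroid is closest to the row."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(row, centroid))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))


def accuracy(centroids, rows, labels):
    """Share of rows for which the predicted label matches the true label."""
    hits = sum(predict(centroids, row) == y for row, y in zip(rows, labels))
    return hits / len(labels)


# Invented records: (age, systolic blood pressure) with a binary risk label.
real_train = [(45, 120), (50, 125), (70, 160), (75, 165)]
real_train_labels = [0, 0, 1, 1]
synthetic_train = [(47, 118), (52, 130), (68, 155), (80, 170)]
synthetic_train_labels = [0, 0, 1, 1]
real_test = [(48, 122), (72, 158), (55, 128), (66, 150)]
real_test_labels = [0, 1, 0, 1]

# Both models are evaluated on the same held-out real data.
acc_real = accuracy(fit_centroids(real_train, real_train_labels),
                    real_test, real_test_labels)
acc_synthetic = accuracy(fit_centroids(synthetic_train, synthetic_train_labels),
                         real_test, real_test_labels)
utility_gap = acc_real - acc_synthetic
print(f"real: {acc_real:.2f}, synthetic: {acc_synthetic:.2f}, gap: {utility_gap:.2f}")
```

A small gap indicates that, for this particular task, the synthetic data is a reasonable stand-in for the real data. The result says nothing about other tasks.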

  3. Privacy

One of the main advantages of synthetic data is that it can protect patient privacy. By creating data that does not correspond to real individuals, synthetic datasets allow researchers to analyze trends and patterns without risking exposure of sensitive personal information.
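A common sanity check on this claim is the distance to closest record: for each synthetic row, measure how far it lies from the nearest real row, and flag rows that are suspiciously close to an actual patient. The sketch below is one possible illustration, not a method attributed to DNV. The records and the threshold are invented, and in practice features would need to be scaled to comparable ranges before distances are meaningful.

```python
import math


def distance_to_closest_record(synthetic_row, real_rows):
    """Euclidean distance from one synthetic row to its nearest real row."""
    return min(math.dist(synthetic_row, real_row) for real_row in real_rows)


def flag_near_copies(synthetic_rows, real_rows, threshold):
    """Return synthetic rows lying closer than `threshold` to any real row."""
    return [
        row for row in synthetic_rows
        if distance_to_closest_record(row, real_rows) < threshold
    ]


# Invented records: (age, systolic blood pressure).
real_rows = [(45, 120), (70, 160)]
synthetic_rows = [(45, 120.5), (58, 140)]

# The threshold is a hypothetical choice for this toy example.
near_copies = flag_near_copies(synthetic_rows, real_rows, threshold=1.0)
print(near_copies)
```

Rows flagged this way are candidates for removal or closer inspection, since they may reveal information about the real individual they nearly duplicate.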

  4. Fairness

It is essential that synthetic data does not reinforce existing biases found in real-world data. Fairness means ensuring that the synthetic datasets represent diverse populations accurately, which helps prevent discrimination in healthcare outcomes when AI systems are deployed. Correcting for biases in the data can improve coverage of groups and minorities that are overlooked or left out.
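A first, coarse check of representation is to compare each group's share in the synthetic dataset with a target share, for example from census or registry figures. The sketch below assumes hypothetical group labels, target shares, and a tolerance chosen for illustration. It checks representation only and does not measure whether a downstream model treats groups fairly.

```python
from collections import Counter


def group_shares(groups):
    """Share of records belonging to each group label."""
    counts = Counter(groups)
    return {group: n / len(groups) for group, n in counts.items()}


def underrepresented_groups(synthetic_groups, target_shares, tolerance=0.05):
    """Groups whose synthetic share falls below target by more than `tolerance`."""
    shares = group_shares(synthetic_groups)
    return sorted(
        group for group, target in target_shares.items()
        if shares.get(group, 0.0) < target - tolerance
    )


# Hypothetical group labels in a synthetic dataset of ten records.
synthetic_groups = ["A"] * 7 + ["B"] * 3
# Hypothetical target shares for the population the data should represent.
target_shares = {"A": 0.5, "B": 0.3, "C": 0.2}

missing = underrepresented_groups(synthetic_groups, target_shares)
print(missing)
```

Groups reported here are candidates for targeted augmentation when the generator is retrained or sampled again.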

  5. Carbon Footprint

The process of generating and using synthetic data should consider its environmental impact. Complex, compute-heavy generative models consume significant resources in terms of both training time and energy. The carbon footprint of a model should be documented when benchmarking methods, to guide the community toward a balanced choice of methods and to support sustainability in healthcare practices.
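A back-of-the-envelope estimate of the footprint of a training run multiplies the energy used by the carbon intensity of the electricity grid. The sketch below assumes a simplified model (average power draw, run time, an optional data-centre overhead factor) and invented input values; the function and parameter names are hypothetical, and measured energy figures should be preferred when available.

```python
def training_emissions_kg(avg_power_watts, hours, grid_intensity_kg_per_kwh, pue=1.0):
    """Estimate CO2-equivalent emissions of a training run in kilograms.

    avg_power_watts: average power draw of the hardware during training
    hours: wall-clock duration of the run
    grid_intensity_kg_per_kwh: kg CO2e emitted per kWh of electricity
    pue: power usage effectiveness, a multiplier for data-centre overhead
    """
    energy_kwh = avg_power_watts / 1000 * hours * pue
    return energy_kwh * grid_intensity_kg_per_kwh


# Invented example values, not measurements of any real model.
emissions = training_emissions_kg(
    avg_power_watts=300,
    hours=10,
    grid_intensity_kg_per_kwh=0.4,
    pue=1.5,
)
print(f"Estimated emissions: {emissions:.2f} kg CO2e")
```

Reporting an estimate like this alongside similarity, utility, privacy, and fairness results makes it easier to compare generative methods on equal terms.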

 
Way forward 

To accelerate the development and uptake of trustworthy AI tools for the benefit of patients, we need to increase transparency in how they are built and used and reduce safety risks. 
 
DNV’s ambition is to support a broad audience of potential users of synthetic data in healthcare and provide a basis for further practical implementations, with a common vocabulary to support cross-discipline communication and build trust in synthetic data specifically – and in AI systems in general.