Can I trust my fake data?
A comprehensive quality assessment framework for synthetic tabular data in healthcare
Ensuring safe adoption of AI tools in healthcare hinges on access to sufficient data for training, testing and validation. Synthetic data has been suggested in response to privacy concerns and regulatory requirements and can be created by training a generator on real data to produce a dataset with similar statistical properties. Proposals of competing metrics with differing taxonomies for quality evaluation have resulted in a complex landscape. Optimising quality entails balancing considerations that make the data fit for use, yet relevant dimensions are left out of existing frameworks.
This paper presents a conceptual framework for quality assurance of synthetic data for AI applications in healthcare that aligns diverging taxonomies, expands on common quality dimensions to include the dimensions of Fairness and Carbon footprint, and proposes stages necessary to support real-life applications. Building trust in synthetic data by increasing transparency and reducing the safety risk will accelerate the development and uptake of trustworthy AI tools for the benefit of patients.