A Comprehensive Review of Diffusion Models for Electronic Health Record Data Generation
Electronic Health Records (EHRs) contain vast amounts of biomedical knowledge, serving as a crucial resource for healthcare data analysis. However, privacy protection and data-sharing restrictions pose significant challenges, hindering the application of machine learning in medical research. To address these limitations, researchers have turned to generative modeling techniques, particularly diffusion models, to synthesize realistic EHR data while preserving privacy. This article provides an in-depth exploration of diffusion models, their evolution, mathematical foundations, and applications in EHR data generation, comparing their advantages and limitations with other generative approaches.
Introduction
EHR data capture patient health information over time, including disease progression, treatment responses, and personal medical histories. These records are invaluable for developing computational methods in dynamic disease treatment, automated diagnosis, and biomedical natural language processing. However, EHRs often contain sensitive patient information, making data sharing and analysis difficult due to privacy concerns. Traditional anonymization techniques are cumbersome, costly, and may distort critical data features, reducing their utility. Moreover, even encrypted data can be vulnerable to privacy attacks.
Synthetic EHR data generation offers a promising solution by creating artificial datasets that mimic real patient records without exposing individual identities. High-quality synthetic EHR data must satisfy two key properties: high fidelity (ensuring the synthetic data performs similarly to real data in downstream tasks) and privacy preservation (preventing leakage of real patient information). While generative adversarial networks (GANs) and autoencoders (AEs) have been widely used for EHR synthesis, they suffer from issues such as mode collapse and training instability. Diffusion models, a newer class of generative models, have demonstrated superior performance in generating high-quality, diverse samples across domains such as text, audio, and computer vision. Their advantages over GANs include stable training, better sample diversity, and faster generation compared to autoregressive models.
The Origin and Development of Diffusion Models
The Proposal of Diffusion Models
Diffusion models were first introduced in 2015 as diffusion probabilistic models (DPMs), inspired by non-equilibrium statistical physics. The core idea involves two processes: a forward process that gradually adds Gaussian noise to data until it becomes random noise, and a reverse process that learns to denoise and reconstruct the original data distribution. Early diffusion models were limited to simple datasets, but subsequent advancements expanded their applicability to complex data.
Evolution of Diffusion Models
Denoising Diffusion Probabilistic Models (DDPM)
In 2020, denoising diffusion probabilistic models (DDPMs) emerged as a breakthrough, establishing diffusion models as a leading approach for image generation. DDPMs use two Markov chains—a forward chain that corrupts data with noise and a reverse chain that reconstructs the data. The reverse chain employs a neural network (typically a U-Net) to predict and remove noise iteratively. Later improvements, such as the denoising diffusion implicit model (DDIM), reduced sampling steps and accelerated generation.
Score-Based Generative Models (SGM)
Score-based generative models (SGMs), introduced in 2019, learn the gradient (score) of the data distribution rather than the distribution itself. By leveraging Langevin dynamics, SGMs generate samples by iteratively refining noise along learned gradients. SGMs and DDPMs were later unified under a stochastic differential equation (SDE) framework, demonstrating their equivalence in certain settings.
Unified Framework
The unification of DDPMs and SGMs under continuous-time diffusion via SDEs provided a comprehensive theoretical foundation. This framework allows flexible sampling techniques, including predictor-corrector methods and probability flow ordinary differential equations (ODEs), enhancing generation quality and efficiency.
Principles and Mathematical Foundations of Diffusion Models
Diffusion models can be categorized into three main formulations: DDPMs, SGMs, and SDE-based models.
Denoising Diffusion Probabilistic Models (DDPM)
Forward Process
The forward process gradually adds Gaussian noise to data over multiple steps, transforming it into a noise distribution. Each step is defined by a noise schedule that controls the rate of corruption.
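The forward corruption admits a closed-form shortcut: rather than noising step by step, x_t can be sampled directly from x_0. A minimal NumPy sketch of this identity follows; the linear beta schedule and the toy eight-dimensional feature vector are illustrative choices, not anything mandated by a particular EHR model.

```python
import numpy as np

def forward_noise(x0, t, betas, rng):
    """Sample x_t ~ q(x_t | x_0) in closed form via the DDPM identity
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps."""
    alpha_bar = np.cumprod(1.0 - betas)[t]   # cumulative signal retained at step t
    eps = rng.standard_normal(x0.shape)      # standard Gaussian noise
    xt = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
    return xt, eps

T = 1000
betas = np.linspace(1e-4, 0.02, T)           # linear noise schedule
rng = np.random.default_rng(0)
x0 = rng.standard_normal(8)                  # toy continuous feature vector
xt, eps = forward_noise(x0, T - 1, betas, rng)
# By the final step alpha_bar is vanishingly small, so x_T is essentially pure noise.
```

This closed form is what makes training efficient: any step t can be simulated in one draw instead of t sequential corruptions.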
Reverse Process
The reverse process learns to denoise data by predicting the noise at each step. A neural network, often a U-Net, is trained to estimate the noise, enabling the reconstruction of clean samples from random noise.
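The reverse-chain update can likewise be written in a few lines. In the sketch below the noise estimate is passed in as an argument: eps_pred stands in for the trained network's output eps_theta(x_t, t), and the zero predictor in the usage lines is a placeholder to exercise the update rule, not a meaningful model.

```python
import numpy as np

def reverse_step(xt, t, betas, eps_pred, rng):
    """One DDPM reverse step: subtract the predicted noise, then (except at
    t = 0) inject fresh Gaussian noise with variance beta_t."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)[t]
    mean = (xt - betas[t] / np.sqrt(1.0 - alpha_bar) * eps_pred) / np.sqrt(alphas[t])
    if t == 0:
        return mean                          # final step is deterministic
    return mean + np.sqrt(betas[t]) * rng.standard_normal(xt.shape)

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)
xt = rng.standard_normal(8)
x_prev = reverse_step(xt, 500, betas, np.zeros_like(xt), rng)
```

Training then reduces to regressing eps_theta(x_t, t) onto the true noise eps with a mean-squared error, which is where the U-Net enters.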
Score-Based Generative Models (SGM)
SGMs estimate the score function (gradient of the log probability density) of the data distribution. By iteratively refining noise along the score function, SGMs generate samples that match the target distribution. Training involves minimizing a weighted score-matching objective, ensuring accurate gradient estimation.
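Given a score estimate, sampling proceeds by Langevin dynamics. The toy sketch below uses the one case where the score is known exactly, a standard normal target with grad log p(x) = -x; in a real SGM this closed form is replaced by a trained score network, and the step size and step count here are illustrative.

```python
import numpy as np

def langevin_sample(score_fn, x_init, step_size=0.01, n_steps=2000, seed=0):
    """Unadjusted Langevin dynamics:
    x <- x + (step/2) * score(x) + sqrt(step) * z, with z ~ N(0, I)."""
    rng = np.random.default_rng(seed)
    x = x_init.copy()
    for _ in range(n_steps):
        z = rng.standard_normal(x.shape)
        x = x + 0.5 * step_size * score_fn(x) + np.sqrt(step_size) * z
    return x

# Standard normal target: the true score is analytic, score(x) = -x.
samples = langevin_sample(lambda x: -x, np.full((5000, 1), 5.0))
# Chains started far from the target (at x = 5) drift toward mean 0, std 1.
```

In practice SGMs anneal this procedure across a sequence of noise levels, since a single score network is inaccurate far from the data manifold.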
Stochastic Differential Equations (SDEs)
SDEs provide a continuous-time framework unifying DDPMs and SGMs. The forward SDE defines the noise corruption process, while the reverse SDE generates samples by reversing the diffusion process. Probability flow ODEs offer deterministic sampling alternatives, improving efficiency.
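A minimal illustration of the probability-flow idea, assuming the variance-preserving SDE dx = -(1/2)beta(t) x dt + sqrt(beta(t)) dW with unit time steps: the matching ODE has drift -(1/2)beta(t)(x + score(x, t)), and because it carries no noise term, the same x_T always maps to the same sample. For unit-Gaussian data the true score is -x, the drift vanishes, and the flow leaves samples unchanged, since N(0, I) is the stationary distribution of this SDE.

```python
import numpy as np

def probability_flow_sample(score_fn, x_T, betas):
    """Deterministic sampling: Euler steps backward along the probability-flow
    ODE dx/dt = -0.5 * beta(t) * (x + score(x, t)). No noise is injected."""
    x = x_T.copy()
    for t in reversed(range(len(betas))):
        drift = -0.5 * betas[t] * (x + score_fn(x, t))
        x = x - drift                        # Euler step backward in time (dt = -1)
    return x

betas = np.linspace(1e-4, 0.02, 100)
x_T = np.random.default_rng(0).standard_normal(8)
out = probability_flow_sample(lambda x, t: -x, x_T, betas)
```

The determinism is what enables exact likelihood computation and the fast ODE solvers mentioned above.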
Applications of Diffusion Models in EHR Data Generation
EHR data generation using diffusion models addresses privacy concerns while enabling data sharing for research. Below, we review key studies in this domain.
MedDiff
MedDiff was the first diffusion model applied to EHR data generation. It uses an improved U-Net architecture to capture feature correlations in patient records. By integrating Anderson acceleration, MedDiff enhances generation speed while maintaining high fidelity. Evaluations on MIMIC-III demonstrated its superiority over GAN-based methods in preserving statistical properties.
EHRDiff
EHRDiff employs an SGM-based approach, combining deterministic ODE solvers with adaptive noise decoupling. It outperforms GAN models in generating realistic EHR data on the MIMIC-III dataset. However, its computational cost and limited generalization warrant further investigation.
ScoEHR
ScoEHR integrates autoencoders with continuous-time diffusion models to handle both discrete and continuous EHR features. It excels in clinical validity, as verified by physician evaluations. Future work may explore multimodal data integration and enhanced privacy mechanisms.
TabDDPM
TabDDPM generates mixed-type EHR data (continuous and categorical) using Gaussian and multinomial diffusion processes. While it achieves high data utility, its privacy risks are correspondingly higher: samples that closely mirror real records are easier to link back to individual patients.
DPM for Longitudinal Data
This model generates longitudinal EHR data, capturing temporal dependencies in clinical variables. It shows promise in applications like reinforcement learning for treatment optimization but requires further validation on diverse datasets.
TIMEDIFF
TIMEDIFF introduces a bidirectional recurrent neural network (BRNN) architecture for time-series EHR generation. Its hybrid diffusion approach supports both continuous and discrete variables, achieving state-of-the-art performance in utility and privacy metrics.
Challenges and Future Directions
Despite progress, several challenges remain:
- Standardized Evaluation Metrics – Current evaluation methods vary, hindering fair comparisons. Future work should establish unified benchmarks for fidelity, privacy, and clinical relevance.
- Privacy-Utility Trade-off – High-quality synthetic data may increase re-identification risks. Techniques like differential privacy could balance this trade-off.
- Multimodal Data Generation – Most models focus on structured data; extending diffusion models to text, images, and time-series EHRs would enhance utility.
- Downstream Task Generalization – Synthetic data should improve predictive models in real-world applications, necessitating rigorous validation.
Conclusion
Diffusion models represent a transformative approach to EHR data generation, offering advantages in stability, diversity, and privacy preservation over traditional GANs and VAEs. While challenges remain in evaluation, generalization, and multimodal synthesis, ongoing advancements promise to unlock new possibilities for synthetic EHRs in medical research. Future efforts should prioritize standardized benchmarks, improved privacy guarantees, and broader clinical applicability.
DOI: 10.19734/j.issn.1001-3695.2024.04.0122