SiloFuse: Cross-silo Synthetic Data Generation with Latent Tabular Diffusion Models
CoRR (2024)
Abstract
Synthetic tabular data is crucial for sharing and augmenting data across
silos, especially for enterprises with proprietary data. However, existing
synthesizers are designed for centrally stored data. Hence, they struggle with
real-world scenarios where features are distributed across multiple silos,
necessitating on-premise data storage. We introduce SiloFuse, a novel
generative framework for high-quality synthesis from cross-silo tabular data.
To ensure privacy, SiloFuse utilizes a distributed latent tabular diffusion
architecture. Through autoencoders, latent representations are learned for each
client's features, masking their actual values. We employ stacked distributed
training to improve communication efficiency, reducing the number of rounds to
a single step. Under SiloFuse, we prove the impossibility of data
reconstruction for vertically partitioned synthesis and quantify privacy risks
through three attacks using our benchmark framework. Experimental results on
nine datasets showcase SiloFuse's competence against centralized
diffusion-based synthesizers. Notably, SiloFuse outperforms GANs by 43.8 and
29.8 percentage points in resemblance and utility, respectively. Communication
experiments show that stacked training incurs a fixed cost, whereas the cost of
end-to-end training grows with the number of training iterations.
Additionally, SiloFuse proves robust to feature permutations and varying
numbers of clients.
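
The pipeline described in the abstract (local latent encoding per client, a single upload of latents, a generative model over the joint latent space, local decoding) can be illustrated with a minimal sketch. It is not the authors' implementation: a PCA-based linear autoencoder stands in for each client's trained autoencoder, a fitted Gaussian stands in for the latent diffusion model, and all names, sizes, and the cost formulas (`fit_linear_autoencoder`, `stacked_cost`, `end_to_end_cost`, the factor of 2 for activations plus gradients) are assumptions made for illustration.

```python
import numpy as np

def fit_linear_autoencoder(x, latent_dim):
    """Stand-in for a client's autoencoder: PCA gives a linear encoder/decoder."""
    mean = x.mean(axis=0)
    _, _, vt = np.linalg.svd(x - mean, full_matrices=False)
    return mean, vt[:latent_dim]

def encode(x, mean, components):
    # Project the centered features onto the latent axes.
    return (x - mean) @ components.T

def decode(z, mean, components):
    # Map latents back to the client's own feature space.
    return z @ components + mean

def stacked_cost(n_rows, latent_dims):
    """Assumed cost model: latents are uploaded once, so cost is fixed."""
    return n_rows * sum(latent_dims)

def end_to_end_cost(batch_size, latent_dims, iterations):
    """Assumed cost model: activations and gradients are exchanged every iteration."""
    return 2 * batch_size * sum(latent_dims) * iterations

rng = np.random.default_rng(0)
n_rows = 200
# Hypothetical vertical partition: same rows, disjoint feature columns per client.
client_features = [rng.normal(size=(n_rows, 5)), rng.normal(size=(n_rows, 3))]
latent_dims = [2, 2]

# Step 1: each client fits its autoencoder locally; raw features never leave the silo.
models = [fit_linear_autoencoder(x, d) for x, d in zip(client_features, latent_dims)]

# Step 2: a single communication round sends only the latents to the coordinator.
latents = [encode(x, *m) for x, m in zip(client_features, models)]
joint_latent = np.concatenate(latents, axis=1)

# Step 3: the coordinator fits a generative model on the joint latents.
# A Gaussian is used here purely as a placeholder for the diffusion model.
mu = joint_latent.mean(axis=0)
cov = np.cov(joint_latent, rowvar=False)
synthetic_latent = rng.multivariate_normal(mu, cov, size=50)

# Step 4: synthetic latents are split per client and decoded locally.
splits = np.split(synthetic_latent, np.cumsum(latent_dims)[:-1], axis=1)
synthetic_parts = [decode(z, *m) for z, m in zip(splits, models)]
```

Under these assumed cost formulas, `stacked_cost(200, [2, 2])` does not depend on the number of training iterations, while `end_to_end_cost(64, [2, 2], iterations)` grows linearly with it, which mirrors the fixed-versus-growing contrast the abstract reports.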