Synthetic Data — Anonymisation Groundhog Day (arXiv:2011.07018v3 [cs.LG])

Synthetic data has been advertised as a silver-bullet solution to
privacy-preserving data publishing that addresses the shortcomings of
traditional anonymisation techniques. The promise is that synthetic data drawn
from generative models preserves the statistical properties of the original
dataset but, at the same time, provides perfect protection against privacy
attacks. In this work, we present the first quantitative evaluation of the
privacy gain of synthetic data publishing and compare it to that of previous
anonymisation techniques.
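
As a rough illustration of how such a comparison can be quantified (a minimal
sketch under our own assumptions, not the paper's exact framework), the privacy
gain of a release can be measured as the drop in a membership-inference
adversary's advantage when the attacker sees only the published data rather
than the raw data:

import numpy as np

def attack_advantage(guesses: np.ndarray, membership: np.ndarray) -> float:
    # Adversary's advantage: true-positive rate minus false-positive rate
    # over repeated membership-inference experiments against a target record.
    tpr = guesses[membership == 1].mean()
    fpr = guesses[membership == 0].mean()
    return float(tpr - fpr)

def privacy_gain(adv_on_raw: float, adv_on_release: float) -> float:
    # Gain from publishing the processed (e.g. synthetic) data instead of the
    # raw data; a value near zero means publishing added little protection.
    return adv_on_raw - adv_on_release

# Hypothetical attack outcomes: 1 = adversary guesses "target was in the data".
membership = np.array([1, 1, 1, 1, 0, 0, 0, 0])
guesses_raw = np.array([1, 1, 1, 1, 0, 1, 0, 0])    # attack given raw data
guesses_synth = np.array([1, 1, 0, 1, 0, 1, 1, 0])  # attack given synthetic data
print(privacy_gain(attack_advantage(guesses_raw, membership),
                   attack_advantage(guesses_synth, membership)))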

Our evaluation of a wide range of state-of-the-art generative models
demonstrates that synthetic data either does not prevent inference attacks or
does not retain data utility. In other words, we empirically show that
synthetic data suffers from the same limitations as traditional anonymisation
techniques.

Furthermore, we find that, in contrast to traditional anonymisation, the
privacy-utility tradeoff of synthetic data publishing is hard to predict.
Because it is impossible to anticipate which signals a synthetic dataset will
preserve and which it will lose, synthetic data leads to a highly variable
privacy gain and an unpredictable loss of utility. In summary, we find that
synthetic data is far from the holy grail of privacy-preserving data
publishing.
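
To make that variability concrete, the following toy, purely illustrative
sketch (our own placeholder generator, not any model evaluated in the paper)
shows how repeated draws from the same fitted model can preserve a target
statistic in one run and lose it in another, so utility loss cannot be
predicted ahead of publication:

import numpy as np

def utility_loss(raw: np.ndarray, synthetic: np.ndarray) -> float:
    # Utility proxy: mean absolute error between the pairwise correlation
    # matrices of the raw and the synthetic data.
    return float(np.abs(np.corrcoef(raw, rowvar=False)
                        - np.corrcoef(synthetic, rowvar=False)).mean())

def sample_synthetic(raw: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    # Placeholder "generative model": a Gaussian fit to the raw data with a
    # random amount of extra noise, standing in for run-to-run training variance.
    mean, cov = raw.mean(axis=0), np.cov(raw, rowvar=False)
    cov = cov + rng.uniform(0.0, 0.5) * np.eye(cov.shape[0])
    return rng.multivariate_normal(mean, cov, size=len(raw))

rng = np.random.default_rng(0)
raw = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], size=1000)
losses = [utility_loss(raw, sample_synthetic(raw, rng)) for _ in range(5)]
print([round(loss, 3) for loss in losses])  # spread across runs of the same model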