Synthetic Data for AI Gets a Little More Private


Today, remarkable progress in AI and machine learning (ML) has become commonplace. Every week brings headlines in areas ranging from natural language processing to music synthesis. Between the occasionally breathless declarations of breakthroughs, however, it is worth pausing now and then to ask how the field of AI can move so quickly.

The root causes are many and varied. A key driver has been the rise of specialist AI-oriented software frameworks such as PyTorch and TensorFlow, which take much of the drudgery out of AI software development. Another important factor has been the willingness of many AI researchers to share breakthrough algorithms openly via rapid publication channels such as the arXiv repository.

Another critical factor is the widespread availability of datasets containing the raw material needed to train deep ML algorithms. The ‘garbage-in, garbage-out’ truism has never been more apt: the performance of an ML algorithm depends heavily on the quality and size of the dataset on which it is trained. A good example is the ImageNet repository of labelled images, which enabled the then astonishing classification performance of early convolutional neural networks. No longer need an AI system confuse a duck and a medium latte. Similar AI breakthroughs, from language translation to remote-sensing interpretation, have often been rooted in public data collections. All this makes the creation and sharing of well-curated datasets of the utmost importance.

And yet there are highly important, and economically vast, sectors where data sharing or publication faces fundamental obstacles, or in some cases is just plain unlawful. Finance is one such sector. Another is healthcare, where regulations such as the U.S. Department of Health and Human Services (HHS) Standards for Privacy of Individually Identifiable Health Information, better known as the Privacy Rule of the Health Insurance Portability and Accountability Act (HIPAA), prevent unauthorized disclosure.

So, what is the poor data scientist to do? In the past, there have been attempts to share data via obfuscation: removing the personally identifiable attributes of individuals’ records in a database while leaving enough information to model the key statistics of the original data. This approach has known and serious flaws, however: the attributes that remain can often be cross-referenced with other, publicly available, datasets to re-identify the original records. The well-known re-identification of individuals in an obfuscated Netflix ratings dataset, by linking it against publicly posted ratings, is just one such example.
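To see why, consider a toy and entirely hypothetical illustration in Python. Even with names stripped from a table, the quasi-identifiers left behind for statistical fidelity, such as ZIP code, birth date, and sex, can often be joined against some other public dataset to put names back onto sensitive records. The data and column names below are invented purely for illustration.

```python
import pandas as pd

# A "de-identified" medical table: names removed, but quasi-identifiers kept
# so the data remains statistically useful. (All values here are made up.)
deidentified = pd.DataFrame({
    "zip": ["02139", "10027", "94110"],
    "birth_date": ["1984-07-31", "1990-01-15", "1978-11-02"],
    "sex": ["F", "M", "F"],
    "diagnosis": ["diabetes", "asthma", "hypertension"],  # sensitive attribute
})

# A public record, e.g. a voter roll, carrying the same quasi-identifiers
# alongside names.
public_roll = pd.DataFrame({
    "name": ["A. Lovelace", "G. Hopper", "C. Babbage"],
    "zip": ["02139", "94110", "10027"],
    "birth_date": ["1984-07-31", "1978-11-02", "1990-01-15"],
    "sex": ["F", "F", "M"],
})

# A single join on the quasi-identifiers re-attaches names to diagnoses.
reidentified = deidentified.merge(public_roll, on=["zip", "birth_date", "sex"])
print(reidentified[["name", "diagnosis"]])
```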

Researchers therefore turned to more subtle methods. Instead of simply obscuring the data, the next generation of techniques used deep generative models such as the variational autoencoder (VAE) or the generative adversarial network (GAN) to learn the statistics of a dataset and then generate entirely new records, none of which corresponded to any individual record in the original data. The resulting datasets were entirely synthetic. Unfortunately, such methods, while capable of generating massive datasets, had the habit of occasionally burping out records that were very close indeed, and sometimes even identical, to records from the original (and private) data. New approaches to data synthesis were needed to tackle this important problem.
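For a flavour of how such a generator works, here is a minimal sketch in PyTorch of a variational autoencoder over a table of numeric features. The layer sizes and the ten-column table are placeholders rather than a recommended design: the encoder maps each row to a latent distribution, the decoder maps latent samples back to rows, and, once trained, sampling from the latent prior produces brand-new synthetic rows.

```python
import torch
import torch.nn as nn

class TabularVAE(nn.Module):
    """Minimal variational autoencoder for a table of numeric features."""
    def __init__(self, n_features: int, latent_dim: int = 8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU())
        self.to_mu = nn.Linear(64, latent_dim)
        self.to_logvar = nn.Linear(64, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, n_features)
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return self.decoder(z), mu, logvar

def vae_loss(x, recon, mu, logvar):
    recon_err = ((recon - x) ** 2).sum(dim=1).mean()                       # fit the data
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1).mean()  # regularize latents
    return recon_err + kl

# After training on the private table (loop omitted), sampling the latent prior
# yields entirely new rows; none is copied from an original record, although
# memorization can still leak individual rows without further safeguards.
model = TabularVAE(n_features=10)  # 10 numeric columns, purely illustrative
with torch.no_grad():
    synthetic_rows = model.decoder(torch.randn(1000, 8))
```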

Fortunately, new tools are beginning to emerge. First, the mathematics of privacy is being put on a firm foundation. Differential privacy, one such concept, bounds how much the output of a synthetic data generation algorithm can reveal about any individual record in its training data. With that measure in hand, a well-designed balance can often be struck between the ability of an attacker to extract personal information from a synthetic dataset and the statistical fidelity of that dataset to the original, private data. Privacy-aware training methods are also emerging, most notably differentially private stochastic gradient descent (DP-SGD) and its successors. DP-SGD clips each training example’s contribution to the gradient and adds calibrated noise at every step, which sharply reduces the likelihood of the resulting model simply memorizing original records. And new generative AI methods are being employed to produce the synthetic data itself, including the diffusion models that so effectively drive Stable Diffusion and the transformer architecture at the heart of natural language tools such as ChatGPT.
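To make the DP-SGD idea concrete, here is a rough sketch of a single training step in PyTorch: each example’s gradient is clipped to a fixed norm, and Gaussian noise is added before the parameters are updated. The function name and hyperparameter values are illustrative only; in practice one would normally reach for a vetted implementation such as Opacus, together with a privacy accountant that tracks the cumulative privacy budget.

```python
import torch

def dp_sgd_step(model, loss_fn, batch_x, batch_y, optimizer,
                clip_norm=1.0, noise_multiplier=1.1):
    """One illustrative DP-SGD step: clip per-example gradients, add noise, update."""
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]

    # Per-example gradients, each clipped so no single record dominates the update.
    for x, y in zip(batch_x, batch_y):
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        grads = torch.autograd.grad(loss, params)
        total_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = (clip_norm / (total_norm + 1e-12)).clamp(max=1.0)
        for s, g in zip(summed, grads):
            s += g * scale

    # Calibrated Gaussian noise masks any individual contribution; then average.
    for p, s in zip(params, summed):
        noise = torch.randn_like(s) * noise_multiplier * clip_norm
        p.grad = (s + noise) / len(batch_x)
    optimizer.step()
```

Because every record’s influence on each update is capped and then partially drowned in noise, the trained model is provably limited in what it can reveal about any one individual; the clipping norm and noise multiplier together determine the strength of that guarantee.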

Accordingly, we have a clear and present need – the generation of privacy-managed synthetic data – for which effective, scalable, and provably private methods might just be coming into view. And for the right team, with the right experience, there is enormous opportunity.