Synthetic data, defined as artificial data having the same statistical properties as real data, has gained much attention recently as a privacy-enhancing technology. If done properly, the artificial data acts as a proxy for the real data, is anonymous and de-identified, and cannot be connected back to the original data. Not only can synthetic data provide badly needed access to the data that fuels research, it also offers a potential remedy to privacy concerns.

Synthetic data is created from original individual data. A synthetic data engine processes this “real” data with algorithms that learn its correlations, trends, and individual behaviors. As the engine learns how individuals behave, it generates new artificial individuals with the same correlations, patterns, and trends as the original data set, but with no connection to actual individuals. The result, if done properly, is synthetic data that cannot be re-identified.
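
To make the generation step concrete, here is a minimal Python sketch. It is not any vendor's actual engine. It assumes a hypothetical two-column data set (age and annual spend) and a deliberately simple model, a bivariate normal distribution. The function names fit_gaussian and sample_synthetic, the simulated stand-in for the "real" data, and all of the numbers are illustrative assumptions; production engines typically rely on far richer models.

```python
import math
import random

def fit_gaussian(rows):
    """Learn the means, standard deviations, and correlation of two numeric columns."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in rows) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for _, y in rows) / (n - 1)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in rows) / (n - 1)
    std_x, std_y = math.sqrt(var_x), math.sqrt(var_y)
    return {"mean": (mean_x, mean_y), "std": (std_x, std_y), "corr": cov / (std_x * std_y)}

def sample_synthetic(model, count, rng):
    """Draw artificial rows from the fitted parameters only; no real row is consulted."""
    (mean_x, mean_y), (std_x, std_y), rho = model["mean"], model["std"], model["corr"]
    residual = math.sqrt(max(0.0, 1.0 - rho ** 2))
    rows = []
    for _ in range(count):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        # Mixing z1 into the second column reproduces the learned correlation.
        rows.append((mean_x + std_x * z1, mean_y + std_y * (rho * z1 + residual * z2)))
    return rows

# Hypothetical stand-in for original data: (age, annual spend) per customer.
rng = random.Random(0)
ages = [rng.uniform(20, 70) for _ in range(500)]
real = [(age, 20 + 1.5 * age + rng.gauss(0, 10)) for age in ages]

model = fit_gaussian(real)
synthetic = sample_synthetic(model, 1000, rng)
refit = fit_gaussian(synthetic)

# The two correlations should be close even though no synthetic row copies a real one.
print(round(model["corr"], 2), round(refit["corr"], 2))
```

In this sketch the sampler sees only the fitted parameters, never the individual rows, which is the sense in which the artificial individuals keep the statistical patterns without a record-level link to the original data.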

Original data carries use limitations. Original data is considered personal information if it identifies, relates to, is capable of being associated with, or could reasonably be linked with an individual. Original data carries legal obligations to obtain individual consent, implement security controls, and protect privacy rights. Once the synthetic data is produced, however, even broad definitions of personal information seem to exclude it, as it cannot reasonably be said to be linked with a particular individual. Thus, the use of synthetic data may provide a viable option, with less privacy risk, for entities operating in an over-regulated privacy industry.

Synthetic data, however, is not without limitations, and several factors may prevent it from being truly anonymized. Consider outlier information. If the original data contains unique outliers that the synthetic data engine captures, the synthetic data will reproduce those outliers and, depending on how unique they are, could identify an individual. In addition, agreements between the business and the vendors who generate the synthetic data should contain strong privacy provisions. Those provisions should address the appropriate care and handling of the original data and should prohibit re-identification, so as not to defeat the benefits of synthetic data.
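
The outlier concern can be illustrated with a short Python sketch, offered as one possible screening check rather than an established standard. It assumes a single numeric column (hypothetical salaries), treats a value as an outlier when its z-score exceeds 3, and flags any real outlier that a synthetic value reproduces within a chosen tolerance. The function names, the threshold, the tolerance, and the data are all assumptions made for illustration.

```python
import statistics

def outlier_indices(values, z_threshold=3.0):
    """Return positions of values whose z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    std = statistics.stdev(values)
    return [i for i, v in enumerate(values) if abs(v - mean) / std > z_threshold]

def leaked_outliers(real, synthetic, tolerance, z_threshold=3.0):
    """Return real outlier values that some synthetic value matches within tolerance."""
    flagged = []
    for i in outlier_indices(real, z_threshold):
        if any(abs(s - real[i]) <= tolerance for s in synthetic):
            flagged.append(real[i])
    return flagged

# Hypothetical original salaries: twenty ordinary values and one unique outlier.
real = [48_000 + 500 * i for i in range(20)] + [5_000_000]

# Two hypothetical synthetic outputs: one echoes the outlier, one does not.
synthetic_leaky = [49_500, 52_000, 4_990_000]
synthetic_safe = [49_500, 52_000, 61_000]

print(leaked_outliers(real, synthetic_leaky, tolerance=50_000))  # [5000000]
print(leaked_outliers(real, synthetic_safe, tolerance=50_000))   # []
```

A flagged value does not by itself prove that an individual can be identified, but it marks the records a business or its vendor would want to review before releasing the synthetic data.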

Synthetic data is a relatively new technology. When done right, with no one-to-one correspondence between synthetic records and original records, it appears to provide an avenue for companies to utilize, share, and perhaps monetize data.