The promise and pitfalls of synthetic data

Computer scientists and statisticians are starting to create datasets that mimic important properties of the real thing, which could help ease privacy concerns.

The vast amounts of data collected by governments, health-care organizations, financial institutions and other groups offer countless opportunities for insights. If companies and researchers could share and work with this data, they could track the emergence of rare diseases, prevent fraud and measure the success of social policies, to name a few examples.

“Society as a whole is moving into what we call Big Data,” said Dean Eurich, a professor in the school of public health at the University of Alberta. “As you start to pull all this big data together and start to use it in research, the owners of the data get really anxious around identification and privacy concerns.”

Dr. Eurich and others have a solution: synthetic data. Using algorithms and machine learning, computer scientists and statisticians can create a mirror dataset that mimics important properties of the original. (This approach can even generate fake images. In fact, deepfake videos and pictures are a type of synthetic data.)

While identifying information is routinely removed from real datasets, synthetic data goes one step further. “You create a model for the data based on correlations, and then you create people from that model,” said Anne-Sophie Charest, associate professor of mathematics and statistics at Université Laval. That way, it’s harder to re-identify someone by triangulating different pieces of anonymous information, such as their job or health status.
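The two steps Dr. Charest describes, fitting a model and then sampling from it, can be sketched in a few lines of Python. This is a minimal illustration only: it assumes two numeric columns and a simple bivariate normal model, which is not necessarily what the researchers quoted here use. The records, their meaning and the function names are all invented for the example.

```python
import math
import random


def fit_model(rows):
    """Summarize two numeric columns by their means, standard deviations and correlation."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in rows) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for _, y in rows) / (n - 1)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in rows) / (n - 1)
    sd_x, sd_y = math.sqrt(var_x), math.sqrt(var_y)
    return mean_x, mean_y, sd_x, sd_y, cov / (sd_x * sd_y)


def sample_people(model, count, rng):
    """Draw synthetic (x, y) records from a bivariate normal with the fitted parameters."""
    mean_x, mean_y, sd_x, sd_y, rho = model
    # Guard against tiny negative values caused by floating-point rounding.
    residual = math.sqrt(max(0.0, 1 - rho ** 2))
    people = []
    for _ in range(count):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mean_x + sd_x * z1
        y = mean_y + sd_y * (rho * z1 + residual * z2)
        people.append((x, y))
    return people


# Made-up (age, systolic blood pressure) records, for illustration only.
real_rows = [(34, 118), (45, 125), (52, 131), (61, 140),
             (29, 112), (70, 148), (48, 127), (57, 135)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
model = fit_model(real_rows)
synthetic_rows = sample_people(model, 5000, rng)
synthetic_model = fit_model(synthetic_rows)
```

None of the synthetic records corresponds to a real person, yet the fitted summary of the synthetic rows (`synthetic_model`) should land close to that of the originals (`model`).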

Possibilities for research, teaching

Governments and companies are currently bound by privacy laws regarding how they collect and share data. (Those rules may become increasingly strict under federal Bill C-11, the Digital Charter Implementation Act, which was tabled in December 2020 but has not yet been passed into law.)

As a result, university researchers often must apply for permission to access datasets. “This is a years-long process. In the end you can find out the data is not what you need,” said Khaled El Emam, Canada Research Chair in medical artificial intelligence at the University of Ottawa. That process may be too long for, say, a master’s student. (When researchers access materials via locations such as a Statistics Canada Research Data Centre, they even have to leave laptops and cellphones at the door.)

“As a researcher, I can access data through the various agreements that are in place for universities to work with governments and other agencies,” said Dr. Eurich. “But that’s where the relationship ends, it ends with me.” Researchers cannot share data with students or collaborators outside their university, unless they’re part of the project. Private organizations also collect data internally but it “cannot be shipped out, even to trusted collaborators,” said Raymond Ng, professor of computer science at the University of British Columbia.

These limitations hold back collaboration and research. “One of the general benefits of sharing data is simply making the volume larger, so you get a sample-size benefit,” said Dr. Ng. He noted that a lot of health research now focuses on rare diseases, but a single province or hospital often cannot collect enough material on such conditions to generate useful results.

If universities could better link up with private companies around sharing data and research expertise, it could become a lucrative revenue stream for schools and spin-off companies. “Knowing some things could save a lot of money, from a company’s perspective,” said Dr. Eurich. Banks, for instance, want broad fraud metrics while pharmaceutical companies want to better understand the market potential for future drugs.

There are also potential benefits for teaching, which often relies on well-worn datasets. “All the data we use to train epidemiologists, data scientists and computer scientists has been massaged over, cleaned up, made into a perfect dataset for them,” said Dr. Eurich. “They leave the university, especially as undergrads, never having seen a real-life dataset.”

Computing limitations, security concerns

But while synthetic data holds a lot of promise, it’s still in the development stages and there are drawbacks.

For starters, someone has to custom-build an algorithm to generate each dataset. As Dr. Charest put it, “you don’t have a machine where you can just throw in the data and it spits out synthetic data.”

The process burns through computing power. For one project, Dr. Ng is using 200 images of tumours to generate 200 images of synthetic tumours that cannot be traced back to their hosts. That took 10 hours of computing time, and it’s just one slice of a larger project. “This is very, very computationally intensive with today’s technology,” Dr. Ng said. When that work is done, someone has to check whether the synthetic dataset can reliably answer research questions.
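One simple way to picture that final check is to run the same analysis on the real and the synthetic data and compare the answers. The sketch below does this for a least-squares slope. It is an assumption-laden illustration, not a description of Dr. Ng's validation process: the numbers are made up, and the five per cent tolerance is a hypothetical threshold.

```python
def least_squares_slope(rows):
    """Slope of the ordinary least-squares line of y on x."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    return sxy / sxx


def relative_gap(real_value, synthetic_value):
    """Relative difference between the real and the synthetic answer."""
    return abs(synthetic_value - real_value) / abs(real_value)


# Made-up (dose, response) pairs, for illustration only.
real_rows = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 8.1), (5, 9.8)]
synthetic_rows = [(1, 2.3), (2, 4.1), (3, 5.8), (4, 8.0), (5, 10.1)]

gap = relative_gap(least_squares_slope(real_rows),
                   least_squares_slope(synthetic_rows))
acceptable = gap < 0.05  # hypothetical tolerance
```

In practice, a check like this would be repeated for each analysis the synthetic dataset is meant to support.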

What’s more, synthetic data can still lead to privacy breaches – a hacker could mine the set and correlate information with other sources. Dr. Charest said that differential privacy, which uses a mathematical formula to quantify this risk, is becoming a popular way to evaluate synthetic and other datasets.
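In its standard formulation, differential privacy uses a parameter, usually called epsilon, to limit how much any one person's record can change the probability of a given output. A common way to meet that limit for a simple count is the Laplace mechanism, sketched below. This illustrates the general idea only and is not the method of any researcher quoted here. The function name, the records and the epsilon value are assumptions.

```python
import random


def private_count(records, predicate, epsilon, rng):
    """Return a count with Laplace noise of scale 1/epsilon added.

    Adding or removing one person changes a count by at most 1
    (sensitivity 1), so this noise scale gives epsilon-differential privacy.
    """
    true_count = sum(1 for record in records if predicate(record))
    # The difference of two independent exponential draws with rate epsilon
    # follows a Laplace distribution with scale 1/epsilon.
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return true_count + noise


# Made-up records: (patient id, has an opioid prescription).
records = [(i, i % 4 == 0) for i in range(1000)]

rng = random.Random(42)  # fixed seed so the sketch is repeatable
noisy = private_count(records, lambda record: record[1], epsilon=0.5, rng=rng)
```

A smaller epsilon adds more noise, which strengthens the privacy guarantee but makes the released count less accurate.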

Dr. Eurich is working with a number of organizations in Alberta to create synthetic datasets based on health records in the province, and his research team has already created a subset based on opioid prescriptions. He said many agencies around the world, including Health Canada, try to keep the risk of a privacy breach below 10 per cent. “The analysis we’ve done with our data is coming in at around three per cent,” he said.

Ready for the mainstream?

Dr. El Emam said many groups have been able to produce small, simple synthetic datasets so far. However, very large sets or those that are complex – containing not just numbers or facts but images, for instance – remain a challenge. So are datasets that include rare events. (Think of a heart monitor that can produce months of similar readouts and then, in some patients, a cardiac episode.) Data that includes long sequences, such as DNA, is also difficult. “That will be the next frontier,” said Dr. El Emam, especially since creating synthetic datasets related to DNA could be hugely valuable.

While researchers grapple with how to create foolproof algorithms, the concept of synthetic data is gaining profile. Most notably, the U.S. Census Bureau will release one component of the 2020 census in synthetic form. Dr. Eurich said Statistics Canada is interested in synthetic data too.

While it might not be ready for the mainstream just yet, expect that it will be in the coming years. “I think within the next decade that synthetic data will make up the bulk of data,” said Dr. Eurich.

COMMENTS

One comment

Piers Nash / December 15, 2021 at 11:28

Synthetic data is another obfuscation technique. The issue is that anything that obscures data reduces the utility of that data. The more closely the synthetic data mimics the real data, the more likely it is that individuals can be re-identified. Rare events are easily missed or misunderstood with synthetic data.

In addition, assuming synthetic data is valuable, transfer of the data means transferring that value, depriving the data creator and the subjects of a share in that value. Imagine an amazing synthetic data set that allows billions in profit. Should the hospitals, health systems and even patients/subjects not share this value?

There are better options in development. Allow use of data without transfer. Allow rental of data without transfer. Use privacy and intellectual privacy preserving double-blind compute environments that never need data to be copied, moved or lost/exposed.

This fits the rubric of ethical data sharing. I would love to explore this topic with anyone who is interested.