
The promise and pitfalls of synthetic data

Computer scientists and statisticians are starting to create datasets that mimic important properties of the real thing, which could help ease privacy concerns.

Diane Peters

December 13, 2021


The vast amounts of data collected by governments, health-care organizations, financial institutions and other groups offer countless opportunities for insights. If companies and researchers could share and work with this data, they could track the emergence of rare diseases, prevent fraud and measure the success of social policies, to name a few examples.

“Society as a whole is moving into what we call Big Data,” said Dean Eurich, a professor in the school of public health at the University of Alberta. “As you start to pull all this big data together and start to use it in research, the owners of the data get really anxious around identification and privacy concerns.”

Dr. Eurich and others have a solution: synthetic data. Using algorithms and machine learning, computer scientists and statisticians can create a mirror dataset that mimics important properties of the original. (This approach can even generate fake images. In fact, deepfake videos and pictures are a type of synthetic data.)

While identifying information is routinely removed from real datasets, synthetic data goes one step further. “You create a model for the data based on correlations, and then you create people from that model,” said Anne-Sophie Charest, associate professor of mathematics and statistics at Université Laval. That way it’s harder to re-identify someone by triangulating different pieces of anonymous information, such as their job or health status.
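To make the idea of "a model based on correlations" concrete, here is a minimal sketch in Python. It is an illustration only, not the method used by any researcher quoted here: it assumes the data is purely numeric and roughly bell-shaped, uses a simple multivariate normal model, and the toy "age" and "blood pressure" columns are invented for the example. Real projects typically need far more elaborate, custom-built models.

```python
import numpy as np


def make_toy_real_data(n_rows=5000, seed=1):
    """Build a hypothetical 'real' dataset: age and blood pressure, correlated."""
    rng = np.random.default_rng(seed)
    age = rng.normal(50, 12, n_rows)
    blood_pressure = 90 + 0.6 * age + rng.normal(0, 8, n_rows)
    return np.column_stack([age, blood_pressure])


def fit_gaussian_model(real):
    """Step 1: summarize the real records by their means and covariances."""
    mean = real.mean(axis=0)
    cov = np.cov(real, rowvar=False)
    return mean, cov


def sample_synthetic(mean, cov, n_rows, seed=0):
    """Step 2: draw brand-new records from the fitted model."""
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(mean, cov, size=n_rows)


real = make_toy_real_data()
mean, cov = fit_gaussian_model(real)
synthetic = sample_synthetic(mean, cov, n_rows=5000)

# No synthetic row is copied from a real person, but the overall
# relationship between the two columns should be close.
real_corr = np.corrcoef(real, rowvar=False)[0, 1]
synth_corr = np.corrcoef(synthetic, rowvar=False)[0, 1]
print(f"correlation in real data:      {real_corr:.3f}")
print(f"correlation in synthetic data: {synth_corr:.3f}")
```

The synthetic rows are generated from the model rather than from any individual record, which is what separates this approach from simply stripping names out of the original file.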

Possibilities for research, teaching

Governments and companies are currently bound by privacy laws regarding how they collect and share data. (Those rules may become increasingly strict under federal Bill C-11, the Digital Charter Implementation Act, which was tabled in December 2020 but has not yet been passed into law.)

As a result, university researchers often must apply for permission to access datasets. “This is a years-long process. In the end you can find out the data is not what you need,” said Khaled El Emam, Canada Research Chair in medical artificial intelligence at the University of Ottawa. That process may be too long for, say, a master’s student. (When researchers access materials via locations such as a Statistics Canada Research Data Centre, they even have to leave laptops and cellphones at the door.)

“As a researcher, I can access data through the various agreements that are in place for universities to work with governments and other agencies,” said Dr. Eurich. “But that’s where the relationship ends, it ends with me.” Researchers cannot share data with students or collaborators outside their university, unless they’re part of the project. Private organizations also collect data internally but it “cannot be shipped out, even to trusted collaborators,” said Raymond Ng, professor of computer science at the University of British Columbia.

These limitations hold back collaboration and research. “One of the general benefits of sharing data is simply making the volume larger, so you get a sample-size benefit,” said Dr. Ng. He noted that a lot of health research now focuses on rare diseases, but a single province or hospital often cannot collect enough material on such conditions to generate useful results.

If universities could better link up with private companies around sharing data and research expertise, it could become a lucrative revenue stream for schools and spin-off companies. “Knowing some things could save a lot of money, from a company’s perspective,” said Dr. Eurich. Banks, for instance, want broad fraud metrics while pharmaceutical companies want to better understand the market potential for future drugs.

There are also potential benefits for teaching, which often relies on well-worn datasets. “All the data we use to train epidemiologists, data scientists and computer scientists has been massaged over, cleaned up, made into a perfect dataset for them,” said Dr. Eurich. “They leave the university, especially as undergrads, never having seen a real-life dataset.”

Computing limitations, security concerns

But while synthetic data holds a lot of promise, it’s still in the development stages and there are drawbacks.

For starters, someone has to custom-build an algorithm in order to generate the dataset. As Dr. Charest put it, “you don’t have a machine where you can just throw in the data and it spits out synthetic data.”

The process burns through computing power. For one project, Dr. Ng is taking 200 images of tumours and generating 200 images of synthetic tumours that cannot be traced back to their hosts. That took 10 hours of computing time — and it’s just one slice of a larger project. “This is very, very computationally intensive with today’s technology,” Dr. Ng said. When that work is done, someone has to check to see if the dataset can reliably answer research questions.

What’s more, synthetic data can still lead to privacy breaches – a hacker could mine the set and correlate information with other sources. Dr. Charest said that differential privacy, which uses a mathematical formula to quantify this risk, is becoming a popular way to evaluate synthetic and other datasets.
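For readers curious about what differential privacy looks like in practice, the sketch below shows one standard textbook building block, the Laplace mechanism, applied to a simple count. It is not taken from Dr. Charest's work: the function name `dp_count`, the list of ages and the choice of epsilon are all assumptions made for illustration. The parameter epsilon is the privacy budget; a smaller value means more noise and a stronger privacy guarantee.

```python
import numpy as np


def dp_count(values, predicate, epsilon, rng):
    """Return a count with Laplace noise calibrated to the privacy budget."""
    true_count = sum(1 for v in values if predicate(v))
    # Adding or removing one person changes a count by at most 1,
    # so the sensitivity is 1 and the noise scale is 1 / epsilon.
    scale = 1.0 / epsilon
    return true_count + rng.laplace(loc=0.0, scale=scale)


rng = np.random.default_rng(42)
ages = [34, 51, 67, 72, 45, 80, 29, 66]  # hypothetical records

# How many people are 65 or older? The true answer is 4; the released
# answer is deliberately blurred so no single person can be pinned down.
noisy = dp_count(ages, lambda a: a >= 65, epsilon=0.5, rng=rng)
print(f"noisy count: {noisy:.2f}")

# Averaged over many releases the noise cancels out, which is why each
# release has to be charged against a limited privacy budget.
average = np.mean(
    [dp_count(ages, lambda a: a >= 65, epsilon=0.5, rng=rng) for _ in range(20000)]
)
print(f"average of 20,000 noisy counts: {average:.2f}")
```

The trade-off is visible in the single `epsilon` parameter: lowering it protects individuals more but makes each published number less accurate.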

Dr. Eurich is working with a number of organizations in Alberta to create synthetic datasets based on health records in the province, and his research team has already created a subset based on opioid prescriptions. He said many agencies around the world, including Health Canada, try to keep the risk of a privacy breach below 10 per cent. “The analysis we’ve done with our data is coming in at around three per cent,” he said.

Ready for the mainstream?

Dr. El Emam said many groups have been able to produce small, simple synthetic datasets so far. However, very large sets or those that are complex – containing not just numbers or facts but images, for instance – remain a challenge. So are datasets that include rare events. (Think of a heart monitor that can produce months of similar readouts and then, in some patients, a cardiac episode.) Data that includes long sequences, such as DNA, is also difficult. “That will be the next frontier,” said Dr. El Emam, especially since creating synthetic datasets related to DNA could be hugely valuable.

While researchers grapple with how to create foolproof algorithms, the concept of synthetic data is gaining profile. Most notably, the U.S. Census Bureau will release one component of the 2020 census in synthetic form. Dr. Eurich said Statistics Canada is interested in synthetic data too.

While it might not be ready for the mainstream just yet, expect that it will be in the coming years. “I think within the next decade that synthetic data will make up the bulk of data,” said Dr. Eurich.

Diane Peters

Diane Peters is a Toronto-based writer and editor.

1 Comment

Synthetic data is another obfuscation technique. The issue is that anything that obscures data reduces the utility of that data. The closer the synthetic data mimics the real data, the more likely it is that individuals can be re-identified. Rare events are easily missed or misunderstood with synthetic data.

In addition, assuming synthetic data is valuable, transfer of the data means transferring that value, depriving the data creator and the subjects of a share in that value. Imagine an amazing synthetic data set that allows billions in profit. Should the hospitals, health systems and even patients/subjects not share this value?

There are better options in development. Allow use of data without transfer. Allow rental of data without transfer. Use privacy and intellectual privacy preserving double-blind compute environments that never need data to be copied, moved or lost/exposed.

This fits the rubric of ethical data sharing. I would love to explore this topic with anyone who is interested.
