Synthetic Data 101: What It Is, Why It Matters, and How to Use It Well

Nov 4, 2024
Updated: Apr 4, 2025

An abstract fluid art piece that is black, blue, white and green

Recently I took part in my first Oxford-style debate, defending synthetic data’s place in AI, at an event put on by the Insights Association at Stevens Institute of Technology in Hoboken, NJ.

The debate format forced me to take a hard stance, which is not me at all. I learned a lot from the other panelists and our moderator. Where I landed is that synthetic data has a lot of value to offer as long as you use it wisely (which makes it just like everything else in AI).

Synthetic data absolutely belongs in any AI builder’s toolbox. Like any tool, it’s all about knowing when to use it, how to handle it, and where the pitfalls are. Synthetic data solves some of AI’s biggest data challenges: maintaining privacy, enabling secure data sharing, and making development faster and less expensive.

What is synthetic data?

Synthetic data is machine-generated data that has statistical properties similar to those of real data. It’s created using AI, which studies a real dataset, learns its patterns and relationships, and then generates a new dataset that looks very similar in structure but doesn’t include any of the actual records.

Gartner predicted that by 2024, 60% of the data used in artificial intelligence and analytics projects would be synthetically generated.

Let's say you have real data from a drug trial—it includes details like age, gender, health conditions, dosages, and patient outcomes. You’d use AI to analyze these patterns and create a new dataset that mirrors them: the synthetic data would have similar relationships and trends, like how a certain age group might respond to a dosage or how different health conditions affect outcomes.
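To make the drug trial example concrete, here is a minimal sketch in Python. It is not how production synthetic data tools work (those typically use far richer generative models); it only assumes two numeric columns, age and dosage, and a simple Gaussian model. The records, column choice, and function names are all invented for illustration.

```python
import math
import random
import statistics

# Toy "real" trial records, invented for this sketch: (age, dosage_mg).
real = [(34, 50), (41, 55), (29, 45), (63, 80), (57, 70),
        (45, 60), (38, 50), (70, 85), (52, 65), (48, 60)]

def pearson(rows):
    """Pearson correlation between the two columns of `rows`."""
    xs, ys = zip(*rows)
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in rows)
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def fit(rows):
    """Step 1: study the real data (mean, spread, and correlation)."""
    xs, ys = zip(*rows)
    return {
        "mean_x": statistics.mean(xs), "std_x": statistics.stdev(xs),
        "mean_y": statistics.mean(ys), "std_y": statistics.stdev(ys),
        "corr": pearson(rows),
    }

def sample(params, n, seed=0):
    """Step 2: generate new rows from the learned pattern, not from real rows."""
    rng = random.Random(seed)
    r = params["corr"]
    noise_scale = math.sqrt(max(0.0, 1.0 - r * r))
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        age = params["mean_x"] + params["std_x"] * z1
        # Mixing z1 into the dosage reproduces the age/dosage correlation.
        dose = params["mean_y"] + params["std_y"] * (r * z1 + noise_scale * z2)
        rows.append((round(age), round(dose)))
    return rows

params = fit(real)
synthetic = sample(params, n=2000)
print("real correlation:     ", round(params["corr"], 2))
print("synthetic correlation:", round(pearson(synthetic), 2))
```

The synthetic rows are drawn from the fitted pattern rather than copied from the real records, so the age and dosage relationship carries over. Note that a toy like this makes no privacy guarantee on its own: a generated row can still coincide with a real one by chance.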

Researchers have been able to take supposedly anonymous real data and trace it back to real people. In a well-made synthetic dataset, none of the records would link back to a real person.

Keeping Sensitive Data Safe

Synthetic data brings huge privacy advantages. Engineers are working with vast amounts of data, which drives up privacy risks. Microsoft learned this the hard way when its AI researchers accidentally exposed 38 terabytes of private data. If Microsoft, a major player with practically unlimited resources, can slip up like that, how can smaller organizations hope to keep data secure?

A list of credit card transactions might not display an account number, but the date, location, and amount might be enough to trace the transaction back to you.
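One rough way to see this risk is to count how many records become unique once you combine a few harmless-looking fields. The sketch below uses made-up transactions and a hypothetical helper, unique_fraction; it is an illustration of the idea, not a formal privacy audit.

```python
from collections import Counter

# Made-up transactions with the account numbers already removed.
transactions = [
    {"date": "2024-10-01", "location": "Hoboken, NJ", "amount": 4.75},
    {"date": "2024-10-01", "location": "Hoboken, NJ", "amount": 4.75},
    {"date": "2024-10-01", "location": "Hoboken, NJ", "amount": 62.10},
    {"date": "2024-10-02", "location": "Newark, NJ", "amount": 18.00},
    {"date": "2024-10-03", "location": "Hoboken, NJ", "amount": 129.99},
]

def unique_fraction(rows, keys):
    """Share of rows whose combination of `keys` appears exactly once."""
    combos = Counter(tuple(row[k] for k in keys) for row in rows)
    unique = sum(1 for row in rows if combos[tuple(row[k] for k in keys)] == 1)
    return unique / len(rows)

print(unique_fraction(transactions, ["date"]))
print(unique_fraction(transactions, ["date", "location", "amount"]))
```

The more fields you combine, the more rows stand alone, and a row that stands alone is a row that someone with outside knowledge could match to a person.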

Synthetic data is great for testing, developing, and collaborating. In situations where data breaches could mean disaster, synthetic data can let teams move forward without putting sensitive information at risk.

Democratizing Innovation

Some of the world’s most transformative companies started from a garage or a dorm room. Do we want to limit innovation to only the most well-resourced companies that are subject to the whims of Wall Street? Within a company, does only one team get to innovate because it is the only one we can afford to give real data, or can we scale innovation by using synthetic data across all our teams in the early stages of development?

I loved having students from Stevens Institute of Technology at the event, but many of them may find that the real-world data they need to work on problems in healthcare or finance is out of reach, restricted by privacy laws or costs. Synthetic data can level the playing field by giving students, startups, and even smaller teams access to realistic data without the cost and security concerns.

Reducing Costs and Speeding Up Development

Collecting, labeling, and cleaning real-world data is time-consuming and costly. Synthetic data makes it easier and less expensive to work with data. This allows teams to train and test models faster, which in turn enables more frequent iterations and improvements. Waymo uses synthetic data to simulate millions of edge cases that are unlikely but critical to improving the safety of autonomous vehicles. By relying on synthetic data, they can train their models on rare and risky events they can’t test on the road.

Addressing Synthetic Data’s Limitations

Of course, synthetic data isn’t a cure-all, and the other team in the debate did a good job of highlighting a few risks. For starters, there’s no universal definition of synthetic data, and that lack of standardization means quality varies widely. Not every synthetic dataset is created equal: different methods can produce vastly different results, and what’s useful for one application might be flawed for another.

Synthetic data may have played a role in the failure of IBM’s Watson for Oncology, which was developed with M.D. Anderson Cancer Center to improve cancer care. The system often gave flawed treatment recommendations, such as advising blood thinners for a patient with severe bleeding. Trained on a limited set of hypothetical cases rather than real patient data and relying on input from a small group of specialists rather than broad guidelines, the project ultimately cost M.D. Anderson $62 million without delivering results.

Another valid point is that the real world is messy, and synthetic data may miss crucial nuances and edge cases. Take healthcare, for example: real patient data can contain rare or complex combinations that synthetic data might not replicate. If we rely too heavily on synthetic data without cross-checking it against real data, we risk models that fall short in live scenarios.
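A simple version of that cross-check is to list the combinations that show up in the real data but never appear in the synthetic data. The sketch below uses invented patient records and a hypothetical helper, missing_combinations; a real validation would look at many more properties than this one.

```python
from collections import Counter

def missing_combinations(real_rows, synthetic_rows):
    """Combinations seen in the real data but absent from the synthetic
    data, mapped to how often they occur in the real data."""
    real_counts = Counter(real_rows)
    seen = set(synthetic_rows)
    return {combo: n for combo, n in real_counts.items() if combo not in seen}

# Invented records: (primary condition, co-occurring condition).
real_rows = (
    [("diabetes", "hypertension")] * 40
    + [("asthma", "none")] * 55
    + [("asthma", "rare_clotting_disorder")] * 2
)
# A synthetic set that captured the common cases but dropped the rare one.
synthetic_rows = (
    [("diabetes", "hypertension")] * 42
    + [("asthma", "none")] * 58
)

gaps = missing_combinations(real_rows, synthetic_rows)
print(gaps)
```

Anything this check reports is a case a model trained only on the synthetic set would never have seen, which is exactly where it is likely to fall short in live use.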

Using the drug trial process

To mitigate these issues, I suggested in the debate that we could take a page from the drug development process: synthetic data could be tested in layers, starting in a “lab setting” and moving toward real-world applications. Synthetic datasets could be used for exploration and prototyping, while later stages involve higher-quality synthetic data or real-world data. This way, we can explore concepts and build early versions with synthetic data before transitioning to the high-stakes work that requires real data.

Real Data Has Its Own Drawbacks

Although real data has traditionally been the gold standard, it’s not without serious flaws. Real data often contains historical biases that can warp AI model results. Take Amazon’s recruiting tool that showed gender bias: it was trained on a decade’s worth of resumes that skewed male, and those biases crept into the algorithm. Synthetic data, while not immune to bias, could help reduce such biases, for example by generating more examples of underrepresented groups.

The Verdict: Synthetic Data Belongs in the Toolbox

Synthetic data is far from perfect, but its advantages make it a powerful tool. Privacy protection, democratized access, lower costs, and faster testing are all reasons to use synthetic data to support AI development. But the risks are real, and any organization using synthetic data should have good governance and testing controls in place.

Image: A recent fluid art piece that you can buy on Etsy.