Synthetic data for real problems

How organizations across sectors and industries are using artificially generated data to advance new solutions.

Imagine you’re trying to address an important problem with a new solution that relies on artificial intelligence and machine learning. For instance, if you work in health, you may want to use an AI model to identify risk factors and quickly connect patients with the care they need. To build and test your solution, you’d need access to data — lots of data — but health records and other personal information are private and protected. Ideally, you’d have access to a robust data set that mimicked real-world data but preserved privacy.

Across sectors and problem spaces, advances in artificial intelligence and machine learning require massive amounts of data. In some industries, that data is readily available and unrestricted. But when there’s not enough data, or when data is too sensitive or restricted to be used for training models, innovators can fill the gaps with synthetic data.

What is synthetic data?

A definition:

Synthetic data is artificially generated by an AI algorithm that has been trained on a real data set. It has the same characteristics as real data, including statistical properties and patterns, but has no connection to personally identifiable information. If it’s used to build or test an application, or used for analysis, it should perform as real data would. (Sources: MIT Sloan Management Review, MIT News, O’Reilly)
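
To make the definition concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the method used by any organization named in this article. It builds a toy "real" data set of two numeric columns (age and systolic blood pressure, both invented for this example), fits a simple two-variable Gaussian model to it, and then samples new records from that model. The function names fit_gaussian and sample_synthetic are hypothetical, and production generators typically rely on far richer models.

```python
import math
import random
import statistics


def fit_gaussian(records):
    """Estimate the mean, spread, and correlation of two numeric columns."""
    xs = [x for x, _ in records]
    ys = [y for _, y in records]
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    std_x, std_y = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in records) / (len(records) - 1)
    return {"mean_x": mean_x, "mean_y": mean_y,
            "std_x": std_x, "std_y": std_y,
            "rho": cov / (std_x * std_y)}


def sample_synthetic(model, n, rng):
    """Draw n new records from the fitted model; no input row is reused."""
    rho = model["rho"]
    residual = math.sqrt(max(0.0, 1.0 - rho * rho))
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = model["mean_x"] + model["std_x"] * z1
        # Mixing z1 into y reproduces the fitted correlation between columns.
        y = model["mean_y"] + model["std_y"] * (rho * z1 + residual * z2)
        rows.append((x, y))
    return rows


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Stand-in for a "real" data set: (age, systolic blood pressure). Invented values.
real = []
for _ in range(2000):
    age = rng.gauss(50, 12)
    real.append((age, 90 + 0.6 * age + rng.gauss(0, 8)))

real_model = fit_gaussian(real)
synthetic = sample_synthetic(real_model, 5000, rng)
synthetic_model = fit_gaussian(synthetic)

# Compare summary statistics of the toy data and the synthetic sample.
for key in ("mean_x", "mean_y", "rho"):
    print(f"{key}: real={real_model[key]:.2f} synthetic={synthetic_model[key]:.2f}")
```

The printed means and correlation of the synthetic sample should sit close to those of the toy data, even though no original row is copied. This sketch only illustrates statistical similarity; it makes no claim about the privacy safeguards a real generator would need.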

Synthetic data is different from de-identified or anonymized data, from which identifiable information such as names, addresses, gender, and race has been removed or masked. (Researchers have demonstrated that it’s possible to link anonymized data to real people.) Synthetic data, on the other hand, cannot be traced back to a real person or real-world event.

Large technology players like Microsoft, Amazon, Google, and NVIDIA are embracing synthetic data, and MIT researchers have developed open-source tools for expanding access to synthetic data. Organizations across multiple sectors and industries have been investing in synthetic data to strengthen operations and expand their AI/ML capabilities.

Synthetic data in healthcare

In healthcare, AI and ML have the potential to improve diagnosis and research — as well as how care is allocated and accessed. But privacy is a major barrier to using data from real patients, so it’s difficult to train new AI and ML models without synthetic data. To advance research, treatments, and vaccines during the COVID-19 pandemic, the National Institutes of Health partnered with Syntegra to generate a synthetic data set representing millions of patient records. Hospital systems such as Rambam Health Care Campus in Israel are working with MDClone to create and share synthetic data with research collaborators. Similarly, the University of Florida’s academic health center partnered with NVIDIA to develop SynGatorTron, a language model that generates synthetic patient profiles. Researchers can use these profiles to train AI models, as well as augment small datasets for rare diseases or underrepresented patient populations. Anthem is partnering with Google to generate synthetic medical histories, insurance claims, and other healthcare data, with the goal of detecting fraud and providing more personalized care to its members.

Synthetic data in retail

Gartner estimates that by 2024, privacy regulations will protect personal data for three-quarters of the world’s population. While this will benefit consumers, it will present challenges for retailers, who use data to understand customer preferences, improve shopping experiences, and make business decisions. Amazon used synthetic data to intentionally introduce errors and challenge its computer vision technology in cashierless Amazon Go stores, reducing the need for on-site support staff and making it less likely that a customer would encounter an issue on their first visit. Synthetic data also creates opportunities for retailers to share representative copies of consumer data with manufacturers or advertisers without violating customer privacy.

Synthetic data in transportation

As transportation and logistics companies develop autonomous vehicles, synthetic data can help test real-world scenarios, including rare incidents, without real-world risks. NVIDIA’s Omniverse simulation platform helps engineers create synthetic training data for warehouse robots and self-driving vehicles, lowering the cost of failure and expanding the set of possible scenarios. Alphabet’s Waymo used synthetic data to train its self-driving cars, generating 2.5 billion miles of simulated driving data to augment real-world testing. Companies such as Volvo and Tesla have used gaming engines to simulate driving conditions and test autonomous features. Volvo worked with 3D-development and mixed-reality companies to test authentic driver reactions in simulated traffic scenarios. Tesla partnered with gaming engine Unreal to test its autonomous driving software against edge-case scenarios in a simulated version of San Francisco.

Scientists at Carnegie Mellon University explored the use of synthetic data to develop a more complete picture of public transit passengers for city planners, while researchers at MIT developed a model to generate location and mobility data for a synthetic population. Because large synthetic datasets are inexpensive to generate, they can be a cost-effective option for local and state transportation agencies. The U.S. Department of Transportation collaborated with university research teams to develop realistic artificial datasets to test safety analysis models and identify how well each model reflected real-world cause-and-effect relationships.

Synthetic data in financial services

Financial institutions and insurers have been using algorithms for years, but the introduction of synthetic data expands the types of transactions they can simulate. Companies such as American Express and J.P. Morgan use synthetic data to improve fraud detection and make better lending decisions. With synthetic data, companies can test algorithms against rare situations or hypothesized scenarios, then adjust parameters. In addition to speeding up iteration, this use of synthetic data may also help counter bias and ensure better algorithmic decision-making. In one case, an insurer wanted to accurately reflect climate risk in home insurance premiums, but couldn’t use real location data because of privacy regulations. To train its pricing model, the company used synthetic geolocation data to determine fire and flood risks. Similarly, Swiss insurance company La Mobilière created a model to assess customer churn and trained it using synthetic customer data to preserve privacy and comply with data protection laws.

Synthetic data in government services

Use of synthetic data in government agencies is expanding as agencies look to increase collaboration with researchers and innovators. Several products from the U.S. Census Bureau’s Longitudinal Employer-Household Dynamics program already use synthetic data, and the agency is considering the use of synthetic data for its American Community Survey as a way to improve the accuracy of information on historically undercounted communities. Norwegian government agencies such as the Norwegian Tax Administration and the Norwegian Labour and Welfare Administration generate synthetic national registries to support IT test environments; this preserves citizen privacy with an inexpensive, reusable data solution. And the U.S. Department of Veterans Affairs created synthetic data sets for Mission Daybreak, a $20 million grand challenge to reduce Veteran suicides, as part of an accelerator designed to help finalists advance their suicide prevention solutions. Using synthetic data eliminates the risk of exposing sensitive medical data while allowing innovators to safely test new models.


Use of synthetic data will likely accelerate as more organizations understand its potential. Some analysts predict that 60% of the data used for the development of AI and analytics projects will be synthetically generated by 2024.

As with any emerging technology, organizations should thoughtfully consider whether and how they will use synthetic data. To start, organizations should build internal awareness of synthetic data to better understand its potential and how it might be used to support business objectives. Part of that awareness means recognizing that all the problems related to “big data,” from structure and management to bias and standardization, also apply to synthetic data.

Synthetic data should support a specific solution to a specific problem — for instance, filling data gaps or addressing edge cases. Organizations will need to identify real needs, develop a business case, and sell the idea internally. The approach may vary, depending on an organization’s budget and capabilities. It takes time to build a team; when starting something new, most organizations hire a partner and start small with a test program.