Generative AI has drawn wide attention for its ability to produce text and images. But those media represent only a fraction of the data circulating in society today. Data is generated every time a patient undergoes medical treatment, a storm disrupts a flight, or a person interacts with a software application. By using generative AI to create realistic synthetic data around such scenarios, organizations can treat patients, reroute flights, and improve software platforms more effectively, especially where real-world data is limited or sensitive.
For the past three years, the MIT spinout DataCebo has offered a generative software system called the Synthetic Data Vault, which helps organizations create synthetic data for tasks such as testing software applications and training machine learning models. The Synthetic Data Vault, or SDV, has been downloaded more than 1 million times, and more than 10,000 data scientists use its open-source library to generate synthetic tabular data. The co-founders, Principal Research Scientist Kalyan Veeramachaneni and alumna Neha Patki '15, SM '16, attribute the company's success to SDV's ability to revolutionize software testing.
In 2016, Veeramachaneni's group in the Data to AI Lab introduced a suite of open-source generative AI tools to help organizations create synthetic data that matches the statistical properties of real data. Companies can use that synthetic data in place of sensitive information in their programs while preserving the statistical relationships between data points. They can also use it to run new software through simulations and see how it performs before releasing it to the public. The group ran into the problem of sharing sensitive data while collaborating with companies that wanted to make their data available for research.
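To make "matching the statistical properties of real data" concrete, here is a minimal sketch that uses a deliberately simple stand-in technique: it fits a two-column Gaussian model to a table and then samples new rows from it. This is not SDV's method or API. The function names `fit_gaussian` and `sample_synthetic`, the columns (age, annual spend), and the "real" data itself are all made up for illustration.

```python
import math
import random
import statistics


def fit_gaussian(rows):
    """Estimate means, standard deviations, and correlation of two numeric columns."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.pstdev(xs), statistics.pstdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in rows) / len(rows)
    rho = cov / (sx * sy)
    return mx, my, sx, sy, rho


def sample_synthetic(params, n, rng):
    """Draw n new rows from a bivariate Gaussian with the fitted parameters."""
    mx, my, sx, sy, rho = params
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mx + sx * z1
        # Mixing z1 into y reproduces the fitted correlation between columns.
        y = my + sy * (rho * z1 + math.sqrt(1 - rho * rho) * z2)
        rows.append((x, y))
    return rows


rng = random.Random(0)

# Stand-in for a sensitive "real" table of (age, annual_spend); invented here.
real = []
for _ in range(2000):
    age = rng.uniform(20, 70)
    spend = 100 * age + rng.gauss(0, 500)
    real.append((age, spend))

params = fit_gaussian(real)
synthetic = sample_synthetic(params, 2000, rng)
synthetic_params = fit_gaussian(synthetic)

print("real correlation:     ", round(params[4], 3))
print("synthetic correlation:", round(synthetic_params[4], 3))
```

None of the synthetic rows is copied from the original table, yet the relationship between the two columns carries over. Production tools model far richer structure (categorical columns, multiple linked tables, constraints), which a plain Gaussian cannot capture.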
In 2020, the researchers founded DataCebo to build more SDV features for larger organizations. Since then, the use cases have been as impressive as they have been varied. With DataCebo's flight simulator, for example, airlines can plan for extreme weather events in a way that was previously impossible with only historical data. In another application, SDV users synthesized medical records to predict health outcomes for patients with cystic fibrosis. Most recently, a team from Norway used SDV to generate synthetic student data and assess whether various admissions policies were meritocratic and free from bias.
In 2021, the data science platform Kaggle hosted a competition for data scientists that used SDV to create synthetic data sets, avoiding the need for proprietary data. Approximately 30,000 data scientists participated, building solutions and predicting outcomes based on the company's realistic data. As DataCebo has grown, it has stayed true to its MIT roots: all of its current employees are MIT alumni.
Although its open-source tools serve many use cases, the company is focused on expanding its impact in software testing. For instance, if a bank wanted to test a program that rejects transfers from accounts with insufficient funds, it would need to simulate many accounts transacting simultaneously. Doing that with manually created data would be time-consuming, but with DataCebo's generative models, customers can create any edge case they want to test.
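The bank scenario can be sketched in a few lines. The example below is an illustration under stated assumptions, not DataCebo's tooling: the `Bank` class is a hypothetical system under test, and `generate_transfers` is a made-up, hand-written generator that oversamples edge cases (a transfer of exactly the starting balance, one cent more, one cent less) where a generative model would instead learn them from data. Amounts are integers in cents.

```python
import random
import threading


class Bank:
    """Hypothetical system under test: rejects transfers the sender cannot cover."""

    def __init__(self, balances):
        self.balances = dict(balances)
        self._lock = threading.Lock()

    def transfer(self, src, dst, amount):
        with self._lock:  # the balance check and the update must be atomic
            if amount <= 0 or self.balances[src] < amount:
                return False
            self.balances[src] -= amount
            self.balances[dst] += amount
            return True


def generate_transfers(balances, n, rng):
    """Made-up generator that deliberately oversamples edge-case amounts."""
    accounts = list(balances)
    transfers = []
    for _ in range(n):
        src, dst = rng.sample(accounts, 2)
        start = balances[src]
        amount = rng.choice(
            [start, start + 1, max(start - 1, 1), rng.randint(1, 50)]
        )
        transfers.append((src, dst, amount))
    return transfers


rng = random.Random(42)
start = {"acct_0": 0, "acct_1": 1, "acct_2": 100, "acct_3": 5000}
bank = Bank(start)
transfers = generate_transfers(start, 400, rng)


def worker(batch):
    for src, dst, amount in batch:
        bank.transfer(src, dst, amount)


# Four threads replay interleaved slices to mimic simultaneous activity.
threads = [
    threading.Thread(target=worker, args=(transfers[i::4],)) for i in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

total_before = sum(start.values())
total_after = sum(bank.balances.values())
lowest_balance = min(bank.balances.values())
print("money conserved:", total_after == total_before)
print("lowest balance: ", lowest_balance)
```

The two invariants checked at the end (no money created or destroyed, no account overdrawn) should hold no matter how the threads interleave, which is what makes generated workloads like this useful as test input.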
Veeramachaneni believes DataCebo is advancing the field of what it calls synthetic enterprise data, that is, data generated from user behavior on large companies' software applications. The company has also recently released features to improve SDV's usefulness, including the SDMetrics library, which assesses the "realism" of the generated data, and SDGym, a way to compare the performance of different models.
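One common way to put a number on "realism" for a single numeric column is to compare the empirical distributions of the real and synthetic values. The sketch below is a hand-rolled stand-in, not the SDMetrics or SDGym API: `ks_complement` is a hypothetical helper that returns 1 minus the two-sample Kolmogorov-Smirnov statistic, and the two "models" being compared are simply made-up samples.

```python
import bisect
import random


def ks_complement(real, synthetic):
    """Return 1 minus the two-sample Kolmogorov-Smirnov statistic.

    1.0 means the two empirical distributions are identical; 0.0 means
    they do not overlap at all.
    """
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    max_gap = 0.0
    # The gap between two step-function CDFs can only peak at a data point.
    for value in real_sorted + synth_sorted:
        cdf_real = bisect.bisect_right(real_sorted, value) / len(real_sorted)
        cdf_synth = bisect.bisect_right(synth_sorted, value) / len(synth_sorted)
        max_gap = max(max_gap, abs(cdf_real - cdf_synth))
    return 1.0 - max_gap


rng = random.Random(1)
real = [rng.gauss(50, 10) for _ in range(1000)]

# Made-up outputs of two hypothetical generators for the same column.
candidates = {
    "close_model": [rng.gauss(50, 10) for _ in range(1000)],
    "shifted_model": [rng.gauss(65, 10) for _ in range(1000)],
}

scores = {name: ks_complement(real, data) for name, data in candidates.items()}
best = max(scores, key=scores.get)

for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.3f}")
print("best:", best)
```

Scoring every candidate against the same real data and ranking the results is the basic idea behind benchmarking generators. A full evaluation would also cover categorical columns, relationships between columns, and privacy.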
As companies in every industry rush to adopt AI and other data science tools, DataCebo is helping them do so in a more transparent and responsible way. Synthetic data lets organizations test and analyze their systems without exposing sensitive records, which supports better-informed decisions and more responsible use of data.