Create Realistic and Secure Test Data with our Synthetic Data Platform

by MOSTLY AI, July 14th, 2021

Too Long; Didn't Read

Learn how to fully automate your test data creation efforts. Let our AI model privacy-protect your production data and figure out its business rules.


Creating test data can be an arduous task. More often than not, it takes considerable manual work to obtain a dependable subset of your production data that maintains its referential integrity, business rules, and representative business scenarios, all while protecting the privacy of your data subjects.


This blog post will help you expedite your test data creation efforts. With just a few mouse clicks in MOSTLY AI’s synthetic data platform, you'll have ready-to-use test data for your development environment, free of charge and without the need to configure any business rules or logic. Our AI-powered synthetic data engine learns your dataset's features and takes care of the hard work for you.


Before we jump head-first into generating test data, let's explore a test dataset that's common in e-commerce software engineering: the online shopping cart. The Entity-Relationship diagram below depicts a small version of this dataset, which consists of four tables:


  • user, which contains the data subjects whose privacy you want to protect
  • cart and cart_item, which are linked tables that represent a user's shopping cart and its contents
  • product, which is a reference table that contains all the products this online store sells


We will use this database schema to illustrate the steps you need to take to create a subset, maintain its referential integrity, and demonstrate how MOSTLY AI retains the business rules that underpin the data.


We won't need to synthesize the product reference table in these steps. It contains only product information, such as a product’s name, brand, price, and barcode, so there's no sensitive or privacy-relevant information to worry about. This table can be copied to the test environment as is. Only the tables with privacy-sensitive data need to be synthesized, namely the subject table user and the linked tables cart and cart_item.
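To make the schema concrete, here is a minimal sketch of the four tables in SQLite, using Python's standard library. The column names (id, user_id, cart_id, product_id, quantity, and so on) are assumptions for illustration only. The article names just the tables and the product attributes.

```python
import sqlite3

# Hypothetical column names: only the table names and the product
# attributes (name, brand, price, barcode) come from the article.
SCHEMA = """
CREATE TABLE user (
    id INTEGER PRIMARY KEY,
    name TEXT,
    email TEXT
);
CREATE TABLE product (
    id INTEGER PRIMARY KEY,
    name TEXT,
    brand TEXT,
    price REAL,
    barcode TEXT
);
CREATE TABLE cart (
    id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL REFERENCES user(id),
    created_at TEXT
);
CREATE TABLE cart_item (
    id INTEGER PRIMARY KEY,
    cart_id INTEGER NOT NULL REFERENCES cart(id),
    product_id INTEGER NOT NULL REFERENCES product(id),
    quantity INTEGER NOT NULL
);
"""


def create_schema() -> sqlite3.Connection:
    """Create the four example tables in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    # SQLite does not enforce foreign keys unless this pragma is enabled.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn
```

With foreign keys enabled, inserting a cart for a user that does not exist fails, which is the kind of referential integrity the synthetic tables need to preserve.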


Subsetting

Production datasets can be terabytes in size. But what you want for testing is the right data rather than all of it. MOSTLY AI can create a smaller, representative subset of your data that covers all the business scenarios for testing. Instead of randomly sampling the data, it synthesizes a subset that accurately represents all the production data, with a QA report included.


This is analogous to the way that, for instance, sociologists conduct population studies. It wouldn't be feasible for them to interview the entire population of a country. Instead, they carefully select a sample with which they can say or conclude something about the complete population. Useful and realistic test data works in exactly the same way. Even though it's only a sample, its test cases are representative of the entirety of the production data.


With this in mind, let's start generating the first part of our synthetic test dataset. The references to the online shopping cart database schema are only there to illustrate the workflow, but you can use them as a template for extracting a dataset from your own production data.


1. Log in to MOSTLY AI’s free edition and click on Runs in the left side menu.


2. Drag the CSV file that contains the user table into the Subject table pane.


3. Click on Add table and drag the CSV file that's linked to this subject table to the Table 2 pane. In the case of our online shopping cart, this would be the cart table.


4. You will see the Configure Run Data page once you click Proceed. Here you can specify how large you want your subset to be. If your production data contains 100,000 subjects, you can take a 10% sample by entering 10,000 in the Number of generated subjects field.


You can also use Number of processed subjects instead of Number of generated subjects. This creates a subset of your production data before the AI model is trained. It speeds up the test data generation process, but results in a less accurate rendering of the statistical distributions and correlations in your production data.


5. A Join tables form appears once you click Continue. Here you can specify the primary key - foreign key relationship between the two tables.


6. Click Continue again to configure your columns’ encoding types. MOSTLY AI has already analyzed the contents of your tables for their data types and unique values. You can use the Configure tables page to review the settings before launching the synthetization run.


7. Lastly, click on Launch run, sit back, and relax.
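For intuition, the effect of Number of processed subjects, which subsets the data before training, can be approximated with plain random sampling of subjects plus a filter on the linked rows. The sketch below is not MOSTLY AI's implementation. The function name and the `id` and `user_id` columns are assumptions made for this example.

```python
import random


def sample_subjects(users, carts, fraction, seed=0):
    """Randomly keep a fraction of subjects and only the carts that belong to them.

    `users` and `carts` are lists of dicts. The `id` and `user_id` keys are
    assumed column names, not something the platform prescribes.
    """
    rng = random.Random(seed)
    n_keep = round(len(users) * fraction)
    kept_users = rng.sample(users, n_keep)
    kept_ids = {u["id"] for u in kept_users}
    # Dropping carts of unsampled users keeps every foreign key valid.
    kept_carts = [c for c in carts if c["user_id"] in kept_ids]
    return kept_users, kept_carts


# Made-up data: 100,000 subjects, reduced to a 10% sample.
users = [{"id": i} for i in range(100_000)]
carts = [{"id": i, "user_id": i % 100_000} for i in range(150_000)]
subset_users, subset_carts = sample_subjects(users, carts, fraction=0.10)
print(len(subset_users))  # 10000
```

Random sampling like this preserves the key relationships but offers no privacy protection, which is why the platform trains a model and generates synthetic subjects instead.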



Referential integrity

For a test dataset to be useful, it must contain the same relationships as the production dataset with accurate values based on business rules. With MOSTLY AI, you can ensure referential integrity by conditionally generating the linked tables of your production dataset.


Conditional generation works like auto-complete for your data. Once you have trained a model on a pair of subject table and linked table, you can "ask" this model to "complete" any subject table you give it, as long as its columns and data types match those of the original subject table. Upload a handcrafted, updated, or synthesized version of the subject table, and MOSTLY AI will generate a fitting, realistic linked table.


For our online shopping cart dataset, we'll need to conditionally generate the cart_item table. It's the last table we need to synthesize, and the items in this table must refer to the carts in the synthetic cart table. This procedure differs slightly from the subsetting step. First, we'll need to create a training model using the original cart and cart_item tables. We can then use MOSTLY AI's "Generate more data" function to generate the synthetic version of the cart_item table.


We also need to consider the product reference table in this step. Even though we won't synthesize it, we do need to ensure that we maintain the references to this table. You do this by assigning the Categorical encoding type to the foreign key column that links the cart_item table to the product table. This ensures that the synthetic foreign key column contains the values that are already present in the original foreign key column.
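The idea behind the Categorical encoding can be illustrated with a toy stand-in for the trained model: draw product_id values only from those observed in the original column, and attach the generated rows to the carts of the supplied subject table. This is a deliberately simplistic sketch, not MOSTLY AI's algorithm, and every name in it is hypothetical.

```python
import random
from collections import Counter


def generate_cart_items(original_items, synthetic_cart_ids, seed=0):
    """Toy conditional generator for cart_item rows.

    Learns two empirical distributions from the original rows (items per
    cart, and product_id frequencies) and samples from them for every
    synthetic cart. Column names are assumptions for illustration.
    """
    rng = random.Random(seed)
    items_per_cart = list(Counter(r["cart_id"] for r in original_items).values())
    product_ids = [r["product_id"] for r in original_items]  # observed values only
    rows = []
    for cart_id in synthetic_cart_ids:
        for _ in range(rng.choice(items_per_cart)):
            rows.append({
                "id": len(rows) + 1,
                "cart_id": cart_id,  # always refers to a supplied cart
                "product_id": rng.choice(product_ids),  # never an unseen product
            })
    return rows


# Made-up original rows and synthetic cart ids.
original = [
    {"cart_id": 1, "product_id": 10},
    {"cart_id": 1, "product_id": 11},
    {"cart_id": 2, "product_id": 10},
]
synthetic_items = generate_cart_items(original, synthetic_cart_ids=[101, 102, 103])
```

Because the product ids are sampled from the original column, every generated row still points at an existing row in the product reference table.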


Now let's complete our test dataset by walking through the steps below:


1. Drag the original cart table into the Subject table pane and the original cart_item table to the Table 2 pane, then click Proceed.


2. Leave all fields on the General Settings page blank and click Continue.


3. Specify the primary key - foreign key relationship between the two tables and click Continue.


4. Review the column settings. Set any column that links to a reference table to Categorical. Otherwise, you'll break the references to that table. For our online shopping cart, this would be the product_id column in the cart_item table.


5. Click on Launch run and wait for the run to complete.


6. Scroll down to the Actions section of your run and click on Generate more data.


7. Lastly, drag the synthesized version of the cart table to the upload area, and click Generate to synthesize the final table of your test dataset.


Congratulations! You successfully synthesized a test dataset from your production data.
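Before loading the result into a test environment, it can be worth confirming that the references actually hold. The helper below is a small sketch that is not part of the platform. Its name and the column names are assumptions, and the rows are made-up examples.

```python
def orphaned_keys(child_rows, fk_column, parent_rows, pk_column="id"):
    """Return the foreign key values in child_rows that have no matching parent row."""
    parent_keys = {row[pk_column] for row in parent_rows}
    child_keys = {row[fk_column] for row in child_rows}
    return sorted(child_keys - parent_keys)


# Made-up synthetic tables; one cart_item row points at a missing cart.
cart = [{"id": 1}, {"id": 2}]
product = [{"id": 10}, {"id": 11}]
cart_item = [
    {"cart_id": 1, "product_id": 10},
    {"cart_id": 2, "product_id": 11},
    {"cart_id": 99, "product_id": 10},
]

print(orphaned_keys(cart_item, "cart_id", cart))        # [99]
print(orphaned_keys(cart_item, "product_id", product))  # []
```

An empty list for both checks means every cart_item row refers to an existing cart and an existing product.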


Conditioned test data

Obtaining useful results from legacy test data generators is laborious. First, you'll need a deep understanding of the business logic and rules that underpin your production data. This knowledge is necessary to specify the conditions under which certain values occur. These can be simple rules: 80% of cars sold cost less than 20,000 euros. They can also be nested: cars that cost less than 20,000 euros are often compact city cars. Or they can describe behavioral patterns: customers visit a car dealer six times on average before buying a car.


MOSTLY AI saves you the effort of understanding and defining granular business rules. In fact, it automates the process of studying the data to formulate these rules. As an AI-powered synthetic data generator, it learns the patterns that are present in the production data, assigns probabilities to the conditions in which certain values occur, and synthesizes your test data accordingly.
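One way to spot-check that a simple rule survived synthesis, such as the share of cars sold for less than 20,000 euros, is to compute the same statistic on both datasets and compare. The sketch below uses made-up prices and a hypothetical helper name. It is an illustration, not a feature of the platform.

```python
def share_below(values, threshold):
    """Return the fraction of values that are strictly below threshold."""
    values = list(values)
    if not values:
        raise ValueError("values must not be empty")
    return sum(v < threshold for v in values) / len(values)


# Made-up prices in euros, for illustration only.
production_prices = [12_000, 15_500, 18_900, 19_999, 34_000]
synthetic_prices = [11_200, 16_800, 17_400, 19_500, 41_000]

print(share_below(production_prices, 20_000))  # 0.8
print(share_below(synthetic_prices, 20_000))   # 0.8
```

If the two shares differ noticeably on real data, that is a cue to look at the QA report for the affected column.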


The resulting test data breathes realism and makes your product come alive before it's launched to its users.


Conclusion

With MOSTLY AI, the struggles of launching new products have become a thing of the past. In this deep-dive, you've experienced where the gains lie when creating a small test dataset, and how it can benefit and improve your work as a test engineer.


The Enterprise edition of MOSTLY AI allows you to scale these benefits to production databases that may well contain dozens of tables. You also won't need to perform any manual tasks: the steps that you went through can be fully automated using its REST API.


Either way, whether you're using the Free or Enterprise edition, you're benefitting from a significantly improved turnaround time for your test cases and the ability to point out hidden defects in the codebase with ease. Test data will no longer be the bottleneck that keeps you from delivering value to your customers.