The process of creating test data can be an arduous task. More often than not, it requires a considerable amount of manual work to obtain a dependable subset of your production data that maintains its referential integrity, business rules, and representative business scenarios, all while protecting the privacy of your data subjects.

This blog post will help you expedite your test data creation efforts. With just a few mouse clicks in [MOSTLY AI’s synthetic data platform](https://mostly.ai/), you'll have ready-to-use test data for your development environment, free of charge and without the need to configure any business rules or logic. Our AI-powered synthetic data engine learns your dataset's features and takes care of the hard work for you.

Before we jump head-first into generating test data, let's explore a test dataset that's common in e-commerce software engineering: the online shopping cart. The Entity-Relationship diagram below depicts a small version of this dataset, which consists of 4 tables:

* **user**, which contains the data subjects whose privacy you want to protect
* **cart** and **cart_item**, which are linked tables that represent a user's shopping cart and its contents
* **product**, which is a reference table that contains all the products this online store sells

![Entity-Relationship diagram of the online shopping cart dataset](https://cdn.hackernoon.com/images/E0qcsgkCgJVr8N1NZr6GDPeIOPs1-uc1k356i.png)

We will use this database schema to illustrate the steps you need to take to create a subset, maintain its referential integrity, and demonstrate how MOSTLY AI retains the business rules that underpin the data.

We won't need to synthesize the **product** reference table in these steps. Because it contains only *product information*, such as a product’s name, brand, price, and barcode, there's no *sensitive* or *privacy-relevant* information that we need to worry about. Reference tables like this one can be copied to the test environment as is.
Only the tables with privacy-sensitive data need to be synthesized, namely the subject table **user** and the linked tables **cart** and **cart_item**.

## Subsetting

Production datasets can be terabytes in size. But what you want for testing is the *right* data rather than *all* of it. MOSTLY AI can create a smaller, representative subset of your data that covers all the business scenarios for testing. Instead of randomly sampling the data, it *synthesizes* a subset that accurately represents all the production data, QA report included.

This is analogous to the way that, for instance, sociologists conduct population studies. It wouldn't be feasible for them to interview the entire population of a country. Instead, they carefully select a sample from which they can draw conclusions about the complete population. Useful and realistic test data works in exactly the same way. Even though it's only a sample, its test cases are representative of the entirety of the production data.

With this in mind, let's start generating a part of our synthetic test dataset. The references to the *online shopping cart* database schema are only there to illustrate the workflow, but you can use the schema as a template for extracting a dataset from your own production data.

1. Log in to the [Community edition of MOSTLY AI](https://generate.mostly.ai/signup/register?state=ef9257b7-511a-4b7a-9bcc-095f2c1b3ade&_gl=1\*ncatrz\*_ga\*MTI1MDczMTg3NS4xNjAwNzc3MDQ1\*_ga_8NGESMV97J\*MTYyNTY1NzYwOS40MC4xLjE2MjU2NTc2MTkuMA..) and click on *Runs* in the left side menu.

![](https://cdn.hackernoon.com/images/E0qcsgkCgJVr8N1NZr6GDPeIOPs1-l2g35pf.png)

2\. Drag the CSV file that contains the **user** table into the *Subject table* pane.

3\. Click on *Add table* and drag the CSV file that's linked to this subject table to the *Table 2* pane. In the case of our online shopping cart, this would be the **cart** table.

4\. 
You will see the *Configure Run Data* page once you click *Proceed*. Here you can specify how large you want your subset to be. If your production data contains 100,000 subjects, you can take a 10% sample by entering 10,000 in the *Number of generated subjects* field.

![](https://cdn.hackernoon.com/images/E0qcsgkCgJVr8N1NZr6GDPeIOPs1-ln4u35ah.png)

You can also use *Number of processed subjects* instead of *Number of generated subjects*. This creates a subset of your production data before the AI model is trained. It speeds up test data generation, but results in a less accurate rendering of the statistical distributions and correlations in your production data.

5\. A *Join tables* form appears once you click *Continue*. Here you can specify the *primary key - foreign key* relationship between the two tables.

![](https://cdn.hackernoon.com/images/E0qcsgkCgJVr8N1NZr6GDPeIOPs1-bp58353w.png)

6\. Click *Continue* again to configure your columns’ encoding types. MOSTLY AI has already analyzed the contents of your tables for their data types and unique values. You can use the *Configure tables* page to review the settings before launching the synthetization run.

![](https://cdn.hackernoon.com/images/E0qcsgkCgJVr8N1NZr6GDPeIOPs1-em5j358f.png)

7\. Lastly, click on *Launch run*, sit back, and relax.

![](https://cdn.hackernoon.com/images/E0qcsgkCgJVr8N1NZr6GDPeIOPs1-nv3w359a.png)

## Referential integrity

For a test dataset to be useful, it must contain the same relationships as the production dataset, with accurate values based on business rules. With MOSTLY AI, you can ensure referential integrity by *conditionally* generating the linked tables of your production dataset.

Conditional generation works like an auto-complete for your data. 
Once you have a trained model of a subject table and linked table pair, you can "ask" this model to "complete" any subject table you throw at it, as long as the columns and data types are the same as in the original subject table. Upload your handcrafted, updated, or synthesized version of the subject table, and MOSTLY AI will generate a fitting, realistic linked table.

For our online shopping cart dataset, we'll need to conditionally generate the **cart_item** table. It's the last table we need to synthesize, and the items in this table must refer to the carts in the *synthetic* **cart** table. This procedure differs slightly from the subsetting step. First, we'll need to train a model using the original **cart** and **cart_item** tables. We can then use MOSTLY AI's "Generate more data" function to generate the synthetic version of the **cart_item** table.

![](https://cdn.hackernoon.com/images/E0qcsgkCgJVr8N1NZr6GDPeIOPs1-ja6n35cn.png)

We also need to consider the **product** reference table in this step. Even though we won't synthesize it, we do need to ensure that we maintain the references to this table. You do this by assigning the *Categorical* encoding type to the foreign key column that links the **cart_item** table to the **product** table. This ensures that the synthetic foreign key column contains only values that are already present in the original foreign key column.

Now let's complete our test dataset by walking through the steps below:

1. Drag the *original* **cart** table into the *Subject table* pane and the original **cart_item** table to the *Table 2* pane, then click *Proceed*.

![](https://cdn.hackernoon.com/images/E0qcsgkCgJVr8N1NZr6GDPeIOPs1-bd8935w3.png)

2\. Leave all fields on the *General Settings* page blank and click *Continue*.

![](https://cdn.hackernoon.com/images/E0qcsgkCgJVr8N1NZr6GDPeIOPs1-w78s352e.png)

3\. 
Specify the *primary key - foreign key* relationship between the two tables and click *Continue*.

![](https://cdn.hackernoon.com/images/E0qcsgkCgJVr8N1NZr6GDPeIOPs1-3p9g35re.png)

4\. Review the column settings. Set any columns that link to a reference table to *Categorical*; otherwise, you'll break the references to this table. For our online shopping cart, this would be the **product_id** column in the **cart_item** table.

![](https://cdn.hackernoon.com/images/E0qcsgkCgJVr8N1NZr6GDPeIOPs1-xf9u3563.png)

5\. Click on *Launch run* and wait for the run to complete.

![](https://cdn.hackernoon.com/images/E0qcsgkCgJVr8N1NZr6GDPeIOPs1-inat355q.png)

6\. Scroll down to the *Actions* section of your run and click on *Generate more data*.

![](https://cdn.hackernoon.com/images/E0qcsgkCgJVr8N1NZr6GDPeIOPs1-atb935fp.png)

7\. Lastly, drag the *synthesized* version of the **cart** table to the upload area, and click *Generate* to synthesize the final table of your test dataset.

![](https://cdn.hackernoon.com/images/E0qcsgkCgJVr8N1NZr6GDPeIOPs1-i7dm35a1.png)

Congratulations! You've successfully synthesized a test dataset from your production data.

## Conditioned test data

Obtaining useful results from legacy test data generators is a laborious task. First, you'll need a deep understanding of the business logic and rules that underpin your production data. This knowledge is necessary to specify the conditions under which certain values occur. These can be simple rules: *80% of cars sold cost less than 20,000 euros*. They can also be nested: *cars that cost less than 20,000 euros are often compact city cars*. Or they can describe behavioral patterns: *customers visit a car dealer six times on average before buying a car*.

MOSTLY AI saves you the effort of understanding and defining granular business rules. In fact, it automates the process of studying the data to formulate these rules. 
As an AI-powered synthetic data generator, it learns the patterns that are present in the production data, assigns probabilities to the conditions under which certain values occur, and synthesizes your test data accordingly.

The resulting test data breathes realism and makes your product come alive before it's launched to its users.

## Conclusion

With MOSTLY AI, the struggles of launching new products have become a thing of the past. In this deep-dive, you've seen where the gains lie when creating a small test dataset, and how it can benefit and improve your work as a test engineer.

The Enterprise edition of MOSTLY AI allows you to scale these benefits to production databases that may well contain dozens of tables. You also won't need to perform any manual tasks: the steps that you went through can be fully automated using its REST API.

Either way, whether you're using the Community or Enterprise edition, you benefit from a significantly improved turnaround time for your test cases and the ability to point out hidden defects in the codebase with ease. Test data will no longer be the bottleneck that keeps you from delivering value to your customers.
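
Once the synthetic **user**, **cart**, and **cart_item** tables sit next to the copied **product** table, it can be worth confirming that the references described in the *Referential integrity* section actually hold. The snippet below is a minimal sketch of such a check using only Python's standard library. It is not part of MOSTLY AI. The column names (`id`, `user_id`, `cart_id`, `product_id`) and the inline CSV contents are assumptions made up for illustration, so adjust them to your own schema and exported files.

```python
import csv
import io


def read_rows(csv_text):
    """Parse CSV text into a list of dicts (one dict per row)."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def orphaned_keys(child_rows, fk, parent_rows, pk):
    """Return foreign-key values in the child table with no matching parent row."""
    parent_keys = {row[pk] for row in parent_rows}
    return {row[fk] for row in child_rows} - parent_keys


# Made-up stand-ins for the exported tables; in practice you would read
# the downloaded CSV files instead (file and column names are assumptions).
user = read_rows("id\n1\n2\n3\n")
cart = read_rows("id,user_id\n10,1\n11,3\n")
product = read_rows("id\n100\n101\n")
cart_item = read_rows("id,cart_id,product_id\n1000,10,100\n1001,10,101\n1002,11,100\n")

# One entry per primary key - foreign key relationship in the schema.
checks = {
    "cart.user_id -> user.id": orphaned_keys(cart, "user_id", user, "id"),
    "cart_item.cart_id -> cart.id": orphaned_keys(cart_item, "cart_id", cart, "id"),
    "cart_item.product_id -> product.id": orphaned_keys(
        cart_item, "product_id", product, "id"
    ),
}

for relationship, orphans in checks.items():
    status = "OK" if not orphans else f"orphaned keys: {sorted(orphans)}"
    print(f"{relationship}: {status}")
```

An empty set for every relationship means each foreign key points at an existing row. A non-empty set for `product_id` would suggest that the column was not set to *Categorical* before the run.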