Creating test data can be an arduous task. More often than not, it takes a considerable amount of manual work to obtain a dependable subset of your production data that maintains its referential integrity, business rules, and representative business scenarios, all while protecting the privacy of your data subjects.

This blog post will help you expedite your test data creation efforts. With just a few mouse clicks in MOSTLY AI's synthetic data platform, you'll have ready-to-use test data for your development environment, free of charge and without the need to configure any business rules or logic. Our AI-powered synthetic data engine learns all your dataset's features, taking care of the hard work for you.

Before we jump head-first into generating test data, let's first explore a test dataset that's common in e-commerce software engineering: the online shopping cart. The Entity-Relationship diagram below depicts a small version of this dataset and consists of 4 tables:

- user, which contains the data subjects whose privacy you want to protect
- cart and cart_item, which are linked tables that represent a user's shopping cart and its contents
- product, which is a reference table that contains all the products this online store sells

We will use this database schema to illustrate the steps you need to take to create a subset and maintain its referential integrity, and to demonstrate how MOSTLY AI retains the business rules that underpin the data.

We won't need to synthesize the reference table product in these steps. As it contains product information, such as a product's name, brand, price, and barcode, there's no sensitive or privacy-relevant information that we need to worry about. Reference tables like this one can be copied to the test environment as is. Only the tables with privacy-sensitive data need to be synthesized, namely the subject table user and the linked tables cart and cart_item.

Subsetting

Production datasets can be terabytes in size.
But what you want for testing is the right data rather than all of it. MOSTLY AI can create a smaller-sized, representative subset of your data that covers all the business scenarios for testing. Instead of randomly sampling the data, it'll synthesize a subset that accurately represents all the production data, QA report included.

This is analogous to the way that, for instance, sociologists conduct population studies. It wouldn't be feasible for them to interview the entire population of a country. Instead, they carefully select a sample from which they can draw conclusions about the complete population. Useful and realistic test data works in exactly the same way. Even though it's only a sample, its test cases are representative of the entirety of the production data.

With this in mind, let's start generating a part of our synthetic test dataset. The references to the online shopping cart database schema are only there to illustrate the workflow. However, you can use it as a template to extract a dataset from your production data.

1. Log in to MOSTLY AI's free edition and click on Runs in the left side menu.

2. Drag the CSV file that contains the user table into the Subject table pane.

3. Click on Add table and drag the CSV file that's linked to this subject table to the Table 2 pane. In the case of our online shopping cart, this would be the cart table.

4. You will see the Configure Run Data page once you click Proceed. Here you can specify how large you want your subset to be. If your production data contains 100,000 subjects, you can take a 10% sample by entering 10,000 in the Number of generated subjects field.

You can also use Number of processed subjects instead of Number of generated subjects. This will create a subset of your production data before the AI model is trained. This speeds up the test data generation process, but will result in a less accurate rendering of the statistical distributions and correlations in your production data.

5. A Join tables form appears once you click Continue. Here you can specify the primary key - foreign key relationship between the two tables.

6. Click Continue again to configure your columns' encoding types. MOSTLY AI has already analyzed the contents of your tables for their data types and unique values. You can use the Configure tables page to review the settings before launching the run.

7. Lastly, click on Launch run, sit back, and relax.

Referential integrity

For a test dataset to be useful, it must contain the same relationships as the production dataset, with accurate values based on business rules. With MOSTLY AI, you can ensure referential integrity by conditionally generating the linked tables of your production dataset.

Conditional generation works like an auto-complete for your data. Once you have a training model of a subject table-linked table pair, you can "ask" this model to "complete" any subject table you throw at it, as long as the columns and data types are the same as the original subject table. Upload your handcrafted, updated, or synthesized versions of the subject table, and MOSTLY AI will generate a fitting, realistic linked table.

For our online shopping cart dataset, we'll need to conditionally generate the cart_item table. It's the last table we need to synthesize, and the items in this table must refer to the carts in the synthetic cart table. This procedure differs slightly from the subsetting step. First, we'll need to create a training model using the original cart and cart_item tables. We can then use MOSTLY AI's "Generate more data" function to generate the synthetic version of the cart_item table.

We also need to consider the reference table product in this step. Even though we won't synthesize it, we do need to ensure that we maintain the references to this table. You do this by assigning the Categorical encoding type to the foreign key column that links the cart_item table to the product table. This ensures that the synthetic foreign key column contains only values that are already present in the original foreign key column.

Now let's complete our test dataset by walking through the steps below:

1. Drag the original cart table into the Subject table pane and the original cart_item table to the Table 2 pane and click Proceed.

2. Leave all fields on the General Settings page blank and click Continue.

3. Specify the primary key - foreign key relationship between the two tables and click Continue.

4. Review the column settings. Please set any columns that link to a reference table to Categorical, otherwise you'll break the references to this table. For our online shopping cart, this would be the product_id column in the cart_item table.

5. Click on Launch run and wait for the run to complete.

6. Scroll down to the Actions section of your run and click on Generate more data.

7. Lastly, drag the synthesized version of the cart table to the upload area, and click Generate to synthesize the final table of your test dataset.

Congratulations! You successfully synthesized a test dataset from your production data.

Conditioned test data

Obtaining useful results from legacy test data generators is a laborious task. First, you'll need a deep understanding of the business logic and rules that underpin your production data. This knowledge is necessary to specify the conditions in which certain values occur. These can be simple rules: 80% of cars sold cost less than 20,000 euros. But they can also be nested: cars that cost less than 20,000 euros are often compact city cars. Or they can describe behavioral patterns: customers visit a car dealer six times on average before buying a car.

MOSTLY AI saves you the effort of understanding and defining granular business rules. In fact, it automates the process of studying the data to formulate these rules.
As an AI-powered synthetic data generator, it learns the patterns that are present in the production data, assigns probabilities to the conditions in which certain values occur, and synthesizes your test data accordingly. The resulting test data breathes realism and makes your product come alive before it's launched to its users.

Conclusion

With MOSTLY AI, the struggles of launching new products have become a thing of the past. In this deep-dive, you've experienced where the gains lie when creating a small test dataset, and how it can benefit and improve your work as a test engineer.

The Enterprise edition of MOSTLY AI allows you to scale these benefits to production databases that may well contain dozens of tables. You also wouldn't need to perform any manual tasks. Rather, the steps that you went through can be fully automated using its REST API.

Either way, whether you're using the Free or Enterprise edition, you're benefitting from a significantly improved turnaround time for your test cases and the ability to point out hidden defects in the codebase with ease. Test data will no longer be the bottleneck that keeps you from delivering value to your customers.
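If you'd like to double-check the referential integrity of the finished test dataset yourself, a few lines of plain Python can do it. The sketch below is only an illustration and not part of MOSTLY AI: the column names (id, cart_id, product_id) and the sample rows are assumptions made up for this example, so swap in the columns and CSV exports of your own tables. It looks for cart_item rows that point to a cart or product that doesn't exist.

```python
import csv
import io


def read_csv(text):
    """Parse CSV text into a list of dicts (all values stay strings)."""
    return list(csv.DictReader(io.StringIO(text)))


def find_orphans(child_rows, fk_column, parent_rows, pk_column):
    """Return child rows whose foreign key matches no parent primary key."""
    parent_keys = {row[pk_column] for row in parent_rows}
    return [row for row in child_rows if row[fk_column] not in parent_keys]


# Tiny in-memory stand-ins for the exported tables.
# Column names and values are hypothetical, chosen for illustration only.
cart = read_csv("id,user_id\n1,10\n2,11\n")
product = read_csv("id,name\n100,Mug\n101,Pen\n")
# The last cart_item row deliberately references a cart that doesn't exist.
cart_item = read_csv("id,cart_id,product_id\n1,1,100\n2,2,101\n3,3,100\n")

orphan_carts = find_orphans(cart_item, "cart_id", cart, "id")
orphan_products = find_orphans(cart_item, "product_id", product, "id")

print("cart_item rows without a cart:", orphan_carts)
print("cart_item rows without a product:", orphan_products)
```

Both lists should be empty for a usable test dataset. In this made-up sample, the first list contains the deliberately broken row, which is the kind of defect the Categorical encoding and conditional generation steps are meant to prevent.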