Test Data That Thinks for Itself: AI-Powered Test Data Generation

Setting the Stage: Unleashing the Data Dragon

How often have you struggled to generate test data for your tests? Struggle no more! Join the dark side and start using AI to generate it. Imagine this: you’re a regular software tester, armed with your trusty test cases and ready to find all the bugs… but instead you spend hours generating the data you need.

Test data generation is the process of creating meaningful, representative data that reflects real-world scenarios in the software. It’s the main entry point for validating the functionality, performance, and security of applications. Ensuring data diversity and coverage is no easy feat either: testers must include a wide range of data variations, edge cases, and boundary conditions to check the software thoroughly.

Another challenge is the limited availability of quality test data. Real or sensitive data often cannot be used for testing because of privacy concerns or data protection regulations. That keeps users safe, but it means testers spend a significant amount of time generating this data manually.

In this article we will discuss how AI can help us with test data generation and what the best approaches are.

The test data in this article was generated by ChatGPT and Google Bard.

JSON file generation

Let’s imagine: we have an API test, and we need to send a request with JSON for two cities in the EU, along with a correct phone number for each city.

This JSON file has to have:

City
Country
Country code
Phone number

Let’s try to create a prompt for this task:

Dear Chat GPT, I need you to generate JSON file with 2 the most popular cities in EU. JSON should have following variables: city, country, country code and correct phone number according to the city.

Result:

[
  {
    "city": "Berlin",
    "country": "Germany",
    "country_code": "DE",
    "phone_number": "+49 30 12345678"
  },
  {
    "city": "Paris",
    "country": "France",
    "country_code": "FR",
    "phone_number": "+33 1 23456789"
  }
]

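We saved so much time! Now, instead of browsing and searching for country codes and phone number formats, we can use AI to generate the data. Note that only the format is real here: the country and area codes are right (+49 30 for Berlin, +33 1 for Paris), while the remaining digits are obvious placeholders, which is exactly what you want in test data.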
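
Before wiring generated data into a test, it is worth a quick sanity check. The sketch below is one possible way to do that in Python: it parses the JSON from the result above and verifies that every record has the four requested fields and a phone number in international format. The function name validate_records and the regular expression are illustrative assumptions, not part of any tool mentioned in this article.

```python
import json
import re

# JSON as returned for the prompt above
GENERATED = """
[
  {"city": "Berlin", "country": "Germany", "country_code": "DE", "phone_number": "+49 30 12345678"},
  {"city": "Paris", "country": "France", "country_code": "FR", "phone_number": "+33 1 23456789"}
]
"""

REQUIRED_FIELDS = {"city", "country", "country_code", "phone_number"}
# Assumed format: "+", a calling code, then digit groups separated by single spaces
PHONE_PATTERN = re.compile(r"^\+\d{1,3}( \d+)+$")


def validate_records(raw):
    """Return a list of problems found; an empty list means the data looks usable."""
    problems = []
    for index, record in enumerate(json.loads(raw)):
        missing = REQUIRED_FIELDS - set(record)
        if missing:
            problems.append(f"record {index}: missing {sorted(missing)}")
        phone = record.get("phone_number", "")
        if not PHONE_PATTERN.match(phone):
            problems.append(f"record {index}: unexpected phone format {phone!r}")
    return problems


if __name__ == "__main__":
    print(validate_records(GENERATED) or "all records look fine")
```

A check like this only confirms the shape of the data. Whether the values make sense for your application still needs a human look.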

Additional data generation

Let’s take the JSON example and ask ChatGPT to generate a few more data sets based on the JSON we provide.

My lovely Chat GPT, based on the JSON which I’m providing to you, could generate more data to this?

<JSON>

Result:

[
  {
    "city": "Berlin",
    "country": "Germany",
    "country_code": "DE",
    "phone_number": "+49 30 12345678"
  },
  {
    "city": "Paris",
    "country": "France",
    "country_code": "FR",
    "phone_number": "+33 1 23456789"
  },
  {
    "city": "New York",
    "country": "United States",
    "country_code": "US",
    "phone_number": "+1 212 3456789"
  },
  {
    "city": "London",
    "country": "United Kingdom",
    "country_code": "UK",
    "phone_number": "+44 20 12345678"
  }
]
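
One detail in this result deserves attention: the ISO 3166-1 alpha-2 code for the United Kingdom is "GB", not "UK". If the API under test expects ISO codes, the London record is either a mistake or a handy negative case, and you should know which. A minimal sketch of how such a mismatch could be caught automatically follows. The reference mapping is hand-written for these four countries only and is an assumption for illustration; a real project would likely rely on a complete list.

```python
# Hand-written reference for the four countries in the example (ISO 3166-1 alpha-2).
# Deliberately incomplete: it exists only to illustrate the check.
ISO_ALPHA2 = {
    "Germany": "DE",
    "France": "FR",
    "United States": "US",
    "United Kingdom": "GB",
}

# Country fields copied from the generated result above
records = [
    {"city": "Berlin", "country": "Germany", "country_code": "DE"},
    {"city": "Paris", "country": "France", "country_code": "FR"},
    {"city": "New York", "country": "United States", "country_code": "US"},
    {"city": "London", "country": "United Kingdom", "country_code": "UK"},
]


def find_code_mismatches(rows):
    """Return (city, got, expected) for every row whose code differs from the reference."""
    mismatches = []
    for row in rows:
        expected = ISO_ALPHA2.get(row["country"])
        if expected != row["country_code"]:
            mismatches.append((row["city"], row["country_code"], expected))
    return mismatches


if __name__ == "__main__":
    for city, got, expected in find_code_mismatches(records):
        print(f"{city}: got {got}, expected {expected}")
```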

YAML file generation

Let’s imagine: we have to test an e-commerce shop integration, and for this we need a YAML file with products inside.

Let’s try to create a prompt for this task:

Dearest Chat GPT, I need you to generate YAML file with 3 random products inside. Can you add as much complexity as possible?

(We want this to be as complex as possible to test how our e-commerce integration works with unusual attributes.)

Result:

- name: Product 1
  description: High-end smartphone with advanced features and powerful processor.
  price:
    amount: 999.99
    currency: USD
  quantity:
    available: 50
    minimum_order: 1
    maximum_order: 5
  category: Electronics
  specifications:
    - display: OLED
    - processor: Qualcomm Snapdragon 855
    - camera: Dual rear cameras (12MP + 16MP)
    - battery: 4000mAh
    - features:
      - Wireless charging
      - Water and dust resistance (IP68)
- name: Product 2
  description: Stylish and durable stainless steel watch with a classic design.
  price:
    amount: 199.99
    currency: USD
  quantity:
    available: 20
    minimum_order: 1
    maximum_order: 3
  category: Fashion
  specifications:
    - movement: Quartz
    - crystal: Sapphire
    - water_resistance: 100 meters
    - strap: Stainless steel bracelet
- name: Product 3
  description: Versatile laptop designed for work and entertainment.
  price:
    amount: 1499.99
    currency: USD
  quantity:
    available: 30
    minimum_order: 1
    maximum_order: 2
  category: Electronics
  specifications:
    - display: 15.6-inch high-resolution screen
    - processor: Intel Core i7
    - storage: 1TB SSD
    - keyboard: Backlit
    - security: Fingerprint scanner
    - battery: 8 hours

That looks awesome!

XML file generation

Let’s imagine: we need to generate an XML file with product catalogs inside for testing e-commerce software.

Dear Chat GPT, I need you to generate an XML file with 3 catalogs inside. Be as creative as you can.

Result:

<?xml version="1.0" encoding="UTF-8"?>
<catalogs>
  <catalog>
    <id>101</id>
    <name>Enchanted Garden</name>
    <description>A magical collection of plants and garden accessories that bring your outdoor spaces to life.</description>
  </catalog>
  <catalog>
    <id>102</id>
    <name>Adventure Gear</name>
    <description>Equip yourself for thrilling adventures with our wide range of gear, from hiking essentials to camping equipment.</description>
  </catalog>
  <catalog>
    <id>103</id>
    <name>Mystical Potions</name>
    <description>Discover the secrets of ancient alchemy with our extraordinary collection of potions and elixirs.</description>
  </catalog>
</catalogs>
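
To use a file like this in an automated test, you would typically load it and turn each catalog into a plain data structure. Here is a rough sketch using Python's standard xml.etree.ElementTree module. The helper name load_catalogs is hypothetical, and the descriptions are shortened to keep the example compact.

```python
import xml.etree.ElementTree as ET

# Generated XML from above, with descriptions shortened for brevity
GENERATED_XML = """<?xml version="1.0" encoding="UTF-8"?>
<catalogs>
  <catalog>
    <id>101</id>
    <name>Enchanted Garden</name>
    <description>Plants and garden accessories.</description>
  </catalog>
  <catalog>
    <id>102</id>
    <name>Adventure Gear</name>
    <description>Hiking essentials and camping equipment.</description>
  </catalog>
  <catalog>
    <id>103</id>
    <name>Mystical Potions</name>
    <description>Potions and elixirs.</description>
  </catalog>
</catalogs>"""


def load_catalogs(xml_text):
    """Parse the XML and return one dict per <catalog> element."""
    # Encode to bytes so the parser honors the encoding declaration
    root = ET.fromstring(xml_text.encode("utf-8"))
    catalogs = []
    for node in root.findall("catalog"):
        catalogs.append({
            "id": int(node.findtext("id")),
            "name": node.findtext("name"),
            "description": node.findtext("description"),
        })
    return catalogs


if __name__ == "__main__":
    for catalog in load_catalogs(GENERATED_XML):
        print(catalog["id"], catalog["name"])
```

Parsing the file also doubles as a well-formedness check: if the model returns broken XML, ET.fromstring raises a ParseError right away.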

SQL data generation

Let’s imagine: we’ve got some SQL automation tests to write, and we need to generate some data for these automated tests.

Dear ChatGPT I need you to generate SQL script where you put some data to XXX table. It has username, order number, price for the order and current time.

Result:

INSERT INTO XXX (username, order_number, price, current_time)
VALUES ('JohnDoe', 'ORD001', 19.99, CURRENT_TIMESTAMP),
       ('JaneSmith', 'ORD002', 24.99, CURRENT_TIMESTAMP),
       ('RobertJohnson', 'ORD003', 14.99, CURRENT_TIMESTAMP);

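We got a script that is ready to use, along with some generated data for it. One caveat: CURRENT_TIME is a reserved keyword in databases such as MySQL and PostgreSQL, so a column named current_time may need to be quoted or renamed there.

But let’s try asking for something more complex: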

Dear ChatGPT I need you to generate SQL script where you put some data to XXX table. It has id, username, order number, price for the order and current time. Second table should have id, username, phone number and address. I want you to join these two tables and write me the result.

Result:

-- Create XXX table
CREATE TABLE XXX (
    id INT,
    username VARCHAR(50),
    order_number VARCHAR(50),
    price DECIMAL(10, 2),
    current_time TIMESTAMP
);

-- Insert data into XXX table
INSERT INTO XXX (id, username, order_number, price, current_time)
VALUES (1, 'JohnDoe', 'ORD001', 19.99, CURRENT_TIMESTAMP),
       (2, 'JaneSmith', 'ORD002', 24.99, CURRENT_TIMESTAMP),
       (3, 'RobertJohnson', 'ORD003', 14.99, CURRENT_TIMESTAMP);

-- Create YYY table
CREATE TABLE YYY (
    id INT,
    username VARCHAR(50),
    phone_number VARCHAR(15),
    address VARCHAR(100)
);

-- Insert data into YYY table
INSERT INTO YYY (id, username, phone_number, address)
VALUES (1, 'JohnDoe', '1234567890', '123 Main St'),
       (2, 'JaneSmith', '9876543210', '456 Elm St'),
       (3, 'RobertJohnson', '5555555555', '789 Oak St');

-- Perform a join between XXX and YYY tables
SELECT XXX.id, XXX.username, XXX.order_number, XXX.price, YYY.phone_number, YYY.address
FROM XXX
JOIN YYY ON XXX.username = YYY.username;
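
The prompt also asked for the result of the join, but the model returned only the script. One way to see the actual rows is to run the script against an in-memory SQLite database, as in the sketch below. Two things here are my own assumptions rather than part of the generated output: the time column is renamed to created_at to stay clear of the CURRENT_TIME keyword, and an ORDER BY is added so the row order is predictable.

```python
import sqlite3

# Generated script, condensed; the time column is renamed to created_at (assumption)
SCRIPT = """
CREATE TABLE XXX (id INT, username VARCHAR(50), order_number VARCHAR(50),
                  price DECIMAL(10, 2), created_at TIMESTAMP);
INSERT INTO XXX (id, username, order_number, price, created_at)
VALUES (1, 'JohnDoe', 'ORD001', 19.99, CURRENT_TIMESTAMP),
       (2, 'JaneSmith', 'ORD002', 24.99, CURRENT_TIMESTAMP),
       (3, 'RobertJohnson', 'ORD003', 14.99, CURRENT_TIMESTAMP);
CREATE TABLE YYY (id INT, username VARCHAR(50), phone_number VARCHAR(15),
                  address VARCHAR(100));
INSERT INTO YYY (id, username, phone_number, address)
VALUES (1, 'JohnDoe', '1234567890', '123 Main St'),
       (2, 'JaneSmith', '9876543210', '456 Elm St'),
       (3, 'RobertJohnson', '5555555555', '789 Oak St');
"""

# Same join as in the generated script, plus ORDER BY for a stable row order
JOIN_QUERY = """
SELECT XXX.id, XXX.username, XXX.order_number, XXX.price, YYY.phone_number, YYY.address
FROM XXX
JOIN YYY ON XXX.username = YYY.username
ORDER BY XXX.id;
"""


def run_join():
    """Create both tables in a throwaway database and return the joined rows."""
    connection = sqlite3.connect(":memory:")
    try:
        connection.executescript(SCRIPT)
        return connection.execute(JOIN_QUERY).fetchall()
    finally:
        connection.close()


if __name__ == "__main__":
    for row in run_join():
        print(row)
```

Because the database lives in memory, the check leaves nothing behind and can run as part of an ordinary test suite.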

Data generation for testing

Sometimes you have to test an input field, and that means a lot of test cases and loads of data. But what if you don’t have time to write all these test cases? How do you check that you’ve tested everything? How do you generate random data for every case you’re going to run?

The answer is easy! Let’s use ChatGPT 😀

Dear ChatGPT, can you generate me test data for LinkedIn search field with positive, negative scenarios, and with edge cases

Result:

Positive scenarios:

Search for a job title: "Software Engineer"
Search for a company name: "Google"
Search for a location: "San Francisco, CA"
Search for a skill: "Python"
Search for a keyword: "machine learning"

Negative scenarios:

Search for a job title that does not exist: "Unicorn Hunter"
Search for a company name that does not exist: "Acme Corporation"
Search for a location that does not exist: "Hogwarts, UK"
Search for a skill that does not exist: "Quantum Computing"
Search for a keyword that is too common: "the"

Edge cases:

Search for a job title with multiple words: "Senior Software Engineer"
Search for a company name with special characters: "Acme, Inc."
Search for a location with multiple words: "San Francisco Bay Area"
Search for a skill with multiple words: "Data Science"
Search for a keyword with special characters: "@"

It needs a few more cases, but it’s a great start! You can use it yourself, hand it to the office intern, or write test cases from it.
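
If you want to feed a list like this into automated tests, the plain-text answer first has to become structured data. The following is a small sketch of that step, assuming the answer keeps the layout shown above: a heading ending in a colon, followed by lines of the form description: "value". The name parse_cases and the regular expression are illustrative, and only a few of the generated lines are included.

```python
import re

# A shortened copy of the generated answer
GENERATED_TEXT = '''
Positive scenarios:

Search for a job title: "Software Engineer"
Search for a company name: "Google"

Negative scenarios:

Search for a job title that does not exist: "Unicorn Hunter"

Edge cases:

Search for a company name with special characters: "Acme, Inc."
Search for a keyword with special characters: "@"
'''

# A case line looks like: some description: "value"
CASE_LINE = re.compile(r'^(?P<description>.+): "(?P<value>.*)"$')


def parse_cases(text):
    """Return (scenario, description, value) tuples in the order they appear."""
    cases = []
    scenario = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = CASE_LINE.match(line)
        if match:
            cases.append((scenario, match["description"], match["value"]))
        elif line.endswith(":"):
            # A heading such as "Edge cases:" starts a new group
            scenario = line[:-1]
    return cases


if __name__ == "__main__":
    for case in parse_cases(GENERATED_TEXT):
        print(case)
```

The resulting tuples can then drive a data-driven test, for example through pytest.mark.parametrize or unittest's subTest.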

SQL injection

Let’s imagine: we also need to check how well the input is protected against security attacks.

Dear ChatGPT, can you generate me SQL injection LinkedIn search field

Result:

1. SQL Injection Attack:
Input: ' OR '1'='1
Expected Result: The search query is manipulated to retrieve all records from the database, potentially exposing sensitive information.

2. Union-based SQL Injection:
Input: ' UNION SELECT username, password FROM users --
Expected Result: The search query is manipulated to perform a UNION SELECT operation, potentially retrieving usernames and passwords from the database.

3. Boolean-based Blind SQL Injection:
Input: ' OR (SELECT COUNT(*) FROM users) > 0 --
Expected Result: The search query is manipulated to perform a Boolean-based SQL injection, exploiting a vulnerability in the application's response to different conditions.

4. Time-based Blind SQL Injection:
Input: ' OR SLEEP(5) --
Expected Result: The search query is manipulated to introduce a delay (in this case, 5 seconds) to test for time-based blind SQL injection vulnerabilities.

5. Error-based SQL Injection:
Input: ' OR 1/0 --
Expected Result: The search query is manipulated to trigger a SQL error, potentially revealing information about the database structure or sensitive data.
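
To see why the first payload works, and what a passing test looks like, here is a toy sketch against an in-memory SQLite database. Everything in it is hypothetical: the jobs table and the two search functions are stand-ins and say nothing about how any real search field is implemented. One function builds the query by string concatenation, the other uses a parameterized query.

```python
import sqlite3

PAYLOAD = "' OR '1'='1"  # first input from the generated list above


def make_db():
    """Create a throwaway database with a tiny, made-up jobs table."""
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE jobs (title TEXT)")
    connection.executemany(
        "INSERT INTO jobs (title) VALUES (?)",
        [("Software Engineer",), ("Data Scientist",), ("QA Engineer",)],
    )
    return connection


def search_unsafe(connection, term):
    # Vulnerable on purpose: user input is concatenated into the SQL text
    query = "SELECT title FROM jobs WHERE title = '" + term + "'"
    return [row[0] for row in connection.execute(query)]


def search_safe(connection, term):
    # Parameterized query: the input is passed as data, never parsed as SQL
    query = "SELECT title FROM jobs WHERE title = ?"
    return [row[0] for row in connection.execute(query, (term,))]


if __name__ == "__main__":
    db = make_db()
    print("unsafe:", search_unsafe(db, PAYLOAD))
    print("safe:  ", search_safe(db, PAYLOAD))
```

With the payload, the concatenated query becomes WHERE title = '' OR '1'='1' and matches every row, while the parameterized version treats the whole payload as a literal title and finds nothing. A security test built on the generated inputs would assert the second behavior.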

In conclusion, AI-powered test data generation is a powerful and efficient solution to one of the recurring challenges of testing. We can now automate the generation of almost any test data set and spend seconds on it instead of hours. However, it's important to remember that AI-generated test data is not a panacea. Human expertise is still a crucial part of defining appropriate test data sets, validating results, and more.