7,287 reads

How to Efficiently Generate Test Data With SQL

by Yuri DanilovSeptember 26th, 2022

Too Long; Didn't Read

There are a large number of utilities for generating test data. But in some cases you may wish to generate this data manually. I would like to share my own experience in this area and hope that the examples that I will cover in this article will help you with this.

Companies Mentioned

Coin Mentioned

featured image - How to Efficiently Generate Test Data With SQL

There are a large number of utilities for generating test data. But in some cases, you may wish to generate this data manually.

I would like to share my own experience in this area and hope that the examples that I will cover in this article will help you with this.

You can always use database-specific functions and they will probably work more efficiently, but I use primitive functions to make the code as portable as possible.

The examples below are for PostgreSQL but can be adapted for other databases with some modifications. Later I will talk about this.

Series

First of all, we need a query generating a specified number of rows. I prefer to use the generate_series function for this:

select s as id from generate_series(1, 1000) AS s; -- number of rows to generate

The result set contains 1000 rows:

IDs

First and last name

Let's say we want to generate the names of our clients.

We can use a couple of arrays containing first names and last names and randomly select values from them:

select
    arrays.firstnames[trunc(random() * ARRAY_LENGTH(arrays.firstnames, 1) + 1)] AS firstname,
    arrays.lastnames[trunc(random() * ARRAY_LENGTH(arrays.lastnames,1) + 1)] AS lastname
from  (
    SELECT ARRAY[
		'Adam', 'Bill', 'Bob', 'Donald', 'Frank', 'George', 'James', 'John', 'Jacob', 'Jack', 'Martin', 'Matthew', 'Max', 'Michael', 'Paul','Peter', 'Ronald',
		'Samuel','Steve','William', 'Abigail', 'Alice', 'Amanda', 'Barbara','Betty', 'Carol', 'Donna', 'Jane','Jennifer','Julie','Mary','Melissa','Sarah','Susan'
    ] AS firstnames,
    ARRAY[
        'Matthews','Smith','Jones','Davis','Jacobson','Williams','Donaldson','Maxwell','Peterson','Stevens', 'Franklin','Washington','Jefferson','Adams',
        'Jackson','Johnson','Lincoln','Grant','Fillmore','Harding','Taft', 'Truman','Nixon','Ford','Carter','Reagan','Bush','Clinton','Hancock'
    ] AS lastnames
) AS arrays;

The result:

Here we use the expression that generates a random index within the array length:

trunc(random() * ARRAY_LENGTH(arrays.firstnames, 1) + 1)

Connecting arrays with series

Combined with the generate_series function, our query can now generate 1000 rows of random first and last names with IDs:

select s as id,
    arrays.firstnames[trunc(random() * ARRAY_LENGTH(arrays.firstnames, 1) + 1)] AS firstname,
    arrays.lastnames[trunc(random() * ARRAY_LENGTH(arrays.lastnames,1) + 1)] AS lastname
from generate_series(1, 1000) AS s -- number of rows to generate
CROSS JOIN(
    SELECT ARRAY[
		'Adam', 'Bill', 'Bob', 'Donald', 'Frank', 'George', 'James', 'John', 'Jacob', 'Jack', 'Martin', 'Matthew', 'Max', 'Michael', 'Paul','Peter', 'Ronald',
		'Samuel','Steve','William', 'Abigail', 'Alice', 'Amanda', 'Barbara','Betty', 'Carol', 'Donna', 'Jane','Jennifer','Julie','Mary','Melissa','Sarah','Susan'
    ] AS firstnames,
    ARRAY[
        'Matthews','Smith','Jones','Davis','Jacobson','Williams','Donaldson','Maxwell','Peterson','Stevens', 'Franklin','Washington','Jefferson','Adams',
        'Jackson','Johnson','Lincoln','Grant','Fillmore','Harding','Taft', 'Truman','Nixon','Ford','Carter','Reagan','Bush','Clinton','Hancock'
    ] AS lastnames
) AS arrays;

And an example of the result:

Random letter for a middle name

Suppose we also need a random letter for a middle name, then we can use the following query:

select substring('ABCDEFGHIJKLMNOPQRSTUVWXYZ' from trunc(random() * 26 + 1)::int for 1) AS middlename;

We take a string containing all the letters of the alphabet and choose a random character from it.

The result:

Date of birth

What other fields might we need? Maybe a date of birth.

The following expression returns a date between 18 and 100(approximately) years ago:

select date(now() - trunc(random() * 365 * 82 /*max age + 18*/) * '1 day'::interval - interval '18 year' /* min age*/) as birth_date;

Example of birth date:

You can check the correctness of the expression by replacing the random with 1. This expression returns us to date approximately 100 years ago:

select date(now() - trunc(1 * 365 * 82 /*max age + 18*/) * '1 day'::interval - interval '18 year' /* min age*/) as birth_date;

"Approximately" - because we don't take into account leap years.

SSN

SSN is a 9-digit value that can be obtained in the following way:

select 100000000 + round(random() * 900000000) as ssn;

SSN:

SSN

The SSN is usually written in the format 999-99-9999. A little later we will see how to do this.

Money

Also, for financial institutions, the amount on the client's account matters.

Often it is represented by 2 decimal places, so we will specify an interval and round it up:

select round((random() * 100000)::numeric, 2) as amount;

The result:

Address and phone

Often you need information about the customer's address and phone.

An address usually consists of a house number, street, city, and state.

Let's add the corresponding arrays to our query:

    ARRAY[
		'Green', 'Smith', 'Church', 'Grant', 'Cedar', 'Forest', 'Frankl', 'Birch', 'Jones', 'Brown',
		'Cherry', 'Willow', 'Rose', 'School', 'Wilson', 'Center', 'Walnut', 'Mill', 'Valley'
    ] as streets,
    ARRAY[
		'Washington', 'Franklin', 'Clinton', 'Georgetown', 'Springfield', 'Chester', 'Greenville', 'Dayton', 'Madison', 'Salem',
		'Winchester', 'Oakland', 'Milton', 'Newport', 'Ashland', 'Riverside', 'Manchester', 'Oxford', 'Burlington', 'Jackson', 'Milford'
    ] as cities,
    ARRAY[
		'AK','AL','AR','AZ','CA','CO','CT','DC','DE','FL','GA','HI','IA','ID','IL','IN','KS','KY','LA','MA','MD','ME','MI','MN','MS','MO',
		'MT','NC','NE','NH','NJ','NM','NV','NY','ND','OH','OK','OR','PA','RI','SC','SD','TN','TX','UT','VT','VA','WA','WV','WI','WY'
    ] as states

The street name is often followed by its type. We can use the case operator with a random switch parameter like this:

select case (random() * 2)::int when 0 then 'St.' when 1 then 'Ave.' when 2 then 'Rd.' end as street_type;

The result:

The case construct is useful when you want to randomize a small number of distinct values.

A house number is just a random number, let's say between 100 and 10000:

select (100 + random() * 9900)::int as house;

The result:

A phone number in the 'XXX XXX XXXX' format can be generated like this:

select concat(FLOOR(100 + random() * 900), ' ', FLOOR(100 + random() * 900), ' ', FLOOR(1000 + random() * 9000)) as phone;

Phone number:

Putting it all together

Now we put everything together and we get the following query:

select s as id,
    arrays.firstnames[trunc(random() * ARRAY_LENGTH(arrays.firstnames, 1) + 1)] AS firstname,
    substring('ABCDEFGHIJKLMNOPQRSTUVWXYZ' from trunc(random() * 26 + 1)::int for 1) AS middlename,
    arrays.lastnames[trunc(random() * ARRAY_LENGTH(arrays.lastnames,1) + 1)] AS lastname,
	date(now() - trunc(random() * 365 * 82 /*max age + 18*/) * '1 day'::interval - interval '18 year' /* min age*/) as birth_date,
	100000000 + round(random() * 900000000) as ssn,
	round((random() * 100000)::numeric, 2) as amount,
	(100 + random() * 9900)::int as house,
	arrays.streets[trunc(random() * ARRAY_LENGTH(arrays.streets, 1) + 1)]  AS street,
	case (random() * 2)::int when 0 then 'St.' when 1 then 'Ave.' when 2 then 'Rd.' end as street_type,
	arrays.cities[trunc(random() * ARRAY_LENGTH(arrays.cities, 1) + 1)] AS city,
	arrays.states[trunc(random() * ARRAY_LENGTH(arrays.states, 1) + 1)] AS state,
	concat(FLOOR(100 + random() * 900), ' ', FLOOR(100 + random() * 900), ' ', FLOOR(1000 + random() * 9000)) as phone
from generate_series(1, 1000) AS s -- number of rows to generate
CROSS JOIN(
    SELECT ARRAY[
		'Adam', 'Bill', 'Bob', 'Donald', 'Frank', 'George', 'James', 'John', 'Jacob', 'Jack', 'Martin', 'Matthew', 'Max', 'Michael', 'Paul','Peter', 'Ronald',
		'Samuel','Steve','William', 'Abigail', 'Alice', 'Amanda', 'Barbara','Betty', 'Carol', 'Donna', 'Jane','Jennifer','Julie','Mary','Melissa','Sarah','Susan'
    ] AS firstnames,
    ARRAY[
        'Matthews','Smith','Jones','Davis','Jacobson','Williams','Donaldson','Maxwell','Peterson','Stevens', 'Franklin','Washington','Jefferson','Adams',
        'Jackson','Johnson','Lincoln','Grant','Fillmore','Harding','Taft', 'Truman','Nixon','Ford','Carter','Reagan','Bush','Clinton','Hancock'
    ] AS lastnames,
    ARRAY[
		'Green', 'Smith', 'Church', 'Grant', 'Cedar', 'Forest', 'Frankl', 'Birch', 'Jones', 'Brown',
		'Cherry', 'Willow', 'Rose', 'School', 'Wilson', 'Center', 'Walnut', 'Mill', 'Valley'
    ] as streets,
    ARRAY[
		'Washington', 'Franklin', 'Clinton', 'Georgetown', 'Springfield', 'Chester', 'Greenville', 'Dayton', 'Madison', 'Salem',
		'Winchester', 'Oakland', 'Milton', 'Newport', 'Ashland', 'Riverside', 'Manchester', 'Oxford', 'Burlington', 'Jackson', 'Milford'
    ] as cities,
    ARRAY[
		'AK','AL','AR','AZ','CA','CO','CT','DC','DE','FL','GA','HI','IA','ID','IL','IN','KS','KY','LA','MA','MD','ME','MI','MN','MS','MO',
		'MT','NC','NE','NH','NJ','NM','NV','NY','ND','OH','OK','OR','PA','RI','SC','SD','TN','TX','UT','VT','VA','WA','WV','WI','WY'
    ] as states
) AS arrays;

And the result set:

Template string randomization function

We used a concatenation of several random values to generate a phone number, but in some cases, this method is not very convenient.

If you have a large number of columns whose values must match a certain pattern, you can use the function:

CREATE OR REPLACE FUNCTION rand_format(str text)
RETURNS text
LANGUAGE sql
AS $function$
       select
         array_to_string(
           array_agg(
             replace(replace(x, 'A', substring('ABCDEFGHIJKLMNOPQRSTUVWXYZ' from trunc(random()*26 + 1)::int for 1)), '9', trunc(random()*9 + 1)::text)
           ),
         '')
       from (select regexp_split_to_table(str, '') as x) a
$function$
;

The function takes a string containing a pattern and replaces 9 with a random number and the letter A with a random alphabetic character.

Examples of how to use it:

select
	rand_format('(999) 999-9999') as phone_number,
	rand_format('AAA-9999') as plate_number,
	rand_format('99 99 999999') as passport_number,
	rand_format('99 AA 999999') as driver_licence,
	rand_format('199.199.199.199') as ip_address;

Here is the result:

Of course, there are a number of limitations in such a primitive implementation of the function. For example, each of the 4 numbers in the IP address is generated in the range from 100 to 199, although it should be in the range from 0 to 255. But for some test cases, this may be acceptable.

Randomization function with a subset

I would like to give here one more overloaded implementation of this function:

CREATE OR REPLACE FUNCTION rand_format(str text, subset text)
RETURNS text
LANGUAGE sql
AS $function$
       select
         array_to_string(
           array_agg(
             replace(x, 'A', substring(subset from trunc(random()*length(subset) + 1)::int for 1))
           ),
         '')
       from (select regexp_split_to_table(str, '') as x) a
$function$
;

This implementation takes 2 parameters as input - a template string and a set of characters to replace. Thus, all A's will be replaced with random ones from the pattern.

Examples of this function call:

select
	rand_format('AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA', '0123456789abcdef') as ip_v6,
	rand_format('AAAAAAAA AAAAAAAA AAAAAAAA AAAAAAAA', 'abcdefghijklmnopqrstuvwxyz ') secret_phrase;

Here we can see how this function works:

Extending the Query

Let's add all of the above to our main query, and also create an outer query so that we can combine test columns:

select
	concat(firstname, ' ', middlename, ' ', lastname) as client_name,
	concat(firstname, middlename, lastname, '@', case (random()*2)::int when 0 then 'gmail' when 1 then 'hotmail' when 2 then 'yahoo' end || '.com') as email,
	concat(house, ', ', street, ' ', street_type, ', ', city, ', ', state) as address,
	birth_date, ssn, amount, phone,
	rand_format('(999) 999-9999') as phone_number2,
	rand_format('AAA-9999') as plate_number,
	rand_format('99 99 999999') as passport_number,
	rand_format('99 AA 999999') as driver_licence,
	rand_format('199.199.199.199') as ip_address,
	rand_format('AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA', '0123456789abcdef') as ip_v6,
	rand_format('AAAAAAAA AAAAAAAA AAAAAAAA AAAAAAAA', 'abcdefghijklmnopqrstuvwxyz ') secret_phrase
from (
select s as id,
    arrays.firstnames[trunc(random() * ARRAY_LENGTH(arrays.firstnames, 1) + 1)] AS firstname,
    substring('ABCDEFGHIJKLMNOPQRSTUVWXYZ' from trunc(random() * 26 + 1)::int for 1) AS middlename,
    arrays.lastnames[trunc(random() * ARRAY_LENGTH(arrays.lastnames,1) + 1)] AS lastname,
	date(now() - trunc(random() * 365 * 82 /*max age + 18*/) * '1 day'::interval - interval '18 year' /* min age*/) as birth_date,
	rand_format('999-99-9999') as ssn,
	round((random() * 100000)::numeric, 2) as amount,
	(100 + random() * 9900)::int as house,
	arrays.streets[trunc(random() * ARRAY_LENGTH(arrays.streets, 1) + 1)]  AS street,
	case (random() * 2)::int when 0 then 'St.' when 1 then 'Ave.' when 2 then 'Rd.' end as street_type,
	arrays.cities[trunc(random() * ARRAY_LENGTH(arrays.cities, 1) + 1)] AS city,
	arrays.states[trunc(random() * ARRAY_LENGTH(arrays.states, 1) + 1)] AS state,
	concat(FLOOR(100 + random() * 900), ' ', FLOOR(100 + random() * 900), ' ', FLOOR(1000 + random() * 9000)) as phone
from generate_series(1, 1000) AS s -- number of rows to generate
CROSS JOIN(
    SELECT ARRAY[
		'Adam', 'Bill', 'Bob', 'Donald', 'Frank', 'George', 'James', 'John', 'Jacob', 'Jack', 'Martin', 'Matthew', 'Max', 'Michael', 'Paul','Peter', 'Ronald',
		'Samuel','Steve','William', 'Abigail', 'Alice', 'Amanda', 'Barbara','Betty', 'Carol', 'Donna', 'Jane','Jennifer','Julie','Mary','Melissa','Sarah','Susan'
    ] AS firstnames,
    ARRAY[
        'Matthews','Smith','Jones','Davis','Jacobson','Williams','Donaldson','Maxwell','Peterson','Stevens', 'Franklin','Washington','Jefferson','Adams',
        'Jackson','Johnson','Lincoln','Grant','Fillmore','Harding','Taft', 'Truman','Nixon','Ford','Carter','Reagan','Bush','Clinton','Hancock'
    ] AS lastnames,
    ARRAY[
		'Green', 'Smith', 'Church', 'Grant', 'Cedar', 'Forest', 'Frankl', 'Birch', 'Jones', 'Brown',
		'Cherry', 'Willow', 'Rose', 'School', 'Wilson', 'Center', 'Walnut', 'Mill', 'Valley'
    ] as streets,
    ARRAY[
		'Washington', 'Franklin', 'Clinton', 'Georgetown', 'Springfield', 'Chester', 'Greenville', 'Dayton', 'Madison', 'Salem',
		'Winchester', 'Oakland', 'Milton', 'Newport', 'Ashland', 'Riverside', 'Manchester', 'Oxford', 'Burlington', 'Jackson', 'Milford'
    ] as cities,
    ARRAY[
		'AK','AL','AR','AZ','CA','CO','CT','DC','DE','FL','GA','HI','IA','ID','IL','IN','KS','KY','LA','MA','MD','ME','MI','MN','MS','MO',
		'MT','NC','NE','NH','NJ','NM','NV','NY','ND','OH','OK','OR','PA','RI','SC','SD','TN','TX','UT','VT','VA','WA','WV','WI','WY'
    ] as states
) AS arrays
) test_data;

And our result test set is:

With this query, you can combine different fields with each other to get the values of new ones. For example, we used firstname, middlename, and lastname to generate values for the email field.

Note that we also used the rand_format function to generate the SSN in the format 123-45-6789.

Using in MySQL

For MySQL version 8, you can use the CTE to generate series:

WITH RECURSIVE s as
	(SELECT 1 AS n UNION ALL SELECT n + 1 AS value FROM s WHERE s.n < 1000)
SELECT n from s;

If you have MySQL version 5, then to generate a sequence, you can create a table and add data to it:

create table t(f int);
insert into t select 1;
insert into t select * from t; -- execute this line multiple times

If it is not possible to use CTE and if you do not want to create tables, you can use a table from information_schema, for example, columns:

select * from information_schema.COLUMNS c; -- in my database I get 3652 rows

If you need to generate more records in one query, then you can join the table with itself:

-- in my case this query returns more than 13 million of rows
select * from information_schema.COLUMNS c cross join information_schema.COLUMNS c1;

Although, this method may not be the best solution in terms of performance.

In order for the above queries to work correctly in MySQL, you also need to:

replace random() with rand()
replace type casting with a round or floor function
write arrays via union
change the way ID is generated

Example of query in MySQL using information_schema:

SET @row_number = 0;

select  @row_number := @row_number + 1 as id, a.*
from
(select
	firstnames.firstname, lastnames.lastname,
	100000000 + round(rand() * 900000000) as ssn,
	round((rand() * 100000), 2) as amount,
	round(100 + rand() * 9900) as house,
	case (round(rand() * 2)) when 0 then 'St.' when 1 then 'Ave.' when 2 then 'Rd.' end as street_type,
	concat(FLOOR(100 + rand() * 900), ' ', FLOOR(100 + rand() * 900), ' ', FLOOR(1000 + rand() * 9000)) as phone
from (
    select 'Adam' as firstname
    union select 'Bill' union select 'Bob' union select 'Donald' union select 'Frank' union select 'George'
	union select 'James' union select 'John' union select 'Jacob' union select 'Jack' union select 'Martin' union select 'Matthew'
	union select 'Max' union select 'Michael' union select 'Paul' union select 'Ronald'
 ) firstnames,
 (
	select 'Matthews' as lastname union select 'Smith' union select 'Jones' union select 'Davis' union select 'Jacobson' union select 'Williams' union select 'Donaldson'
	union select 'Maxwell' union select 'Peterson' union select 'Stevens' union select 'Franklin' union select 'Washington' union select 'Jefferson'
	union select 'Adams' union select 'Jackson' union select 'Johnson' union select 'Lincoln' union select 'Grant' union select 'Fillmore' union select 'Harding'
	union select 'Taft' union select 'Truman' union select 'Nixon' union select 'Ford' union select 'Carter' union select 'Reagan' union select 'Bush'
	union select 'Clinton' union select 'Hancock'
 ) lastnames
join information_schema.columns c
order by rand()
) a
;

And its result: