paint-brush
How to Efficiently Generate Test Data With SQLby@yuridanilov
7,129 reads
7,129 reads

How to Efficiently Generate Test Data With SQL

by Yuri DanilovSeptember 26th, 2022
Read on Terminal Reader
Read this story w/o Javascript
tldt arrow

Too Long; Didn't Read

There are a large number of utilities for generating test data. But in some cases you may wish to generate this data manually. I would like to share my own experience in this area and hope that the examples that I will cover in this article will help you with this.

Companies Mentioned

Mention Thumbnail
Mention Thumbnail

Coin Mentioned

Mention Thumbnail
featured image - How to Efficiently Generate Test Data With SQL
Yuri Danilov HackerNoon profile picture


There are a large number of utilities for generating test data. But in some cases, you may wish to generate this data manually.


I would like to share my own experience in this area and hope that the examples that I will cover in this article will help you with this.


You can always use database-specific functions and they will probably work more efficiently, but I use primitive functions to make the code as portable as possible.


The examples below are for PostgreSQL but can be adapted for other databases with some modifications. Later I will talk about this.


Series

First of all, we need a query generating a specified number of rows. I prefer to use the generate_series function for this:


select s as id from generate_series(1, 1000) AS s; -- number of rows to generate


The result set contains 1000 rows:


IDs


First and last name

Let's say we want to generate the names of our clients.


We can use a couple of arrays containing first names and last names and randomly select values from them:


select
    arrays.firstnames[trunc(random() * ARRAY_LENGTH(arrays.firstnames, 1) + 1)] AS firstname,
    arrays.lastnames[trunc(random() * ARRAY_LENGTH(arrays.lastnames,1) + 1)] AS lastname
from  (
    SELECT ARRAY[
		'Adam', 'Bill', 'Bob', 'Donald', 'Frank', 'George', 'James', 'John', 'Jacob', 'Jack', 'Martin', 'Matthew', 'Max', 'Michael', 'Paul','Peter', 'Ronald',
		'Samuel','Steve','William', 'Abigail', 'Alice', 'Amanda', 'Barbara','Betty', 'Carol', 'Donna', 'Jane','Jennifer','Julie','Mary','Melissa','Sarah','Susan'
    ] AS firstnames,
    ARRAY[
        'Matthews','Smith','Jones','Davis','Jacobson','Williams','Donaldson','Maxwell','Peterson','Stevens', 'Franklin','Washington','Jefferson','Adams',
        'Jackson','Johnson','Lincoln','Grant','Fillmore','Harding','Taft', 'Truman','Nixon','Ford','Carter','Reagan','Bush','Clinton','Hancock'
    ] AS lastnames
) AS arrays;


The result:

First and last name


Here we use the expression that generates a random index within the array length:


trunc(random() * ARRAY_LENGTH(arrays.firstnames, 1) + 1)


Connecting arrays with series

Combined with the generate_series function, our query can now generate 1000 rows of random first and last names with IDs:


select s as id,
    arrays.firstnames[trunc(random() * ARRAY_LENGTH(arrays.firstnames, 1) + 1)] AS firstname,
    arrays.lastnames[trunc(random() * ARRAY_LENGTH(arrays.lastnames,1) + 1)] AS lastname
from generate_series(1, 1000) AS s -- number of rows to generate
CROSS JOIN(
    SELECT ARRAY[
		'Adam', 'Bill', 'Bob', 'Donald', 'Frank', 'George', 'James', 'John', 'Jacob', 'Jack', 'Martin', 'Matthew', 'Max', 'Michael', 'Paul','Peter', 'Ronald',
		'Samuel','Steve','William', 'Abigail', 'Alice', 'Amanda', 'Barbara','Betty', 'Carol', 'Donna', 'Jane','Jennifer','Julie','Mary','Melissa','Sarah','Susan'
    ] AS firstnames,
    ARRAY[
        'Matthews','Smith','Jones','Davis','Jacobson','Williams','Donaldson','Maxwell','Peterson','Stevens', 'Franklin','Washington','Jefferson','Adams',
        'Jackson','Johnson','Lincoln','Grant','Fillmore','Harding','Taft', 'Truman','Nixon','Ford','Carter','Reagan','Bush','Clinton','Hancock'
    ] AS lastnames
) AS arrays;


And an example of the result:


First and last names


Random letter for a middle name

Suppose we also need a random letter for a middle name, then we can use the following query:


select substring('ABCDEFGHIJKLMNOPQRSTUVWXYZ' from trunc(random() * 26 + 1)::int for 1) AS middlename;


We take a string containing all the letters of the alphabet and choose a random character from it.

The result:


Middle name


Date of birth

What other fields might we need? Maybe a date of birth.


The following expression returns a date between 18 and 100(approximately) years ago:


select date(now() - trunc(random() * 365 * 82 /*max age + 18*/) * '1 day'::interval - interval '18 year' /* min age*/) as birth_date;


Example of birth date:


Date of birth


You can check the correctness of the expression by replacing the random with 1. This expression returns us to date approximately 100 years ago:


select date(now() - trunc(1 * 365 * 82 /*max age + 18*/) * '1 day'::interval - interval '18 year' /* min age*/) as birth_date;


"Approximately" - because we don't take into account leap years.


SSN

SSN is a 9-digit value that can be obtained in the following way:


select 100000000 + round(random() * 900000000) as ssn;


SSN:


SSN


The SSN is usually written in the format 999-99-9999. A little later we will see how to do this.


Money

Also, for financial institutions, the amount on the client's account matters.


Often it is represented by 2 decimal places, so we will specify an interval and round it up:


select round((random() * 100000)::numeric, 2) as amount;


The result:

Money amount


Address and phone

Often you need information about the customer's address and phone.


An address usually consists of a house number, street, city, and state.


Let's add the corresponding arrays to our query:


    ARRAY[
		'Green', 'Smith', 'Church', 'Grant', 'Cedar', 'Forest', 'Frankl', 'Birch', 'Jones', 'Brown',
		'Cherry', 'Willow', 'Rose', 'School', 'Wilson', 'Center', 'Walnut', 'Mill', 'Valley'
    ] as streets,
    ARRAY[
		'Washington', 'Franklin', 'Clinton', 'Georgetown', 'Springfield', 'Chester', 'Greenville', 'Dayton', 'Madison', 'Salem',
		'Winchester', 'Oakland', 'Milton', 'Newport', 'Ashland', 'Riverside', 'Manchester', 'Oxford', 'Burlington', 'Jackson', 'Milford'
    ] as cities,
    ARRAY[
		'AK','AL','AR','AZ','CA','CO','CT','DC','DE','FL','GA','HI','IA','ID','IL','IN','KS','KY','LA','MA','MD','ME','MI','MN','MS','MO',
		'MT','NC','NE','NH','NJ','NM','NV','NY','ND','OH','OK','OR','PA','RI','SC','SD','TN','TX','UT','VT','VA','WA','WV','WI','WY'
    ] as states


The street name is often followed by its type. We can use the case operator with a random switch parameter like this:


select case (random() * 2)::int when 0 then 'St.' when 1 then 'Ave.' when 2 then 'Rd.' end as street_type;


The result:


Street type


The case construct is useful when you want to randomize a small number of distinct values.


A house number is just a random number, let's say between 100 and 10000:


select (100 + random() * 9900)::int as house;


The result:

House number


A phone number in the 'XXX XXX XXXX' format can be generated like this:


select concat(FLOOR(100 + random() * 900), ' ', FLOOR(100 + random() * 900), ' ', FLOOR(1000 + random() * 9000)) as phone;


Phone number:



Putting it all together

Now we put everything together and we get the following query:


select s as id,
    arrays.firstnames[trunc(random() * ARRAY_LENGTH(arrays.firstnames, 1) + 1)] AS firstname,
    substring('ABCDEFGHIJKLMNOPQRSTUVWXYZ' from trunc(random() * 26 + 1)::int for 1) AS middlename,
    arrays.lastnames[trunc(random() * ARRAY_LENGTH(arrays.lastnames,1) + 1)] AS lastname,
	date(now() - trunc(random() * 365 * 82 /*max age + 18*/) * '1 day'::interval - interval '18 year' /* min age*/) as birth_date,
	100000000 + round(random() * 900000000) as ssn,
	round((random() * 100000)::numeric, 2) as amount,
	(100 + random() * 9900)::int as house,
	arrays.streets[trunc(random() * ARRAY_LENGTH(arrays.streets, 1) + 1)]  AS street,
	case (random() * 2)::int when 0 then 'St.' when 1 then 'Ave.' when 2 then 'Rd.' end as street_type,
	arrays.cities[trunc(random() * ARRAY_LENGTH(arrays.cities, 1) + 1)] AS city,
	arrays.states[trunc(random() * ARRAY_LENGTH(arrays.states, 1) + 1)] AS state,
	concat(FLOOR(100 + random() * 900), ' ', FLOOR(100 + random() * 900), ' ', FLOOR(1000 + random() * 9000)) as phone
from generate_series(1, 1000) AS s -- number of rows to generate
CROSS JOIN(
    SELECT ARRAY[
		'Adam', 'Bill', 'Bob', 'Donald', 'Frank', 'George', 'James', 'John', 'Jacob', 'Jack', 'Martin', 'Matthew', 'Max', 'Michael', 'Paul','Peter', 'Ronald',
		'Samuel','Steve','William', 'Abigail', 'Alice', 'Amanda', 'Barbara','Betty', 'Carol', 'Donna', 'Jane','Jennifer','Julie','Mary','Melissa','Sarah','Susan'
    ] AS firstnames,
    ARRAY[
        'Matthews','Smith','Jones','Davis','Jacobson','Williams','Donaldson','Maxwell','Peterson','Stevens', 'Franklin','Washington','Jefferson','Adams',
        'Jackson','Johnson','Lincoln','Grant','Fillmore','Harding','Taft', 'Truman','Nixon','Ford','Carter','Reagan','Bush','Clinton','Hancock'
    ] AS lastnames,
    ARRAY[
		'Green', 'Smith', 'Church', 'Grant', 'Cedar', 'Forest', 'Frankl', 'Birch', 'Jones', 'Brown',
		'Cherry', 'Willow', 'Rose', 'School', 'Wilson', 'Center', 'Walnut', 'Mill', 'Valley'
    ] as streets,
    ARRAY[
		'Washington', 'Franklin', 'Clinton', 'Georgetown', 'Springfield', 'Chester', 'Greenville', 'Dayton', 'Madison', 'Salem',
		'Winchester', 'Oakland', 'Milton', 'Newport', 'Ashland', 'Riverside', 'Manchester', 'Oxford', 'Burlington', 'Jackson', 'Milford'
    ] as cities,
    ARRAY[
		'AK','AL','AR','AZ','CA','CO','CT','DC','DE','FL','GA','HI','IA','ID','IL','IN','KS','KY','LA','MA','MD','ME','MI','MN','MS','MO',
		'MT','NC','NE','NH','NJ','NM','NV','NY','ND','OH','OK','OR','PA','RI','SC','SD','TN','TX','UT','VT','VA','WA','WV','WI','WY'
    ] as states
) AS arrays;


And the result set:


The test set containing randomly generated user data


Template string randomization function

We used a concatenation of several random values to generate a phone number, but in some cases, this method is not very convenient.


If you have a large number of columns whose values must match a certain pattern, you can use the function:


CREATE OR REPLACE FUNCTION rand_format(str text)
RETURNS text
LANGUAGE sql
AS $function$
       select
         array_to_string(
           array_agg(
             replace(replace(x, 'A', substring('ABCDEFGHIJKLMNOPQRSTUVWXYZ' from trunc(random()*26 + 1)::int for 1)), '9', trunc(random()*9 + 1)::text)
           ),
         '')
       from (select regexp_split_to_table(str, '') as x) a
$function$
;


The function takes a string containing a pattern and replaces 9 with a random number and the letter A with a random alphabetic character.


Examples of how to use it:


select
	rand_format('(999) 999-9999') as phone_number,
	rand_format('AAA-9999') as plate_number,
	rand_format('99 99 999999') as passport_number,
	rand_format('99 AA 999999') as driver_licence,
	rand_format('199.199.199.199') as ip_address;


Here is the result:


Examples of using the rand_format function


Of course, there are a number of limitations in such a primitive implementation of the function. For example, each of the 4 numbers in the IP address is generated in the range from 100 to 199, although it should be in the range from 0 to 255. But for some test cases, this may be acceptable.


Randomization function with a subset

I would like to give here one more overloaded implementation of this function:


CREATE OR REPLACE FUNCTION rand_format(str text, subset text)
RETURNS text
LANGUAGE sql
AS $function$
       select
         array_to_string(
           array_agg(
             replace(x, 'A', substring(subset from trunc(random()*length(subset) + 1)::int for 1))
           ),
         '')
       from (select regexp_split_to_table(str, '') as x) a
$function$
;


This implementation takes 2 parameters as input - a template string and a set of characters to replace. Thus, all A's will be replaced with random ones from the pattern.


Examples of this function call:


select
	rand_format('AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA', '0123456789abcdef') as ip_v6,
	rand_format('AAAAAAAA AAAAAAAA AAAAAAAA AAAAAAAA', 'abcdefghijklmnopqrstuvwxyz ') secret_phrase;


Here we can see how this function works:


Examples of using the rand_format function


Extending the Query

Let's add all of the above to our main query, and also create an outer query so that we can combine test columns:


select
	concat(firstname, ' ', middlename, ' ', lastname) as client_name,
	concat(firstname, middlename, lastname, '@', case (random()*2)::int when 0 then 'gmail' when 1 then 'hotmail' when 2 then 'yahoo' end || '.com') as email,
	concat(house, ', ', street, ' ', street_type, ', ', city, ', ', state) as address,
	birth_date, ssn, amount, phone,
	rand_format('(999) 999-9999') as phone_number2,
	rand_format('AAA-9999') as plate_number,
	rand_format('99 99 999999') as passport_number,
	rand_format('99 AA 999999') as driver_licence,
	rand_format('199.199.199.199') as ip_address,
	rand_format('AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA', '0123456789abcdef') as ip_v6,
	rand_format('AAAAAAAA AAAAAAAA AAAAAAAA AAAAAAAA', 'abcdefghijklmnopqrstuvwxyz ') secret_phrase
from (
select s as id,
    arrays.firstnames[trunc(random() * ARRAY_LENGTH(arrays.firstnames, 1) + 1)] AS firstname,
    substring('ABCDEFGHIJKLMNOPQRSTUVWXYZ' from trunc(random() * 26 + 1)::int for 1) AS middlename,
    arrays.lastnames[trunc(random() * ARRAY_LENGTH(arrays.lastnames,1) + 1)] AS lastname,
	date(now() - trunc(random() * 365 * 82 /*max age + 18*/) * '1 day'::interval - interval '18 year' /* min age*/) as birth_date,
	rand_format('999-99-9999') as ssn,
	round((random() * 100000)::numeric, 2) as amount,
	(100 + random() * 9900)::int as house,
	arrays.streets[trunc(random() * ARRAY_LENGTH(arrays.streets, 1) + 1)]  AS street,
	case (random() * 2)::int when 0 then 'St.' when 1 then 'Ave.' when 2 then 'Rd.' end as street_type,
	arrays.cities[trunc(random() * ARRAY_LENGTH(arrays.cities, 1) + 1)] AS city,
	arrays.states[trunc(random() * ARRAY_LENGTH(arrays.states, 1) + 1)] AS state,
	concat(FLOOR(100 + random() * 900), ' ', FLOOR(100 + random() * 900), ' ', FLOOR(1000 + random() * 9000)) as phone
from generate_series(1, 1000) AS s -- number of rows to generate
CROSS JOIN(
    SELECT ARRAY[
		'Adam', 'Bill', 'Bob', 'Donald', 'Frank', 'George', 'James', 'John', 'Jacob', 'Jack', 'Martin', 'Matthew', 'Max', 'Michael', 'Paul','Peter', 'Ronald',
		'Samuel','Steve','William', 'Abigail', 'Alice', 'Amanda', 'Barbara','Betty', 'Carol', 'Donna', 'Jane','Jennifer','Julie','Mary','Melissa','Sarah','Susan'
    ] AS firstnames,
    ARRAY[
        'Matthews','Smith','Jones','Davis','Jacobson','Williams','Donaldson','Maxwell','Peterson','Stevens', 'Franklin','Washington','Jefferson','Adams',
        'Jackson','Johnson','Lincoln','Grant','Fillmore','Harding','Taft', 'Truman','Nixon','Ford','Carter','Reagan','Bush','Clinton','Hancock'
    ] AS lastnames,
    ARRAY[
		'Green', 'Smith', 'Church', 'Grant', 'Cedar', 'Forest', 'Frankl', 'Birch', 'Jones', 'Brown',
		'Cherry', 'Willow', 'Rose', 'School', 'Wilson', 'Center', 'Walnut', 'Mill', 'Valley'
    ] as streets,
    ARRAY[
		'Washington', 'Franklin', 'Clinton', 'Georgetown', 'Springfield', 'Chester', 'Greenville', 'Dayton', 'Madison', 'Salem',
		'Winchester', 'Oakland', 'Milton', 'Newport', 'Ashland', 'Riverside', 'Manchester', 'Oxford', 'Burlington', 'Jackson', 'Milford'
    ] as cities,
    ARRAY[
		'AK','AL','AR','AZ','CA','CO','CT','DC','DE','FL','GA','HI','IA','ID','IL','IN','KS','KY','LA','MA','MD','ME','MI','MN','MS','MO',
		'MT','NC','NE','NH','NJ','NM','NV','NY','ND','OH','OK','OR','PA','RI','SC','SD','TN','TX','UT','VT','VA','WA','WV','WI','WY'
    ] as states
) AS arrays
) test_data;


And our result test set is:


With this query, you can combine different fields with each other to get the values of new ones. For example, we used firstname, middlename, and lastname to generate values for the email field.


Note that we also used the rand_format function to generate the SSN in the format 123-45-6789.


Using in MySQL

For MySQL version 8, you can use the CTE to generate series:


WITH RECURSIVE s as
	(SELECT 1 AS n UNION ALL SELECT n + 1 AS value FROM s WHERE s.n < 1000)
SELECT n from s;


If you have MySQL version 5, then to generate a sequence, you can create a table and add data to it:


create table t(f int);
insert into t select 1;
insert into t select * from t; -- execute this line multiple times


If it is not possible to use CTE and if you do not want to create tables, you can use a table from information_schema, for example, columns:


select * from information_schema.COLUMNS c; -- in my database I get 3652 rows


If you need to generate more records in one query, then you can join the table with itself:


-- in my case this query returns more than 13 million of rows
select * from information_schema.COLUMNS c cross join information_schema.COLUMNS c1;


Although, this method may not be the best solution in terms of performance.

In order for the above queries to work correctly in MySQL, you also need to:

  • replace random() with rand()
  • replace type casting with a round or floor function
  • write arrays via union
  • change the way ID is generated


Example of query in MySQL using information_schema:


SET @row_number = 0;

select  @row_number := @row_number + 1 as id, a.*
from
(select
	firstnames.firstname, lastnames.lastname,
	100000000 + round(rand() * 900000000) as ssn,
	round((rand() * 100000), 2) as amount,
	round(100 + rand() * 9900) as house,
	case (round(rand() * 2)) when 0 then 'St.' when 1 then 'Ave.' when 2 then 'Rd.' end as street_type,
	concat(FLOOR(100 + rand() * 900), ' ', FLOOR(100 + rand() * 900), ' ', FLOOR(1000 + rand() * 9000)) as phone
from (
    select 'Adam' as firstname
    union select 'Bill' union select 'Bob' union select 'Donald' union select 'Frank' union select 'George'
	union select 'James' union select 'John' union select 'Jacob' union select 'Jack' union select 'Martin' union select 'Matthew'
	union select 'Max' union select 'Michael' union select 'Paul' union select 'Ronald'
 ) firstnames,
 (
	select 'Matthews' as lastname union select 'Smith' union select 'Jones' union select 'Davis' union select 'Jacobson' union select 'Williams' union select 'Donaldson'
	union select 'Maxwell' union select 'Peterson' union select 'Stevens' union select 'Franklin' union select 'Washington' union select 'Jefferson'
	union select 'Adams' union select 'Jackson' union select 'Johnson' union select 'Lincoln' union select 'Grant' union select 'Fillmore' union select 'Harding'
	union select 'Taft' union select 'Truman' union select 'Nixon' union select 'Ford' union select 'Carter' union select 'Reagan' union select 'Bush'
	union select 'Clinton' union select 'Hancock'
 ) lastnames
join information_schema.columns c
order by rand()
) a
;


And its result:


MySQL result


Conclusion

For other DBMS, there may also be some differences, but in general, this approach can be used in any other relational DBMS.


By having this query handy or by creating a view based on it, you will always be equipped with a fast and efficient way to generate test data.