Most web apps/services that use a relational database are built around a web framework and an Object-Relational Mapping (ORM) library, which typically have conventions that prescribe how to create and load test fixtures/data into the database for testing. If you're building a webapp without an ORM [1], the story for how to create and load test data is less clear. What tools and approaches are available, and which work best? There are a lot of articles around the internet that describe specific techniques or example code in isolation, but few that provide a broader survey of the many different approaches that are possible. I hope this article will help fill that gap, exploring and discussing different approaches for creating and loading test data in PostgreSQL.
[1] Wait a minute, why would you build a webapp without an ORM?! This question could spawn an entire article of its own and in fact, many other articles have debated the merits of ORMs over the last couple decades. I won't dive into that debate — it's up to the creator to decide if a project should use an ORM or not, and that decision depends on a lot of project-specific factors, such as the expertise of the creator and their team, the types and velocity of data involved, the performance and scaling requirements, and much more.
If you're interested in generating test data instead of (or in addition to) loading test data, please check back later this week for a follow-up article that explores generating test data for PostgreSQL using SQL, PL/pgSQL, and Python!
Want to follow along? I've collected sample data and scripts in a subfolder of our Tangram Vision blog repo: https://gitlab.com/tangram-vision-oss/tangram-visions-blog/-/tree/main/2021.04.28_LoadingTestDataIntoPostgreSQL
As described in the repo's README, you can run examples using the official Postgres Docker image with:
# The base postgres image requires a password to be set, but we'll just be
# testing locally, so no need to set a strong password.
docker run --name=postgres --rm --env=POSTGRES_PASSWORD=foo \
    --volume=$(pwd)/schema.sql:/docker-entrypoint-initdb.d/schema.sql \
    --volume=$(pwd):/repo \
    postgres:latest -c log_statement=all
To explain this Docker command a bit:

- The base postgres image requires a password to be set (via the POSTGRES_PASSWORD environment variable), but we'll just be testing locally, so no need to set a strong password.
- Any scripts (*.sh and *.sql files) in the /docker-entrypoint-initdb.d folder inside the container will be executed as PostgreSQL starts up. The above command mounts schema.sql into that folder, so the database tables will be created.
- The current directory is mounted to /repo inside the container, so the example SQL and CSV files are accessible.
- The command ends with a log_statement=all config override, which increases the logging verbosity.

The repo contains a variety of files that start with add-data-, which demonstrate different ways of loading and generating test data. After the Postgres Docker container is running, you can run the add-data- files in a new terminal window with a command like:

docker exec --workdir=/repo postgres \
    psql --host=localhost --username=postgres \
    --file=add-data-copy-csv.sql
If you want to interactively poke around the database with psql, use:

docker exec --interactive --tty postgres \
    psql --host=localhost --username=postgres
For example code and data, I'll use the following simple schema:
Sample schema relating musical artists, albums, and genres.
The simplest way to get test data into PostgreSQL is to make a static dataset, which you can save as CSV files or embed in SQL files directly.
In the code repo accompanying this blogpost, there are 4 small CSV files, one for each table of the sample schema. The CSV files contain headers and data rows as shown in the image below.
A small, static sample dataset of musical artists, albums, and genres.
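Hand-writing CSVs is fine for a handful of rows, but you can also script them. Here's a minimal sketch using Python's standard csv module — the rows just mirror a few entries from the sample dataset, and the in-memory buffer stands in for a real file:

```python
import csv
import io

# Header plus data rows mirroring the sample artists table.
artists = [
    ("artist_id", "name"),
    (1, "DJ Okawari"),
    (2, "Steely Dan"),
    (3, "Missy Elliott"),
]

# Write to an in-memory buffer; swap in open("artists.csv", "w", newline="")
# to produce an actual file that COPY can ingest.
buf = io.StringIO()
csv.writer(buf).writerows(artists)
print(buf.getvalue())
```

The csv module handles quoting and escaping for you, which matters as soon as a title or name contains a comma or quote character.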
We can import the data from these CSV files into a PostgreSQL database with the SQL COPY command:
-- Excerpt from add-data-copy-csv.sql in the sample code repo
COPY artists FROM '/repo/artists.csv' CSV HEADER;
COPY albums FROM '/repo/albums.csv' CSV HEADER;
COPY genres FROM '/repo/genres.csv' CSV HEADER;
COPY album_genres FROM '/repo/album_genres.csv' CSV HEADER;
The COPY command has a variety of options for controlling quoting, delimiters, escape characters, and more. You can even limit which rows are imported with a WHERE clause. One potential downside: COPY reads and writes files on the database server's filesystem, so you must run it as a database superuser or as a user granted one of the roles that permit reading, writing, or executing files on the server (pg_read_server_files, pg_write_server_files, pg_execute_server_program). This isn't a concern when loading data for local testing, but keep it in mind if you ever want to use it in a more restrictive or production-like environment.
The PostgreSQL interactive terminal (called psql) provides a \copy meta-command that is very similar to SQL COPY:
-- Excerpt from add-data-copy-csv.psql in the sample code repo
\copy artists from 'artists.csv' csv header
\copy albums from 'albums.csv' csv header
\copy genres from 'genres.csv' csv header
\copy album_genres from 'album_genres.csv' csv header
There are some important differences between SQL COPY and psql \copy:

- SQL COPY reads files on the database server, while \copy reads files on the client where psql is running. That's why \copy works with relative paths and doesn't require superuser access or server filesystem permissions.
- \copy is a psql meta-command, so it starts with a backslash (\) and doesn't need to end with a semicolon (;).

As an alternative to storing data in separate CSV files (which are loaded with SQL or psql commands), you can store data in SQL files directly.
SQL COPY from stdin and pg_dump
The SQL COPY and psql \copy commands can load data from stdin instead of a file. They will parse and load all the lines between the copy command and \. as rows of data.

-- Excerpt from add-data-copy-stdin.sql in the sample code repo
COPY public.artists (artist_id, name) FROM stdin CSV;
1,"DJ Okawari"
2,"Steely Dan"
3,"Missy Elliott"
4,"TWRP"
5,"Donald Fagen"
6,"La Luz"
7,"Ella Fitzgerald"
\.
COPY public.albums (album_id, artist_id, title, released) FROM stdin CSV;
1,1,"Mirror",2009-06-24
2,2,"Pretzel Logic",1974-02-20
3,3,"Under Construction",2002-11-12
4,4,"Return to Wherever",2019-07-11
5,5,"The Nightfly",1982-10-01
6,6,"It's Alive",2013-10-15
7,7,"Pure Ella",1994-02-15
\.
...
In fact, this COPY ... FROM stdin approach is how [pg_dump](https://www.postgresql.org/docs/current/app-pgdump.html) outputs data if you're creating a dump or backup from an existing PostgreSQL database. However, pg_dump uses a tab-separated format by default, rather than the comma-separated format shown above.

By default, pg_dump also outputs SQL to re-create everything about the database (tables, constraints, views, functions, sequences, etc.), but you can instruct it to output only data with the --data-only flag. To try out pg_dump with the example Docker image, run:

docker exec --workdir=/repo postgres \
    pg_dump --host=localhost --username=postgres postgres
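The stdin block format is simple enough to manipulate with a script, which can be handy for inspecting or transforming dumps. As a rough sketch (assuming the comma-separated style shown in the excerpt above; remember that pg_dump defaults to tab-separated), the data lines between a COPY command and the \. terminator could be parsed like this in Python:

```python
import csv

# A COPY ... FROM stdin block in the comma-separated style shown above.
dump = """COPY public.artists (artist_id, name) FROM stdin CSV;
1,"DJ Okawari"
2,"Steely Dan"
\\.
"""

def parse_copy_block(text):
    """Collect the data lines between a COPY ... FROM stdin command and
    the terminating backslash-dot line, parsing each one as CSV."""
    lines = text.splitlines()
    start = next(i for i, line in enumerate(lines) if line.startswith("COPY")) + 1
    rows = []
    for line in lines[start:]:
        if line == "\\.":
            break
        rows.append(next(csv.reader([line])))
    return rows

print(parse_copy_block(dump))
```

For tab-separated pg_dump output, you'd pass delimiter="\t" to csv.reader instead.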
SQL INSERTs
Another way to put data directly in SQL is to use INSERT statements. This approach could look like the following:
-- Excerpt from add-data-insert-static-ids.sql in the sample code repo
INSERT INTO artists (artist_id, name)
OVERRIDING SYSTEM VALUE
VALUES
(1, 'DJ Okawari'),
(2, 'Steely Dan'),
(3, 'Missy Elliott'),
(4, 'TWRP'),
(5, 'Donald Fagen'),
(6, 'La Luz'),
(7, 'Ella Fitzgerald');
INSERT INTO albums (album_id, artist_id, title, released)
OVERRIDING SYSTEM VALUE
VALUES
(1, 1, 'Mirror', '2009-06-24'),
(2, 2, 'Pretzel Logic', '1974-02-20'),
(3, 3, 'Under Construction', '2002-11-12'),
(4, 4, 'Return to Wherever', '2019-07-11'),
(5, 5, 'The Nightfly', '1982-10-01'),
(6, 6, 'It''s Alive', '2013-10-15'),
(7, 7, 'Pure Ella', '1994-02-15');
...
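Note the doubled apostrophe in 'It''s Alive' above: single quotes inside SQL string literals are escaped by doubling them. If you ever generate INSERT statements from data with a script, that escaping is easy to automate. Here's a minimal Python sketch (the helper names are my own, not from the sample repo):

```python
def sql_quote(value):
    """Render a Python value as a SQL literal, doubling any single
    quotes inside strings (standard SQL escaping)."""
    if isinstance(value, (int, float)):
        return str(value)
    return "'" + str(value).replace("'", "''") + "'"

def multi_row_insert(table, columns, rows):
    """Build a single multi-row INSERT statement for the given rows."""
    values = ",\n".join(
        "(" + ", ".join(sql_quote(v) for v in row) + ")" for row in rows
    )
    return f"INSERT INTO {table} ({', '.join(columns)})\nVALUES\n{values};"

stmt = multi_row_insert(
    "albums",
    ["artist_id", "title", "released"],
    [(6, "It's Alive", "2013-10-15")],
)
print(stmt)
```

This is only suitable for generating trusted test fixtures; application code talking to a live database should use parameterized queries rather than string-building.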
The OVERRIDING SYSTEM VALUE clause lets us INSERT values into the primary key ID columns explicitly, even though they are defined as GENERATED ALWAYS.

The pg_dump command's --column-inserts option will output data as INSERT statements (a separate statement per row), rather than as the default tab-separated COPY format. Restoring data with INSERTs is much slower than with COPY, so this is only recommended if you're restoring the data to a database that doesn't support COPY, such as sqlite3. Using INSERTs can be sped up somewhat with the --rows-per-insert option, which lets each INSERT statement add many rows at once, reducing the overhead of back-and-forth communication between client and server for every SQL statement.

Using INSERT statements, we could start moving away from statically declaring everything about our datasets: we could omit the primary key ID columns and look up IDs as needed when inserting foreign keys, as in the following example:
-- Excerpt from add-data-insert-queried-ids.sql in the sample code repo
INSERT INTO artists (name)
VALUES
('DJ Okawari'),
('Steely Dan'),
('Missy Elliott'),
('TWRP'),
('Donald Fagen'),
('La Luz'),
('Ella Fitzgerald');
INSERT INTO albums (artist_id, title, released)
VALUES
((SELECT artist_id FROM artists WHERE name = 'DJ Okawari'), 'Mirror', '2009-06-24'),
((SELECT artist_id FROM artists WHERE name = 'Steely Dan'), 'Pretzel Logic', '1974-02-20'),
((SELECT artist_id FROM artists WHERE name = 'Missy Elliott'), 'Under Construction', '2002-11-12'),
((SELECT artist_id FROM artists WHERE name = 'TWRP'), 'Return to Wherever', '2019-07-11'),
((SELECT artist_id FROM artists WHERE name = 'Donald Fagen'), 'The Nightfly', '1982-10-01'),
((SELECT artist_id FROM artists WHERE name = 'La Luz'), 'It''s Alive', '2013-10-15'),
((SELECT artist_id FROM artists WHERE name = 'Ella Fitzgerald'), 'Pure Ella', '1994-02-15');
...
This is hardly convenient, though, because we need to duplicate other row information (such as the artist name) in order to look up the corresponding ID. It gets even more complex if multiple artists have the same name! So, if you have a static dataset, I'd suggest sticking to one of the previously mentioned approaches that use SQL COPY or psql \copy.
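If you do want IDs without the repeated subquery lookups, one alternative is to generate the SQL with a script that assigns IDs in a single pass, keeping a name-to-id map so album rows can reference artists directly. A minimal Python sketch of the idea (the data is from the sample dataset, but the generation code is my own, not from the repo):

```python
artists = ["DJ Okawari", "Steely Dan", "Missy Elliott"]
albums = [
    ("DJ Okawari", "Mirror", "2009-06-24"),
    ("Steely Dan", "Pretzel Logic", "1974-02-20"),
]

# Assign sequential IDs up front so album rows can reference artists
# without repeating SELECT subqueries in the generated SQL.
artist_ids = {name: i for i, name in enumerate(artists, start=1)}

stmts = []
for name, artist_id in artist_ids.items():
    stmts.append(
        f"INSERT INTO artists (artist_id, name) OVERRIDING SYSTEM VALUE "
        f"VALUES ({artist_id}, '{name}');"
    )
for artist_name, title, released in albums:
    stmts.append(
        f"INSERT INTO albums (artist_id, title, released) "
        f"VALUES ({artist_ids[artist_name]}, '{title}', '{released}');"
    )
print("\n".join(stmts))
```

(String values aren't escaped here since the sample names contain no apostrophes; real data would need single quotes doubled.) This is a taste of the generation approaches covered in the follow-up article.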
Is there a reason to prefer putting static datasets in CSVs or directly in SQL files? My thoughts boil down to the following points:

- CSV files are easy to view and edit in spreadsheet software, and the same files can be loaded into tools other than PostgreSQL.
- SQL files keep the data next to the commands that load it, so a single psql invocation is all you need. The COPY ... FROM stdin format also matches what pg_dump produces, which is convenient if your test data starts life as a dump of an existing database.
I hope you learned something new and useful about the different approaches and tools available for loading static datasets into PostgreSQL. Check back soon for the follow-up article that explores generating test data for PostgreSQL using SQL, PL/pgSQL, and Python!
If you have any suggestions or corrections, please let me know or send us a tweet, and if you’re curious to learn more about how we improve perception sensors, visit us at Tangram Vision.
Previously published at https://www.tangramvision.com/blog/loading-test-data-into-postgresql