Most web apps and services that use a relational database are built around a web framework and an Object-Relational Mapping (ORM) library, which typically have conventions that prescribe how to create and load test fixtures/data into the database for testing. If you're building a webapp without an ORM [1], the story for how to create and load test data is less clear. What tools and approaches are available, and which work best? There are a lot of articles around the internet that describe specific techniques or example code in isolation, but few that provide a broader survey of the many different approaches that are possible. I hope this article will help fill that gap, exploring and discussing different approaches for creating and loading test data in PostgreSQL.

[1] Wait a minute, why would you build a webapp without an ORM?! This question could spawn an entire article of its own and in fact, many other articles have debated about ORMs for the last couple decades. I won't dive into that debate; it's up to the creator to decide if a project should use an ORM or not, and that decision depends on a lot of project-specific factors, such as the expertise of the creator and their team, the types and velocity of data involved, the performance and scaling requirements, and much more.

If you're interested in generating test data instead of (or in addition to) loading test data, please check back later this week for a follow-up article that explores generating test data for PostgreSQL using SQL, PL/pgSQL, and Python!

Follow Along with Docker

Want to follow along? I've collected sample data and scripts in a subfolder of our Tangram Vision blog repo: https://gitlab.com/tangram-vision-oss/tangram-visions-blog/-/tree/main/2021.04.28_LoadingTestDataIntoPostgreSQL

As described in the repo's README, you can run examples using the official Postgres Docker image with:

    # The base postgres image requires a password to be set, but we'll just be
    # testing locally, so no need to set a strong password.
    docker run --name=postgres --rm --env=POSTGRES_PASSWORD=foo \
        --volume=$(pwd)/schema.sql:/docker-entrypoint-initdb.d/schema.sql \
        --volume=$(pwd):/repo \
        postgres:latest -c log_statement=all

To explain this Docker command a bit:

- The base postgres image requires a password to be set (via the POSTGRES_PASSWORD environment variable), but we'll just be testing locally, so no need to set a strong password.
- Executable scripts (*.sh and *.sql files) in the /docker-entrypoint-initdb.d folder inside the container will be executed as PostgreSQL starts up. The above command mounts schema.sql into that folder, so the database tables will be created.
- The repo is also mounted to /repo inside the container, so example SQL and CSV files are accessible.
- The PostgreSQL server is started with the log_statement=all config override, which increases the logging verbosity.

The repo contains a variety of files that start with add-data- which demonstrate different ways of loading and generating test data. After the Postgres Docker container is running, you can run add-data- files in a new terminal window with a command like:

    docker exec --workdir=/repo postgres \
        psql --host=localhost --username=postgres \
             --file=add-data-sql-copy-csv.sql

If you want to interactively poke around the database with psql, use:

    docker exec --interactive --tty postgres \
        psql --host=localhost --username=postgres

Sample Schema

For example code and data, I'll use the following simple schema:

- Musical artists have a name
- An artist can have many albums (one-to-many), which have a title and release date
- Genres have a name
- Albums can belong to many genres (many-to-many)

[Figure: Sample schema relating musical artists, albums, and genres.]

Loading Static Data

The simplest way to get test data into PostgreSQL is to make a static dataset, which you can save as CSV files or embed in SQL files directly.

SQL COPY from CSV Files

In the code repo accompanying this blogpost, there are 4 small CSV files, one for each table of the sample schema. The CSV files contain headers and data rows as shown in the image below.

[Figure: A small, static sample dataset of musical artists, albums, and genres.]

We can import the data from these CSV files into a PostgreSQL database with the SQL COPY command:

    -- Excerpt from add-data-copy-csv.sql in the sample code repo
    COPY artists FROM '/repo/artists.csv' CSV HEADER;
    COPY albums FROM '/repo/albums.csv' CSV HEADER;
    COPY genres FROM '/repo/genres.csv' CSV HEADER;
    COPY album_genres FROM '/repo/album_genres.csv' CSV HEADER;

The COPY command has a variety of options for controlling quoting, delimiters, escape characters, and more. You can even limit which rows are imported with a WHERE clause (supported by COPY FROM since PostgreSQL 12). One potential downside is that you must run it as a database superuser or as a user granted the relevant server file-access role (for example, pg_read_server_files for COPY FROM a file). This isn't a concern when loading data for local testing, but keep it in mind if you ever want to use it in a more restrictive or production-like environment.

Psql Copy from CSV Files

The PostgreSQL interactive terminal (called psql) provides a copy command that is very similar to SQL COPY:

    -- Excerpt from add-data-copy-csv.psql in the sample code repo
    \copy artists from 'artists.csv' csv header
    \copy albums from 'albums.csv' csv header
    \copy genres from 'genres.csv' csv header
    \copy album_genres from 'album_genres.csv' csv header

There are some important differences between SQL COPY and psql copy:

- Like other psql commands, the psql version of the copy command starts with a backslash (\) and doesn't need to end with a semicolon (;).
- SQL COPY runs in the server environment whereas psql copy runs in the client environment. To clarify, the filepath you provide to SQL COPY should point to a file on the server's filesystem. The filepath you provide to psql copy points to a file on the filesystem where you're running the psql client. If you're following along using the Docker image and commands provided in this blogpost, the server and client are the same container, but if you ever want to load data from your local machine to a database on a remote server, then you'll want to use psql copy.
- As a corollary to the above, psql copy is less performant than SQL COPY, because all the data must travel from the client to the server, rather than being directly loaded by the server.
- SQL COPY requires absolute filepaths, but psql can handle relative filepaths.
- Psql copy runs with the privileges of the user you're connecting to the server as, so it doesn't require superuser or local file read/write/execute permissions like SQL COPY does.

Putting Data in SQL Directly

As an alternative to storing data in separate CSV files (which are loaded with SQL or psql commands), you can store data in SQL files directly.

SQL COPY from stdin and pg_dump

The SQL COPY and psql copy commands can load data from stdin instead of a file. They will parse and load all the lines between the copy command and \. as rows of data.

    -- Excerpt from add-data-copy-stdin.sql in the sample code repo
    COPY public.artists (artist_id, name) FROM stdin CSV;
    1,"DJ Okawari"
    2,"Steely Dan"
    3,"Missy Elliott"
    4,"TWRP"
    5,"Donald Fagen"
    6,"La Luz"
    7,"Ella Fitzgerald"
    \.

    COPY public.albums (album_id, artist_id, title, released) FROM stdin CSV;
    1,1,"Mirror",2009-06-24
    2,2,"Pretzel Logic",1974-02-20
    3,3,"Under Construction",2002-11-12
    4,4,"Return to Wherever",2019-07-11
    5,5,"The Nightfly",1982-10-01
    6,6,"It's Alive",2013-10-15
    7,7,"Pure Ella",1994-02-15
    \.

    ...

In fact, this COPY ... FROM stdin approach is how pg_dump (https://www.postgresql.org/docs/current/app-pgdump.html) outputs data if you're creating a dump or backup from an existing PostgreSQL database. However, pg_dump uses a tab-separated format by default, rather than the comma-separated format shown above.

By default, pg_dump also outputs SQL to re-create everything about the database (tables, constraints, views, functions, reset sequences, etc.), but you can instruct it to output only data with the --data-only flag. To try out pg_dump with the example Docker image, run:

    docker exec --workdir=/repo postgres \
        pg_dump --host=localhost --username=postgres postgres

SQL INSERTs

Another way to put data directly in SQL is to use INSERT statements. This approach could look like the following:

    -- Excerpt from add-data-insert-static-ids.sql in the sample code repo
    INSERT INTO artists (artist_id, name)
    OVERRIDING SYSTEM VALUE
    VALUES
      (1, 'DJ Okawari'),
      (2, 'Steely Dan'),
      (3, 'Missy Elliott'),
      (4, 'TWRP'),
      (5, 'Donald Fagen'),
      (6, 'La Luz'),
      (7, 'Ella Fitzgerald');

    INSERT INTO albums (album_id, artist_id, title, released)
    OVERRIDING SYSTEM VALUE
    VALUES
      (1, 1, 'Mirror', '2009-06-24'),
      (2, 2, 'Pretzel Logic', '1974-02-20'),
      (3, 3, 'Under Construction', '2002-11-12'),
      (4, 4, 'Return to Wherever', '2019-07-11'),
      (5, 5, 'The Nightfly', '1982-10-01'),
      (6, 6, 'It''s Alive', '2013-10-15'),
      (7, 7, 'Pure Ella', '1994-02-15');

    ...

The OVERRIDING SYSTEM VALUE clause lets us INSERT values into the primary key ID columns explicitly even though they are defined as GENERATED ALWAYS.

The pg_dump command's --column-inserts option will output data as INSERT statements (a separate statement per row), rather than as the default TSV format. Using INSERTs instead of COPY will run much slower when restoring the data, so this is only recommended if you're restoring the data to a database that doesn't support COPY, such as sqlite3. Using INSERTs can be sped up somewhat with the --rows-per-insert option, allowing you to INSERT many rows at a time per command, reducing the overhead of back-and-forth communication between client and server for every SQL statement.

Using INSERT statements, we could start moving away from statically declaring everything about our datasets: we could omit the primary key ID columns and look up IDs as needed when inserting foreign keys, as in the following example:

    -- Excerpt from add-data-insert-queried-ids.sql in the sample code repo
    INSERT INTO artists (name)
    VALUES
      ('DJ Okawari'),
      ('Steely Dan'),
      ('Missy Elliott'),
      ('TWRP'),
      ('Donald Fagen'),
      ('La Luz'),
      ('Ella Fitzgerald');

    INSERT INTO albums (artist_id, title, released)
    VALUES
      ((SELECT artist_id FROM artists WHERE name = 'DJ Okawari'), 'Mirror', '2009-06-24'),
      ((SELECT artist_id FROM artists WHERE name = 'Steely Dan'), 'Pretzel Logic', '1974-02-20'),
      ((SELECT artist_id FROM artists WHERE name = 'Missy Elliott'), 'Under Construction', '2002-11-12'),
      ((SELECT artist_id FROM artists WHERE name = 'TWRP'), 'Return to Wherever', '2019-07-11'),
      ((SELECT artist_id FROM artists WHERE name = 'Donald Fagen'), 'The Nightfly', '1982-10-01'),
      ((SELECT artist_id FROM artists WHERE name = 'La Luz'), 'It''s Alive', '2013-10-15'),
      ((SELECT artist_id FROM artists WHERE name = 'Ella Fitzgerald'), 'Pure Ella', '1994-02-15');

    ...

This is hardly convenient, though, because we need to duplicate other row information (such as the artist name) in order to look up the corresponding ID. It gets even more complex if multiple artists have the same name! So, if you have a static dataset, I'd suggest sticking to one of the previously mentioned approaches that use SQL COPY or psql copy.

Putting Data in CSVs vs in SQL Files

Is there a reason to prefer putting static datasets in CSVs or directly in SQL files? My thoughts boil down to the following points:

- CSVs are a widely understood and supported format (just make sure to be clear and consistent with encoding!). If your datasets will be maintained or created by people who prefer spreadsheet programs to database-admin and command-line tools, CSVs may be preferable.
- If you want to keep all your test data and database setup in one place, SQL files are a convenient way to do that.
- If your testing or continuous integration processes use pg_dump or its output, then you're already using datasets embedded in an SQL file; keep doing what makes sense for you!

I hope you learned something new and useful about the different approaches and tools available for loading static datasets into PostgreSQL. Check back soon for the follow-up article that explores generating test data for PostgreSQL using SQL, PL/pgSQL, and Python!

If you have any suggestions or corrections, please let me know or send us a tweet, and if you're curious to learn more about how we improve perception sensors, visit us at Tangram Vision.
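
A postscript on the COPY ... FROM stdin format described earlier: if you'd rather not hand-write those data lines, a short script can render them for you. The following is a minimal sketch, not something from the sample code repo. The function name rows_to_copy_block is hypothetical, and the code assumes table and column names are trusted identifiers that need no quoting. It relies on Python's standard csv module, which doubles embedded double quotes and quotes fields containing commas, matching what COPY expects in CSV mode.

```python
import csv
import datetime
import io


def rows_to_copy_block(table, columns, rows):
    """Render rows as a COPY ... FROM stdin CSV; block with an end-of-data marker.

    Hypothetical helper for illustration only. Table and column names are
    inserted as-is, so pass only trusted identifiers.
    """
    buffer = io.StringIO()
    # csv.writer quotes fields only when needed (commas, quotes, newlines)
    # and writes None as an empty field, which COPY's CSV mode reads as NULL.
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    header = f"COPY {table} ({', '.join(columns)}) FROM stdin CSV;\n"
    # A backslash followed by a period on its own line marks the end of data.
    return header + buffer.getvalue() + "\\.\n"


if __name__ == "__main__":
    albums = [
        (6, 6, "It's Alive", datetime.date(2013, 10, 15)),
        (7, 7, "Pure Ella", datetime.date(1994, 2, 15)),
    ]
    print(
        rows_to_copy_block(
            "public.albums",
            ["album_id", "artist_id", "title", "released"],
            albums,
        ),
        end="",
    )
```

Redirect the output to a .sql file and run it with psql --file, as in the docker exec commands above. One caveat of this sketch: Python's csv module writes both None and an empty string as an empty field, so it can't distinguish NULL from an empty string; a real dataset with empty strings would need explicit handling.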
Previously published at https://www.tangramvision.com/blog/loading-test-data-into-postgresql

How To Create and Load Test Data in PostgreSQL
