Setting up Continuous PostgreSQL Backups

This manual describes the process of setting up continuous backups for PostgreSQL databases.

Our company (business to business) provides each client with a server, on which the PostgreSQL database and our software are installed. Thus, we do not have a single production instance but dozens of them with different database instances. The process of setting up backups is part of the production installation process, and the backup itself begins before the system goes into production and continues throughout the entire life cycle of cooperation with the client.

We determine the specification of hardware and underlying software, so all instances, as a rule, have the same versions of Linux and PostgreSQL. Occasionally, this invariant may be violated. For example, for one reason or another, it may not be Ubuntu, but Debian or PostgreSQL of an older major version. In the latter case, you need to be especially careful – in the event of a failure, you should keep in mind that the database should be restored to the same major version of PostgreSQL which was used for the backup. The fact is that the described approach requires binary compatibility of data files, which is guaranteed only when switching between minor PostgreSQL versions. What to do if this invariant is violated is also described at the end of this article.

It is implied that the reader may have very little knowledge of both PostgreSQL backups and working with Linux (and in particular with its command-line tools), so the tutorial is written in as much detail as possible.

Basic Concept of Continuous Archiving and Point-in-Time Recovery

In this section, we outline the main ideas behind the Continuous Archiving and Point-in-Time Recovery approaches. See the PostgreSQL documentation for details if necessary.

Perhaps just like any DBMS, PostgreSQL has data files in which it stores the current state of the database. However, PostgreSQL also maintains and stores changelogs in the database. These logs are in the form of so-called write-ahead logs (WAL) files, which are stored in the pg_wal subfolder of the data directory.

WAL files are used by PostgreSQL for failure protection. Simply put, when committing a transaction, PostgreSQL makes sure that it is the changes in the corresponding WAL file that are guaranteed to be saved to disk, but generally speaking, it may not perform the same check for data files, for example, when caching their changes until a certain point in the operational memory. If after the last guaranteed save of the data files to disk (checkpoint) there is a failure and the current state has not yet been saved, then after PostgreSQL is back online, it will take the last data file checkpoint (let’s call it base backup) and sequentially apply the changes saved in the WAL files to it (replay log entries).

The model described above can also be used for backup with subsequent recovery (for example) on another server. To do this, it is required to implement the following scheme:

First (once), backup the data files (and send the files to a safe place on another server)
In the process of work, as WAL files appear, also backup them

Thus, the failure recovery model is also clear:

On the new PostgreSQL server, load the data files from the backup (to put it simply, this corresponds to copying them to the appropriate working data folder)
On the new PostgreSQL server, load the WAL files from the backup (to put it simply, this corresponds to copying them to the appropriate subfolder of the working data folder)

It is worth bearing in mind the following features of the backup model in question:

All databases of the current PostgreSQL server are backed up at once, i.e. there is no way to backup, for example, only one of them
There is no need to take a base backup of all the WAL records – you can just stop at some point in time (point-in-time recovery). This feature allows you to restore the database even if the failure occurs due to a software error (for example, some table was deleted using the drop table command, etc.).

The following sections provide the technical details of the scheme described.

For this purpose, the following terms will be used: the target server is the current working server from which we make backups, the backup server is the server to which backups are sent, and the new server is the server on which the backup of the target server is restored after a failure.

Off-topic: a few words on security

Since backups are so valuable to businesses, both in terms of failure recovery and possible information leaks, it is worth noting the basic security of Linux servers. In general, this is not directly related to the topic under consideration. Therefore, you can just skim through this section and, if everything is the same for you, just move on. If your security solutions are better than those described, please share them in the comments and tell us about your experience. If you think that your security is worse, then the solutions described below should probably be applied to your productions as soon as possible. We will consider Ubuntu 18.04. For other versions of Linux, the instructions may differ.

1. Your system is up to date

sudo apt-get update && sudo apt-get upgrade

2. You do not work as a “root” user, you have a personalized user instead (i.e. each employee who needs access to the server has a separate user) in the sudo group (in our case, let’s call it alex)

sudo adduser alex
sudo adduser alex sudo
sudo su - alex
sudo ls -la /root
ls -la /root

3. When connecting via SSH, you do not use login/password authorization, using access by key instead

ssh-keygen -t rsa -b 4096
nano ~/alex.pub
mkdir -p ~/.ssh
touch ~/.ssh/authorized_keys
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
cat ~/alex.pub >> ~/.ssh/authorized_keys
rm ~/alex.pub
sudo reboot

4. In the ssh config, the access by key is enabled. What’s more, the ssh access port is changed from standard to random ([generatedPortNumber] macro substitution in the script below) in the range from 1024 to 57256 (because often port scanners only check standard ports to save resources)

sudo nano /etc/ssh/sshd_config

PubkeyAuthentication yes
AuthenticationMethods publickey
Port [generatedPortNumber]

sudo reboot

5. Root access via ssh is denied, password access via ssh is denied

sudo nano /etc/ssh/sshd_config

PermitRootLogin no
ChallengeResponseAuthentication no
PasswordAuthentication no
UsePAM no

sudo reboot

Setting up a backup server

Creating backup folders

Since we have many installations for backups, the folder structure will be as follows. The folder for storing backups is called /var/lib/postgresql/backups. For each installation, a subfolder bearing the client’s name [clientName] is created in the backup folder. Each such subfolder will contain 2 folders: base for the base backup and wal for continuous archiving WAL files from the target server. Thus, when setting up a backup for the client [clientName], we execute the following commands to create the corresponding directories:

sudo mkdir -p /var/lib/postgresql/backups/[clientName]/base
sudo mkdir -p /var/lib/postgresql/backups/[clientName]/wal

Of course, the folder structure can be any other that you find more convenient for you. For example, if you store backups in the home directories of the corresponding users (see below), then in the next paragraph you can save a couple of commands that give read and write permissions to these folders.

Creating a user

For security purposes, each production (target server) that backups come from will have its user on the backup server. This is a normal user with limited permissions (not sudoer), which should have read and write permissions to its backup folder, but which has no read or write permissions to other production folders. Thus, even if this target server is compromised, this will not leak data from other productions through the backup server.

The process in more detail is as follows. Let’s say we’re setting up a backup for a project named foo. To configure the backup server, we initially go to it as a user from sudo. Then add a limited user foobackup:

# Добавить нового пользователя
sudo adduser foobackup
Adding user 'foobackup' ...
Adding new group 'foobackup' (1001) ...
Adding new user 'foobackup' (1001) with group 'foobackup' ...
Creating home directory '/home/foobackup' ...
Copying files from '/etc/skel' ...
Enter new UNIX password:
Retype new UNIX password:
passwd: password updated successfully
Changing the user information for foobackup
Enter the new value, or press ENTER for the default
        Full Name []: AlexGtG
        Room Number []:
        Work Phone []:
        Home Phone []:
        Other []:
Is the information correct? [Y/n] Y

Granting backup folder permissions to the user

We give foobackup read and write access to its folder, while others are deprived of these permissions:

sudo chown -R foobackup: /var/lib/postgresql/backups/foo
sudo chmod -R 700 /var/lib/postgresql/backups/foo

Setting up public keys

To complete the setup of the backup server, our new user foobackup needs to be given remote access so that it can send backups to the server. As before, access will be carried out by access keys and only by them.

On the target server, backups are made on behalf of the postgres user, so we generate (if this has not been done before) the private and public keys for it:

sudo su - postgres
ssh-keygen -t rsa -b 4096

Generating public/private rsa key pair.
Enter file in which to save the key (/var/lib/postgresql/.ssh/id_rsa):
/var/lib/postgresql/.ssh/id_rsa already exists.
Overwrite (y/n)? y
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /var/lib/postgresql/.ssh/id_rsa.
Your public key has been saved in /var/lib/postgresql/.ssh/id_rsa.pub.

Copy the contents of the public key from the location specified by the system during generation (Your public key has been saved in...). In our case, it is /var/lib/postgresql/.ssh/id_rsa.pub:

vi /var/lib/postgresql/.ssh/id_rsa.pub

Save the public key to the backup server for the foobackup user:

sudo su - foobackup
nano ~/foobackup.pub
mkdir -p ~/.ssh
touch ~/.ssh/authorized_keys
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
cat ~/foobackup.pub >> ~/.ssh/authorized_keys
rm ~/foobackup.pub

Checking if the copy works

Now let’s check if everything has been done correctly. In our scheme, scp will be responsible for copying backups from the target server to the backup server. It is a utility for copying files to a remote server with a syntax similar to the local counterpart – cp. In general, you can use other means of file delivery, such as rsync, etc.

scp /var/lib/postgresql/backups/test.txt foobackup@[backupServerIp]:/var/lib/postgresql/backups/foo/base
scp /var/lib/postgresql/backups/test.txt foobackup@[backupServerIp]:/var/lib/postgresql/backups/foo/wal

After copying, go to the backup server and make sure that the file has been copied, for example, like this:

test -f /var/lib/postgresql/backups/foo/base/test.txt && echo 'exists in base' || echo 'not exists in base'
test -f /var/lib/postgresql/backups/foo/wal/test.txt && echo 'exists in wal' || echo 'not exists in wal'

When the checks are completed, the test.txt file should be deleted from the corresponding folders of the backup server.

Target server setup

PostgreSQL configuration

Archiving without compression

In the postgresql.conf file (its location can be obtained by running the “SHOW config_file;” command from psql), make the following changes (substitute [backupServerIp] = IP address of the backup server, [clientName] = foo):

wal_level = replica
archive_mode = on
archive_command = 'cat %p | ssh foobackup@[backupServerIp] "set -e; test ! -f /var/lib/postgresql/backups/[clientName]/wal/%f; cat > /var/lib/postgresql/backups/[clientName]/wal/%f.part; sync /var/lib/postgresql/backups/[clientName]/wal/%f.part;  mv /var/lib/postgresql/backups/[clientName]/wal/%f.part /var/lib/postgresql/backups/[clientName]/wal/%f"'      
archive_timeout = 3600

In archive_command, the first thing is to check if a file with the same name already exists on the backup server. This is one of the requirements of the PostgreSQL documentation, which is aimed at protecting against the destruction of the backup integrity due to administrative errors – when, for example, backups from two different servers are set by mistake to the same folder. After that, the file is copied over the network (scp) using a fairly standard approach: first, the stream is written to a temporary file, and then, only if it is completely copied and flushed to disk (sync), its name (mv) changes from temporary (.part) to permanent. If an error occurs after renaming the file and the metadata about such renaming has not been flushed to disk, the script will return with an error and PostgreSQL will simply resend the file. An error at any step of the script will end the entire script with a non-zero code (set -e).

Each WAL file is 16MB and its archiving (in our case, “archiving” means sending backups to the server) occurs only after it is full. Thus, if a given client generates little database traffic, the backup of the current WAL file may not occur for an indefinitely long time. To be able to restore the version of the database in case of a failure, for example, not more than an hour old, you need to set the archive_timeout time for forced archiving (the interval after which even an incomplete WAL file is archived) to 1 hour – i.e. 3600 seconds. You should not set too small values since even incomplete WAL files take up 16MB – thus, in a given period, at least 16MB of data will be sent to the backup server. For example, with archive_timeout equal to one hour, at least 384 MB of data will be sent to the backup server per day. This is more than 2 GB per week.

Compressed archiving

Given the possible problems with the backup size growth, you can immediately compress and send already compressed WAL files to the backup server (we do this in our productions). In this case, archive_command will look like this:

archive_command = 'gzip -c -9 %p | ssh foobackup@[backupServerIp] "set -e; test ! -f /var/lib/postgresql/backups/[clientName]/wal/%f.gz; cat > /var/lib/postgresql/backups/[clientName]/wal/%f.gz.part; gzip -t /var/lib/postgresql/backups/[clientName]/wal/%f.gz.part; sync /var/lib/postgresql/backups/[clientName]/wal/%f.gz.part;  mv /var/lib/postgresql/backups/[clientName]/wal/%f.gz.part /var/lib/postgresql/backups/[clientName]/wal/%f.gz"'       # command to use to archive a logfile segment

After the changes made in postgresql.conf, you need to restart PostgreSQL:

sudo service postgresql restart

Creating a base backup

A base backup is created using the pg_basebackup utility that has been installed on the target server along with PostgreSQL. It is assumed that the PostgreSQL of the target server has a certain trusted user with sufficient privileges to backup all databases of the current PostgreSQL installation (for example, this can be an administrator account used to maintain databases or a separate user specially created to make base backups). This username must be used in the [trusted db user]. During the execution of the command, the password of this user will be requested. After creating a backup, immediately send it to the backup server.

sudo -i -u postgres
pg_basebackup --pgdata=/tmp/backups --format=tar --gzip --compress=9 --label=base_backup --host=127.0.0.1 --username=[trusted db user] --progress --verbose
base.tar.gz и pg_wal.tar.gz,
scp /tmp/backups/base.tar.gz /tmp/backups/pg_wal.tar.gz foobackup@[backupServerIp]:/var/lib/postgresql/backups/[clientName]/base
exit

The --progress and --verbose keys are optional and are used to visually monitor the backup creation process – if these keys are present, PostgreSQL brings some additional human-readable information to the console.

Backup recovery

Backup functionality check

Before putting our foo installation into production, we need to make sure that the configured backup system is working. The check should consist of two parts:

The base backup is sent to the backup server
The WAL files are sent to the backup server

Point 1 is checked as follows: on the backup server, go to the /var/lib/postgresql/backups/foo/base folder and make sure that it contains the base.tar.gz and pg_wal.tar.gz files.

To quickly – without a long time of monitoring the system – check point 2, let’s return to the PostgreSQL settings and change the archiving timeout (archive_timeout) to 60 seconds, and then restart PostgreSQL. Now, if there are changes in the database, WAL files will be archived at least once a minute. Then for some time (3-5 minutes), we will be generating changes in the database in any (safe) way – for example, we do it simply through our front, manually creating activity by test users.

At the same time, you need to monitor the /var/lib/postgresql/backups/foo/wal folder of the backup server, where a new file will appear about once a minute.

After checking, it is very important to return the archive_timeout value to the production value (depending on the client, we have a minimum of 1 hour, i.e. 3600, and a maximum of a day, i.e. 86400).

If you see that the files are not sent to the backup server, you can start investigating the problem by analyzing the PostgreSQL logs located here /var/log/postgresql. For example, if the private-public key pair has been configured incorrectly, you can see a similar entry in the postgresql-10-main.log file (the name of the log file depends on the version installed):

2019-09-02 15:48:52.503 UTC [12983] DETAIL:  The failed archive command was: scp pg_wal/00000001000000000000003B [fooBackup]@[serverBackupIp]:/var/lib/postgresql/backups/foo/wal/00000001000000000000003B
Host key verification failed.
lost connection

Restoring from a backup

Let us have a pre-arranged new server, which has PostgreSQL of the same major version as on the target server. Let’s also assume that we have previously copied the backup folder /var/lib/postgresql/backups/foo from the backup server to a new server along the same path.

The following steps describe the procedure for deploying this backup on a new server.

Let’s find out the storage path for the data files:

show data_directory;

Depending on the version of PostgreSQL, something like this will be displayed: /var/lib/postgresql/10/main

Delete all contents of the data folder:

sudo service postgresql stop
sudo rm -rf /var/lib/postgresql/10/main/*

Stop the PostgreSQL instance on the TARGET server:

If, after a failure, access to the target server is preserved, stop the PostgreSQL instance there to lose as little new data as possible that will not be included in the backup.

sudo service postgresql stop

Unpack the files from the backup to the PostgreSQL data folder:

Here and below we are working with a new server again.

sudo tar xvzf /var/lib/postgresql/backups/[clientName]/base/base.tar.gz -C /var/lib/postgresql/10/main
sudo tar xvzf /var/lib/postgresql/backups/[clientName]/base/pg_wal.tar.gz -C /var/lib/postgresql/10/main

The tar command, which is run from sudo, is expected to preserve the group and ownership of the unpacked files under the postgres user – this is important since PostgreSQL running as this user will then use them.

In the data folder, create a recovery config:

If there is no compression when archiving WAL files:

nano /var/lib/postgresql/10/main/recovery.conf
restore_command = 'cp /var/lib/postgresql/backups/[clientName]/wal/%f %p'

sudo chown postgres:postgres /var/lib/postgresql/10/main/recovery.conf
sudo chmod 600 /var/lib/postgresql/10/main/recovery.conf

sudo chown -R postgres:postgres /var/lib/postgresql/backups/[clientName]/wal
sudo chmod 700 /var/lib/postgresql/backups/[clientName]/wal
sudo chmod 600 /var/lib/postgresql/backups/[clientName]/wal/*

If there is compression when archiving WAL files, restore_command should look like this (everything else remains the same):

restore_command = 'gunzip -c /var/lib/postgresql/backups/[clientName]/wal/%f.gz > %p'

Start PostgreSQL:

sudo service postgresql start

Having found the restore config in the data folder, PostgreSQL enters the recovery mode and starts replaying the WAL files from the archive. After the recovery is finished, recovery.conf will be renamed to recovery.done (so, in the previous step it is very important to give the postgres user permission to modify the file). After this step, the PostgreSQL server is ready to go. If there is an indication that the database has not been restored or has been restored only to the level of the base backup, you can start investigating the problem by analyzing the PostgreSQL logs located here /var/log/postgresql. For example, if in the previous step the postgres user is not granted permissions to the folder containing backed-up WAL files, you can see a similar entry in the postgresql-10-main.log file (the name of the log file depends on the version installed):

2019-09-04 11:52:14.532 CEST [27216] LOG:  starting archive recovery
cp: cannot stat '/var/lib/postgresql/backups/foo/wal/0000000100000000000000A8': Permission denied

Migrating a PostgreSQL database between different major versions

In this paragraph, as promised at the beginning of the article, we will consider the case when you need to transfer the base from one major version to another. Since the main backup scenario for us is the approach described above, here we will assume that the target server is available and operational – that is the database must be transferred not because of a failure, but for other (organizational) reasons.

We will use the pg_dump or pg_dumpall utility. Both of them generate a set of SQL commands. The first is used to create a backup of a specific database, and the second is used for backing up the entire cluster. In the latter case, in addition to databases, the global objects of the cluster are also copied – for example, roles. After deploying the backup, this eliminates additional actions, such as creating missing roles, distributing privileges, etc.

Creating a dump

Execute on the target server (at this moment the consumers of its databases should be stopped and/or not active concerning the database):

sudo -i -u postgres
pg_dumpall -c | gzip -c > /tmp/backups/foo.dump.gz

Inserting a dump

Let’s assume that the dump made in the previous step is copied to a new server along the same path (/tmp/backups/foo.dump.gz).

On a new server:

sudo -i -u postgres
gunzip -c /tmp/backups/foo.dump.gz | psql -d postgres &> /tmp/backups/foo.restore.out

After the backup is restored, you need to view the recovery log. By default, psql runs with the ON_ERROR_STOP flag turned off, so if errors occur, the script does not stop but continues running.

Enabling services

After the dump is added, you need to:

Set up a new server backup
Switch the relevant services to the new database address and enable them for the user

After that, the migration can be considered complete.