The Dice Project

replication.html,v 1.17 2022/01/27 17:55:31 gdutton Exp

Replication on DICE PostgreSQL Servers


Since version 9.0, PostgreSQL has supported various forms of replication and as of 9.2 a complete hot-standby system can be configured. This guide covers a few basic recipes on PostgreSQL 9.2 using lcfg-postgresql 4.0 or newer. It was written for DICE SL6 where the postgresql-9.2-server.h header is assumed.

This guide presently covers streaming replication but without log shipping. It is assumed that a consistent backup will be made available using the streaming replication mechanism. However, log shipping (with or without a base backup from which to begin) is supported by the postgresql component as of 4.0, and there should be no obstacle to its configuration in LCFG.

Summary Usage Instructions

Full configuration instructions follow. But in practice, those wishing to replicate may prefer simply to use the pre-configured replication headers, thus:

On the master:

#include <dice/options/postgresql-replication-master.h>
PSQL_REPLICATION_STANDBY(standby1,<standby 1 ip>)
PSQL_REPLICATION_STANDBY(standby2,<standby 2 ip>)
PG_OPTION(max_wal_senders,3) /* no. replicas + 1 */

Then, on each standby:

#define PSQL_MASTER_HOSTNAME <master>.inf.ed.ac.uk
#include <dice/options/postgresql-replication-standby.h>

Restart the postgresql component on all machines (master first) to activate. On each standby you will also need to destroy any existing database (by physically moving aside the data directory) so that it can be recreated from a base backup.

Configuration In Detail

Configuring the Master

Configuring the master server requires just two main changes: adding and authorising a replication user, and configuring replication itself.

Adding the replication user

This is straightforward and can additionally make use of the pgluser component to automate the addition and removal of a PostgreSQL role with appropriate permissions. For example:

/* Assuming a pgluser "user" capability looking something like the following... */
pgluser.select_user            SELECT rolname FROM pg_roles WHERE rolinherit;
pgluser.create_user            CREATE ROLE "%(role)s"; ALTER ROLE "%(role)s" LOGIN;
pgluser.drop_user              REASSIGN OWNED BY "%(role)s" TO postgres; DROP ROLE IF EXISTS "%(role)s";

/* Users can be authorised for replication with a configuration like this... */
!pgluser.caps                  mADD(repl)
pgluser.implies_repl           user
pgluser.comment_repl           Grants replication privileges to a user.
pgluser.select_repl            SELECT rolname FROM pg_roles WHERE rolreplication;
pgluser.create_repl            ALTER ROLE "%(role)s" REPLICATION;
pgluser.drop_repl              ALTER ROLE "%(role)s" NOREPLICATION;

Using the above configuration, creating a new user on the database with replication permission is as simple as:

!pgluser.users_repl            mADD(replicator)

This user will of course require connection privileges, and more often than not these will come from a keytab-based server-to-server connection. So, assuming a standby server named standby.inf.ed.ac.uk:

!postgresql.hostssl            mADD(standby1)
postgresql.ssldatabase_standby1  replication
postgresql.ssluser_standby1      replicator
postgresql.ssladdress_standby1   129.215.x.y
postgresql.sslauthtype_standby1  krb5
postgresql.sslauthopt_standby1   map=repl

!postgresql.ident              mADD(standby1)
postgresql.idmap_standby1        repl
postgresql.sysname_standby1      pgstandby/standby.inf.ed.ac.uk
postgresql.pgname_standby1       replicator

This grants "pgstandby" (likely a principal owned by the postgres user on the standby) full replication capabilities. Normally you'd expect a macro to perform most of the latter configuration in one step.
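As a quick sanity check on the master, you can confirm that the role has been granted replication rights (the role name replicator follows the example above; this of course needs a running cluster):

```shell
# Run on the master as the postgres user; confirms the role can replicate.
psql -U postgres -Atc \
  "SELECT rolname, rolreplication FROM pg_roles WHERE rolname = 'replicator';"
```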

Configuring Replication

How best to configure replication depends on your circumstances, and it is best to read the copious documentation on the PostgreSQL website before deciding on a course of action. However, it can be as simple as:

!postgresql.options                   mADD(max_wal_senders wal_level)
/* How many simultaneous replication connections to allow.
 * You will need one spare if you want to use pg_basebackup -X stream */
postgresql.option_max_wal_senders     2
/* Writes sufficient data into WAL segments to allow hot standby service */
postgresql.option_wal_level           'hot_standby'

If you'd also like to make replication synchronous by default (this can always be altered on a per-command basis):

/* Set to the "appname" of a connecting server if you wish to set a preference,
 * otherwise "*" to make the first connecting standby a synchronous one. */
!postgresql.options                   mADD(synchronous_standby_names)
postgresql.option_synchronous_standby_names  'standby1'

Setting up the standby

Noting that the master's configuration above requires Kerberos to connect, we set up a job to renew the connecting credentials.

!kerberos.keys              mADD(pgstandby)
kerberos.keytab_pgstandby     /etc/pgstandby.keytab
kerberos.keytabuid_pgstandby  <%postgresql.pgowner%>
kerberos.keytabgid_pgstandby  <%postgresql.pggroup%>

postgresql.krb5ccname       /path/to/pgstandby.k5cache
!cron.additions             mADD(pgstandby)
cron.add_pgstandby            @reboot /usr/bin/k5start -b -L -f <%kerberos.keytab_pgstandby%> -U \
    -g <%postgresql.pggroup%> -o <%postgresql.pgowner%> \
    -K 10 -k <%postgresql.krb5ccname%> -p /var/run/k5start.pgstandby.pid
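Once the k5start job has run, you can verify that the cache holds valid credentials (the cache path echoes the postgresql.krb5ccname placeholder above):

```shell
# List the tickets in the standby's credentials cache maintained by k5start.
env KRB5CCNAME=/path/to/pgstandby.k5cache klist
```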

Remember that standby servers should have as similar a configuration as possible to the master. Mismatches in options (even where unrelated to data, such as max_connections) will cause the standby to fail to start. Beyond any regular postgresql configuration, you will require the following.

!postgresql.recopts                 mADD(standby_mode primary_conninfo)
postgresql.recopt_standby_mode      'on'
postgresql.recopt_primary_conninfo  'host=master.inf.ed.ac.uk user=replicator'

If you wish to name the server specially, for example to mark it out for synchronous replication, simply set the application_name parameter in the primary_conninfo resource above.
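For example, to present this standby to the master as standby1 (the name used in the synchronous_standby_names example above), one might write:

```
postgresql.recopt_primary_conninfo  'host=master.inf.ed.ac.uk user=replicator application_name=standby1'
```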

Adding the above invokes the creation of the recovery.conf file, which marks the server as being a recovery server. On top of the above we can now configure additional features:

/* Enable hot standby, allowing the standby to serve read-only queries */
!postgresql.options                 mADD(hot_standby)
postgresql.option_hot_standby       'on'

/* Specify a trigger file whose presence should cause streaming replication to fail over */
!postgresql.recopts                 mADD(trigger_file)
postgresql.recopt_trigger_file      '/path/to/trigger/file'
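To fail over, simply create that file on the standby; the server then leaves recovery and begins accepting writes (the path here is the placeholder from the resource above):

```shell
# Promotes the standby: postgresql notices the trigger file and exits recovery.
touch /path/to/trigger/file
```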

/* For log shipping, you will also need...
 * !postgresql.recopts              mADD(restore_command archive_cleanup_command)
 * postgresql.recopt_restore_command            (see PostgreSQL docs)
 * postgresql.recopt_archive_cleanup_command    (see PostgreSQL docs)
 */

Last but not least we can alter the default initialisation mode of the component. This option allows us to avoid a manual reload on first connection to the master server.

postgresql.initmode         basebackup
postgresql.basebkargs       -X stream -h master.inf.ed.ac.uk -U replicator
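For reference, the basebackup init mode is broadly equivalent to running pg_basebackup by hand as the postgres user; a sketch, assuming the master is master.inf.ed.ac.uk and the standby's data directory is /disk/data/db:

```shell
# Stream a fresh base backup of the master into an empty data directory.
# -X stream fetches the needed WAL over a second connection (hence the
# spare wal sender noted earlier).
pg_basebackup -D /disk/data/db -X stream -h master.inf.ed.ac.uk -U replicator
```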

Replication can now be started. The changes above require the master server to be fully restarted:

master$ om postgresql restart

Then start replication on the standby server. However, if the standby already has a data directory, it should be moved aside so that the component can start afresh with the basebackup command.

standby$ om postgresql stop       # if required
[OK] postgresql: stop
standby$ qxprof postgresql.pgdata
standby$ mv /disk/data/db{,_old}
standby$ om postgresql start
[INFO] postgresql: Initializing database (from basebackup)...

Monitoring for replication failure

Replication failure is easy to spot but not presently automated. On the master server, you should see all connected replicating servers:

=> SELECT application_name, usename, state, sync_state FROM pg_stat_replication;
 application_name | usename |   state   | sync_state
------------------+---------+-----------+------------
 host_name        | repl    | streaming | async
(1 row)

Absence of a row matching the appropriate standby hostname (in the 'application_name' column) indicates loss of a standby. On the standby itself, expect to see errors logged, but note that failure to synchronise will not otherwise disable the server. Most commonly with a streaming-only server, this would be because the master is too far ahead of the standby:

FATAL:  could not receive data from WAL stream: FATAL:  requested WAL segment 000000010000000D00000021 has already been removed

The simplest solution is of course just to stop the postgresql component on the standby server, move the old data aside and start the component again to restore from a fresh basebackup.
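On the standby itself, recovery status and approximate replay lag can be inspected with the built-in functions (pg_last_xact_replay_timestamp is available from 9.1 onwards):

```
=> SELECT pg_is_in_recovery();
=> SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;
```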


Please contact us with any comments or corrections.
Unless explicitly stated otherwise, all material is copyright The University of Edinburgh