The Dice Project


Introduction

As part of this project, it was necessary to investigate the existing replication technologies available for use with OpenLDAP to see how well they suited our requirements.

To begin with, I will briefly describe our current replication set-up. We have one master LDAP server, where all updates are made. All other DICE machines are essentially acting as replicas - they all run slapd, with a full copy of the database. They are kept in sync with the master server using a locally-written program - slaprepl(8) (ldaprepl(8) under FC3). Its basic operation is well summarised by this part of the man page...

"It works as follows. Given that we know the latest modification date of the local directory, we query a remote directory for all entries modified after that time (picking up new entries and updated entries), then merge them with the local directory. This handles the cases of new or updated entries, but not deleted entries. To handle deletion we obtain a list of all of the entries (but not their attributes or values) in both directories, then delete all those that are in the local directory, but not in the remote one. As deletion is a relatively slow operation, its use is optional."

This is controlled via cron, with the basic replication occurring hourly (at a random time within the hour) and the full replication (i.e. including deletes) occurring daily. The main problem with this technique is the load it places on the server, with every client connecting hourly. In addition, it is perhaps not ideal, in resource or management terms, to run a full copy of slapd on every DICE machine.
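
The basic (hourly) and full (daily) passes described above can be sketched roughly as follows. This is only an in-memory illustration and not the slaprepl code itself: directories are modelled as dicts mapping a DN to a (modification time, attributes) pair, and all function names are hypothetical.

```python
from typing import Dict, Tuple

# Hypothetical in-memory model: DN -> (modification timestamp, attributes).
Entry = Tuple[int, Dict[str, str]]
Directory = Dict[str, Entry]


def latest_modification(local: Directory) -> int:
    """Return the newest modification time in the local directory (0 if empty)."""
    return max((mtime for mtime, _ in local.values()), default=0)


def merge_updates(local: Directory, remote: Directory) -> int:
    """Basic pass: copy entries modified remotely after our latest local change."""
    since = latest_modification(local)
    changed = {dn: entry for dn, entry in remote.items() if entry[0] > since}
    local.update(changed)
    return len(changed)


def apply_deletes(local: Directory, remote: Directory) -> int:
    """Full pass only: delete entries that exist locally but not remotely."""
    stale = set(local) - set(remote)
    for dn in stale:
        del local[dn]
    return len(stale)
```

In this model the hourly job would call only merge_updates(), while the daily job would call merge_updates() followed by apply_deletes(). Note that the deletion step needs the full list of DNs from both sides, which matches the man page's remark that it is the slower operation.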

Investigations

We wanted to investigate alternatives to this method, with a particular model in mind. From the "OpenLDAP Replication and Server Configuration" project case statement...

"The tentative initial plan on how our server/client set-up should look is to have one master server, with a slave at each site - these slaves would be kept up to date with the master using syncrepl replication. Each individual client would then run a proxy-caching LDAP server, falling back to the relevant site-slave."

OpenLDAP currently offers three different replication options - slurpd, syncrepl and delta-syncrepl (an extension to syncrepl). A good overview of the pros and cons of each method is given here

It is worth clarifying the following sentence from this document...

"Syncrepl is the newest form of replication. It allows replica servers to connect to the master to pull changes."

Although it is true that the replica always initiates connection with the master, there are two different methods of syncrepl - refreshOnly (pull-based) and refreshAndPersist (push-based). In the latter method, once the initial connection has been made, the server will push out changes to replica(s).
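
The difference between the two modes can be illustrated with a toy model. This is a sketch under simplifying assumptions, not the real syncrepl protocol: the "cookie" is just an index into an in-memory change log, and the Provider and Consumer classes are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Change = Tuple[str, str]  # (DN, new value); a simplified stand-in for an entry


class Provider:
    """Toy provider: keeps an ordered change log and notifies persistent consumers."""

    def __init__(self) -> None:
        self.log: List[Change] = []
        self.listeners: List[Callable[[Change], None]] = []

    def changes_since(self, cookie: int) -> Tuple[List[Change], int]:
        # Assumption: the cookie is simply a position in the change log.
        return self.log[cookie:], len(self.log)

    def write(self, dn: str, value: str) -> None:
        change = (dn, value)
        self.log.append(change)
        for notify in self.listeners:  # push to refreshAndPersist consumers
            notify(change)


class Consumer:
    def __init__(self, provider: Provider) -> None:
        self.provider = provider
        self.cookie = 0
        self.data: Dict[str, str] = {}

    def refresh(self) -> None:
        """refreshOnly: connect, pull everything since the last cookie, disconnect."""
        changes, self.cookie = self.provider.changes_since(self.cookie)
        for dn, value in changes:
            self.data[dn] = value

    def persist(self) -> None:
        """refreshAndPersist: refresh once, then stay registered for pushed changes."""
        self.refresh()
        self.provider.listeners.append(self._on_change)

    def _on_change(self, change: Change) -> None:
        dn, value = change
        self.data[dn] = value
        self.cookie += 1
```

In both cases it is the consumer that makes the first call to the provider. A consumer that only calls refresh() sees new changes at its next poll, whereas one that called persist() receives each change as soon as the provider writes it.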

I didn't investigate the oldest OpenLDAP replication method - slurpd - in this project. Its use has been deprecated in favour of syncrepl.

To test replication technologies, I used a small pool of DICE machines - typically 3 or 4 at any one time. In most cases these were set-up as one master and two replicas.

Note that in syncrepl, the master is referred to as the provider and the replica as the consumer. I will use these terms interchangeably with master and replica.

syncrepl requires an identity with which to bind to the server. All DICE machines already use an existing ldaprep/hostname Kerberos identity for replication, so this was ideal for use with syncrepl.

syncrepl

Initially I looked at the pull (refreshOnly) method of syncrepl. It took me some time to come up with a working server/replica configuration. The documentation in this area is a little thin, but does seem to be improving as syncrepl is more widely used. It should also be noted that the OpenLDAP mailing lists are widely used and the OpenLDAP developers are frequent contributors.

My investigations into refreshOnly syncrepl weren't entirely successful. Once I had replicas successfully syncing with the master (connecting every 10 minutes), I found that they would, at some point, fall out of sync in a fairly drastic way - i.e. deleting most of the entries in the local database, leaving the machine unusable. At the time, I was still fairly inexperienced in dealing with slapd configuration, particularly with respect to syncrepl, so it's possible that mis-configuration may have been to blame. I didn't spend too much time debugging these problems with refreshOnly syncrepl, as it seemed clear to me that the refreshAndPersist operation was a more attractive proposition. A basic summary of the differences between the two is:

Both methods of syncrepl support replica catch-up. For example, starting a replica with an empty database would lead to it fully replicating from the master. This process took approximately 18 minutes for our full database, so it's very unlikely we would use it in practice (we would instead load the database using slapadd, as we do currently when installing a client).

To test refreshAndPersist syncrepl, I used the same set-up as when testing refreshOnly - one master and two replicas. I set the master up to replicate (using slaprepl) from our site LDAP master regularly (every 5 minutes) so that the master and replicas received regular updates. This seemed to work well - as soon as the master received updates, it pushed these out to the replicas. I also ran a script daily that dumped the databases from master and replicas and compared them (using ldifdiff.pl).
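
The kind of daily comparison described above can be sketched as follows. This is not ldifdiff.pl, only a minimal illustration: it assumes each dump is plain LDIF with blank-line-separated records that begin with a "dn:" line, and it does not handle folded lines, comments or base64-encoded DNs. The function names are hypothetical.

```python
from typing import Dict, FrozenSet, Set, Tuple

Entries = Dict[str, FrozenSet[str]]


def parse_ldif(text: str) -> Entries:
    """Parse a simple LDIF dump into DN -> set of 'attr: value' lines."""
    entries: Entries = {}
    for record in text.strip().split("\n\n"):
        lines = [ln for ln in record.splitlines() if ln.strip()]
        if not lines or not lines[0].lower().startswith("dn:"):
            continue  # skip anything that is not an entry record
        dn = lines[0][3:].strip().lower()
        entries[dn] = frozenset(lines[1:])
    return entries


def compare(master: Entries, replica: Entries) -> Tuple[Set[str], Set[str], Set[str]]:
    """Return (missing on replica, extra on replica, differing) DNs."""
    missing = set(master) - set(replica)
    extra = set(replica) - set(master)
    differing = {dn for dn in set(master) & set(replica) if master[dn] != replica[dn]}
    return missing, extra, differing
```

Three empty sets from compare() would indicate that, under these assumptions, the replica's dump matches the master's.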

One issue that arose when testing syncrepl was that of credential expiry. The ldaprep/hostname principals have a default lifetime of 10 hours and replication would stop working after this period, even if the credentials were renewed. This is because sasl-gssapi offers no way to re-establish the security context without establishing a new connection. This basically means that slapd has to be restarted for replication to use the new/renewed credentials. Initially I was restarting slapd three times daily (with a 10-hour principal). I later increased the lifetime of the ldaprep principals to 24 hours, meaning that slapd needs to be restarted once daily on the replicas (I'm currently doing this at midnight). It would be possible to increase the lifetime of the principals further (e.g. to a week) to cut down the frequency of slapd restarts. There is some debate as to whether the behaviour of MIT Kerberos or OpenLDAP is in error in this instance (we believe it to be the latter) and there is little chance of this issue being addressed in OpenLDAP (unless we do it ourselves). Restarting slapd on a daily basis on the replicas isn't a problem or cause for concern, however.
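
The relationship between principal lifetime and restart frequency is simple arithmetic, sketched below. The function name is hypothetical, and the sketch assumes restarts are spaced evenly so that slapd never runs for longer than one credential lifetime.

```python
import math


def restarts_per_day(lifetime_hours: float) -> float:
    """Restarts needed per day so slapd never outlives its credentials.

    Lifetimes of up to a day need a whole number of restarts per day;
    longer lifetimes need less than one restart per day.
    """
    if lifetime_hours <= 0:
        raise ValueError("lifetime must be positive")
    if lifetime_hours <= 24:
        return float(math.ceil(24 / lifetime_hours))
    return 24 / lifetime_hours
```

With these assumptions, a 10-hour lifetime gives three restarts a day and a 24-hour lifetime gives one, matching the schedules described above; a one-week lifetime would need one restart every seven days.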

In my testing, refreshAndPersist syncrepl operation has been reliable and dependable. It has kept replicas in sync for long periods of time (months) and has shown itself able to recover from the master disappearing suddenly (e.g. the AT power outage on 25/01).

delta-syncrepl

I also looked at the newest replication method available for OpenLDAP - delta-syncrepl - this is an extension to syncrepl, which uses the accesslog overlay to track changes on the master, requiring an additional database. Use of the accesslog overlay means that, for modifications, only the changes are pushed to the replica (rather than the full record, as is the case in syncrepl). In testing, I found delta-syncrepl to also be reliable and dependable.

I also ran some tests comparing the network traffic between a syncrepl master and replica with that between a delta-syncrepl master and replica. For these tests I captured traffic between the master and replica - these machines were doing nothing else but acting as LDAP master/replica, so the only network traffic between the two was generated by LDAP replication. The results were surprising, given that the way in which delta-syncrepl works would appear to minimise the amount of network traffic. In several two-hour tests, I found that delta-syncrepl updates generated approximately 70% of the traffic generated by syncrepl (in a typical example 46K vs 65K for a two-hour period where there were 98 database changes). The surprises came when I captured network traffic for longer periods, i.e. for an 8-hour period. I did this twice and on both occasions delta-syncrepl generated more traffic than syncrepl, mainly caused by large spikes in activity over short periods of time. This issue would definitely warrant further investigation if we decided to use delta-syncrepl, but given that this is currently unlikely, I don't think it's worth spending more time on this.
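
The figures from the typical two-hour example can be checked with a few lines of arithmetic. This sketch assumes that the 46K and 65K totals cover the same 98 changes and keeps the unit "K" exactly as reported; the function name is hypothetical.

```python
def traffic_summary(delta_k: float, sync_k: float, changes: int) -> dict:
    """Compare delta-syncrepl and syncrepl traffic totals for one capture period."""
    return {
        "ratio": delta_k / sync_k,              # delta-syncrepl as a fraction of syncrepl
        "delta_per_change": delta_k / changes,  # average K per change, delta-syncrepl
        "sync_per_change": sync_k / changes,    # average K per change, syncrepl
    }


summary = traffic_summary(delta_k=46, sync_k=65, changes=98)
```

Here summary["ratio"] comes out at about 0.71, consistent with the "approximately 70%" quoted above, and both per-change averages are well under 1K.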

Conclusions

In conclusion, I would recommend using refreshAndPersist (i.e. push-mode) syncrepl for any master-slave replication. It worked reliably for me in testing, is lightweight in terms of resource usage on both master and replica, is highly configurable, is in active development, is reasonably well documented and is well supported on the OpenLDAP mailing list. delta-syncrepl was also reliable in testing, but is not as well supported or documented as syncrepl. If we were processing large numbers of updates, it might be worth reconsidering delta-syncrepl (although the network traffic oddities I outline above would require further investigation). However, our LDAP master typically receives between 50 and 100 changes per hour, which is fairly minimal and doesn't generate significant master-replica traffic for synchronization purposes.


Toby Blake, February 2007 (updated March 2008)

