The Dice Project


361: Inf-unit SL7 server upgrade project — final report

This is the final report for the Inf-unit's SL7 server upgrade project. Our initial SL7 work, sufficient to get the desktop machines up and running, is described here. See also our sl7rt master ticket.

Work on this project divided naturally by service area. These are described in the sections below. A total effort figure is given at the end. Due to the way the work was divided among the Unit members it wouldn't be entirely straightforward or accurate to attempt to apportion it to individual categories.

Note that Prometheus and nut had their own separate sub-projects (#377 and #375 respectively), both of which have already completed and been signed off.

Network

Tracking meta-ticket.

This section covers all of the Unit's Linux network servers. Completing the upgrade of these was a rather drawn-out affair in elapsed-time terms, due to proceeding with caution and extensive soak-testing. Progress was reported through our Operational meeting reports. We had a few issues to contend with:

  1. systemd is fiddly, complicated and opaque. We do now appear to have all the necessary things being started, but we're not confident that they're being done in the right order, and the whole edifice feels distinctly fragile. This was all much cleaner and clearer in init-land.
  2. SL7 is even more bloated than SL6 was, so a VM configuration that was reasonable for SL6 proved inadequate for SL7. Having tried, abortively, to debug some early SL7 VM upgrades, we ended up simply doubling disc, memory and CPUs.
  3. It no longer seems to be possible to assign VMs' interfaces using MAC addresses. After much experimenting, it appears that eth0 corresponds to the host's br0, if the VM uses that, and that the remaining ethN interfaces follow the order in which they were created for the VM. Fortunately this does appear to be stable across reboots. It does, however, mean that we have had to change our convention, adopted to make site bootstrapping easier, that a network machine's "base" interface should be on the cross-site "transit" subnet, wire-B. We will have to bear that in mind when deciding whether a VM or a real machine is more appropriate for any particular use.
  4. We had to repartition everything, which meant we couldn't in general just preserve the /disk/home/<whatever> partitions across the reinstall, as we have been able to do over the last several upgrade cycles. The backup/restore/pullup, though documented, added time to the process, and made things just that bit more error-prone. In addition, the various network servers are all partitioned slightly differently, according to their function and the number of spindles they have, and it took several install cycles for a few of them to achieve a satisfactory new setup.
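
The interface ordering observed in item 3 above can be written down as a small sketch. It models only the behaviour described in this report and is not a libvirt API; the function name and the bridge names other than br0 are hypothetical.

```python
def predict_guest_interfaces(bridges_in_creation_order):
    """Predict which guest ethN ends up on which host bridge.

    Models the observed rule only: the interface attached to br0 (if the
    VM uses it) becomes eth0, and the remaining interfaces follow in the
    order in which they were created for the VM.
    """
    bridges = list(bridges_in_creation_order)
    ordered = []
    if "br0" in bridges:
        ordered.append("br0")
    # Everything else keeps its creation order.
    ordered.extend(b for b in bridges if b != "br0")
    return {f"eth{i}": bridge for i, bridge in enumerate(ordered)}


# Hypothetical VM created with three interfaces, br0 being the second one.
mapping = predict_guest_interfaces(["br5", "br0", "br7"])
```

Under these assumptions br0 lands on eth0 regardless of where it was created, which is why the "base" interface convention had to change for VMs.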

Fortunately, most network services had already been made to work as part of our previous SL7 project (DNS and iptables in particular, which accounted for a fair amount of the effort booked to that project). OpenVPN and snort did require a bit of work to adapt them to systemd's way of doing things (and for OpenVPN we were also waiting for upstream support to firm up), and the netmon web servers needed a little tweaking as a result of apacheconf changes. For the most part, though, packages just built and worked within the framework already developed, which was a considerable relief.

We started with the simplest and/or most expendable systems (development routers, test RADIUS servers), gradually worked up through more important systems (external nameservers, then routers and network-services servers), leaving the most critical (network infrastructure servers) to last.

Consoles

Tracking meta-ticket.

Upgrading the console service was fairly straightforward. As usual we built the most recent release of the conserver package from source, rather than using third-party RPMs. The upgrade also required the porting to SL7 of our DHCP service - which obviously had to be done at some time anyway. Some other general comments:

  1. The console servers were some of the first of our multi-homed machines to be upgraded to SL7, and we suffered from odd networking problems where certain interfaces appeared to become unresponsive. We're no longer able to replicate these problems, so we assume that something elsewhere eventually 'got fixed'.
  2. We are not yet using a pure systemd approach for conserver control.
  3. IPMI consoles (implemented via ipmitool) appear to be less stable than they were on SL6; there are apparently random segfaults.
  4. In respect of consoles for KVM guests, we are still suffering from serious memory leaks from virsh - we had hoped that that problem might have been fixed in SL7. The pragmatic fix is to configure huge amounts of swap, and to restart all consoles on a weekly basis.

Authentication

Tracking meta-ticket.

Kerberos

We build and run our own version of Kerberos on the KDCs. This is to avoid being surprised by unpleasant upstream upgrades, as has happened in the past. We had some issues in building packages for SL7, specifically the difficulty of building i686 versions of krb5-libs. It is likely that we will remove the i686 packages from the KDCs in the future.

As well as the kerberos packages themselves, we also had to build krb5-strength and its dependencies for the KDC master. At the time of the upgrade we were running a version of krb5-strength with our own custom patches (to add flexibility to specifications of character classes in the password policy). Most of these patches are now in the upstream release, so we should upgrade accordingly.

Another issue which affected the upgrade was the build-time decision made by kerberos when setting the default for the kerberos credentials cache. This was partly caused by an upstream, systemd-related, bug (which has since been fixed). It is something we need to be aware of for future upgrades.

We took advantage of the upgrade to SL7 to revamp, and simplify, the structure of the kerberos headers.

As we were planning the upgrade of the master KDC service (to a newly purchased machine), the existing KDC (bevan) developed a disk fault. This brought our plans forward somewhat, so we converted one of the existing SL7 slaves (heaton) into the KDC master and then installed the new machine (maytals) as a slave. It is possible that we may reverse these roles, but there is little need to do so.

As always, our TEST.INF.ED.AC.UK realm was invaluable in the upgrade process.

SIXKTS

The sixkts server upgrade to SL7 went smoothly - we had already built and tested sixkts on a test SL7 machine.

KX509

Our kx509 service has a very small number of users (6 in the last 3 months, 3 of them being COs). Porting the software from SL5 to SL6 was non-trivial (see the "KX509" section of our SL6 report) so there was a possibility we would cease to run this service if similar problems arose this time. Fortunately the process was smooth, so kx509 lives on.

Directory services

Tracking meta-ticket.

The most significant work in upgrading the LDAP servers to SL7 was a complete revamp of the openldap headers. These were originally written when all DICE machines ran their own LDAP servers and as such contained a lot of assumptions. In particular we needed to separate, as much as possible, client and server side configuration. As the structure of these headers was quite complex and intertwined, we had to be careful that we didn't break any existing machines (clients and servers) when making changes. We also had a requirement that machines should continue to be able to run (and use) their own local LDAP server (see 2.2. in the OpenLDAP: DICE client configuration project). The work on headers was tracked separately.

The upgrade to SL7 also involved replacing the hardware for the LDAP master and site slaves. We were reminded that exam preparation relies on <live/ldap-server-hostnames.h> being kept up to date.

We needed to consider IPv6 as part of this work. It generally just worked - ldapsearch prefers IPv6 over IPv4 where available, although sssd does not seem to use it. We should keep an eye on this situation, both from a client perspective and for server-side ACLs.
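
As a rough aid for keeping an eye on this, the sketch below orders a host's resolved addresses the way a client that prefers IPv6 would try them. It is an illustration only: the function names are hypothetical, port 389 is assumed, and it does not reproduce the actual address selection logic of ldapsearch or sssd.

```python
import socket


def order_candidates(addrinfos, prefer_ipv6=True):
    """Stable-sort getaddrinfo()-style tuples so the preferred family comes first."""
    preferred = socket.AF_INET6 if prefer_ipv6 else socket.AF_INET
    # False sorts before True, so entries of the preferred family lead.
    return sorted(addrinfos, key=lambda info: info[0] != preferred)


def ldap_candidates(host, port=389):
    """Resolve a (hypothetical) LDAP server name and return addresses in try-order."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    return [info[4][0] for info in order_candidates(infos)]
```

Calling ldap_candidates() on a server name and checking whether the first entry is an IPv6 address gives a quick indication of which server-side ACLs an IPv6-preferring client will hit first.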

On upgrading the master to SL7 we discovered a bug in cyrus-sasl which prevented some large maps being written - this particularly affects Prometheus's population of Capabilities and Netgroups maps. We have patched cyrus-sasl locally to fix this. A (long overdue) new version should be released shortly. Prior to recent development we had concerns over how actively maintained this software was.

Three "autofs" LDAP servers, which were set up as SL6, will be turned off when they are no longer required (by a significant number of machines), rather than upgraded, and their content will be merged into our main LDAP tree. The reason for having these is described elsewhere, and the timing for this final step is dependent on progress made by other Units in eliminating SL6, or at least accepting the risk of replication failures.

The merging of the ou=AutofsMaps branch will be combined with a wider revision of our LDAP tree structure, mainly in order to remove parts that are no longer used.

Account management

Tracking meta-ticket.

The upgrade of Prometheus to SL7 was tracked separately.

Monitoring (nagios)

Tracking meta-ticket.

Upgrading the Nagios service is always a large job: there are a lot of dependencies (not least, the 'translators' associated with every component), and the (complex!) framework code invariably needs amendment to cope with system changes. (One issue on this occasion concerned changes to our overall LDAP configuration, and the associated knock-on effects of that.)

In addition, the Nagios software itself went from v3.x to v4.x between SL6 and SL7; handling that required various changes.

Monitoring (UPSes)

The upgrade of nut to SL7 (and other non-SL7 nut work) was tracked separately.

LogHost

The upgrade of the loghost coincided with its scheduled hardware replacement. The new machine was installed directly as SL7, and logging services were then run in parallel on the old and new machines for a while, until we were confident that they were behaving the same on SL7 as on SL6. Finally, we stopped logging to the old SL6 server and turned it off (SL7RT#441).
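
One way such a parallel run could be checked is sketched below. This is not the procedure actually used: it assumes traditional syslog-format lines ("Oct 24 14:49:18 host program[pid]: message"), and the function names are hypothetical.

```python
import re
from collections import Counter

# Assumed traditional syslog layout: timestamp, sending host, program[pid]:
_LINE = re.compile(r"^\w{3}\s+\d+\s+[\d:]{8}\s+(\S+)\s+([^\s:\[]+)")


def count_by_source(lines):
    """Count log lines per (sending host, program) pair; unparsable lines are skipped."""
    counts = Counter()
    for line in lines:
        match = _LINE.match(line)
        if match:
            counts[match.groups()] += 1
    return counts


def differences(old_lines, new_lines):
    """Return {source: (old_count, new_count)} wherever the two loghosts disagree."""
    old, new = count_by_source(old_lines), count_by_source(new_lines)
    return {src: (old[src], new[src])
            for src in sorted(old.keys() | new.keys())
            if old[src] != new[src]}
```

An empty result for the same time window on both machines would suggest that the new loghost is receiving what the old one did.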

Informatics Forum touchscreens

Tracking meta-ticket.

On behalf of the User Support Unit, we arranged the port to SL7 of the Informatics Forum touchscreens system. The opportunity was taken to better modularize the design; specifically, to allow different web browsers to be easily dropped in. The system now uses Chrome, which has better/easier support for 'kiosk' mode than does Firefox.

Effort

The project was started in January 2016 and the final server was upgraded on 24th October 2017. The actual effort logged against this project was nearly 27 FTE weeks in total, or around 1/4 FTE averaged over the whole lifetime of the project.
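
The averaged figure can be checked with a little arithmetic. The sketch below assumes a start date of 1 January 2016 (the report gives only the month), rounds the effort up to 27 FTE weeks, and counts calendar weeks rather than working weeks.

```python
from datetime import date


def average_fte(effort_weeks, start, end):
    """Average FTE fraction: effort in FTE weeks divided by elapsed calendar weeks."""
    elapsed_weeks = (end - start).days / 7
    return effort_weeks / elapsed_weeks


# Assumption: the project start is taken as 1 January 2016.
fraction = average_fte(27, date(2016, 1, 1), date(2017, 10, 24))
```

Under these assumptions the result is a little over a quarter of an FTE, in line with the figure quoted above.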

For reference, the initial SL7 project took rather more than 7 FTE weeks of effort, as a result of DNS and iptables issues; the Prometheus project took around 6 weeks; and the nut project took 3 weeks (though not all of the latter's effort was for SL7).


FinalReport.html,v 1.44 2017/11/08 14:49:18 gdmr Exp



Unless explicitly stated otherwise, all material is copyright The University of Edinburgh