This is the final report for the Inf-unit's SL7 server upgrade project. Our initial SL7 work, sufficient to get the desktop machines up and running, is described here. See also our sl7rt master ticket.
Work on this project divided naturally by service area. These areas are described in the sections below, and a total effort figure is given at the end. Because of the way the work was divided among the Unit members, it would not be entirely straightforward or accurate to attempt to apportion the effort to individual categories.
Note that Prometheus and nut had their own separate sub-projects (#377 and #375 respectively), both of which have already been completed and signed off.
This section covers all of the Unit's Linux network servers. Completing the upgrade of these was a rather drawn-out affair in elapsed-time terms, due to proceeding with caution and extensive soak-testing. Progress was reported through our Operational meeting reports. We had a few issues to contend with:
- systemd is fiddly, complicated and opaque. We do now appear to have all the necessary things being started, but we're not confident that they're being done in the right order, and the whole edifice feels distinctly fragile. This was all much cleaner and clearer in init-land.
- VM network interface naming: eth0 corresponds with the host's br0, if the VM uses that, and the rest of the ethN interfaces are in the order they are created for the VM. Fortunately this does appear to be stable across reboots. It does, however, mean that we have had to change our convention that a network machine's "base" interface should be on the cross-site "transit" subnet, wire-B, to make site bootstrapping easier, and we will have to bear that in mind when deciding whether a VM or a real machine is more appropriate for any particular use.
- It was not possible to preserve /disk/home/<whatever> partitions across the reinstall, as we have been able to do over the last several upgrade cycles. The backup/restore/pullup procedure, though documented, added time to the process, and made things just that bit more error-prone. In addition, the various network servers are all partitioned slightly differently, according to their function and the number of spindles they have, and it took several install cycles for a few of them to achieve a satisfactory new setup.
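Where unit start-up ordering is in doubt, systemd's own introspection tools at least give a partial picture. A typical check might look like the following (the unit name is purely illustrative):

```shell
# Show the chain of units that delayed a given unit's start-up;
# sshd.service here is just an example target.
systemd-analyze critical-chain sshd.service

# List what systemd believes must start before and after a unit,
# for comparison against the ordering we actually need.
systemctl list-dependencies --after sshd.service
systemctl list-dependencies --before sshd.service
```

This is host-inspection only; it reports systemd's view of the ordering, not whether that ordering is actually correct for the services concerned.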
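The host-side view of a VM's interfaces can be checked with virsh, which lists each interface in the order it was defined together with the host bridge it attaches to. A sketch (the domain name is a placeholder):

```shell
# List a guest's interfaces in definition order, with the host
# bridge each one attaches to ("myvm" is a placeholder name).
virsh domiflist myvm
```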
Fortunately, most network services had already been made to work as
part of our previous SL7 project (DNS and iptables in particular,
which accounted for a fair amount of the effort booked to that project).
OpenVPN and snort did require a bit of work to adapt them to
systemd's way of doing things (and for OpenVPN we were
also waiting for upstream support to firm up), and there was a little
tweaking to the netmon web servers as a result of apacheconf changes, but
for the most part packages just built and worked within the framework
already developed, which was a considerable relief.
We started with the simplest and/or most expendable systems (development routers, test RADIUS servers), gradually worked up through more important systems (external nameservers, then routers and network-services servers), and left the most critical (network infrastructure servers) until last.
Upgrading the console service was fairly straightforward. As usual we built
the most recent release of the
conserver package from source, rather
than using third-party RPMs. The upgrade also required the porting to SL7 of our
DHCP service - which obviously had to be done at some time anyway.
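For the record, building conserver from a release tarball follows the usual autoconf pattern. A sketch (the version number and configure options are illustrative, not a record of our exact build):

```shell
# Fetch, configure and build conserver from source.
# Version and options here are placeholders.
tar xzf conserver-8.2.x.tar.gz
cd conserver-8.2.x
./configure --prefix=/usr
make
make install
```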
Some other general comments:
- Consoles accessed via ipmitool appear to be less stable than they were on SL6; there are apparently random segfaults.
- Consoles accessed via virsh still leak memory over time - we had hoped that that problem might have been fixed in SL7. The pragmatic fix is to configure huge amounts of swap, and to restart all consoles on a weekly basis.
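The weekly restart can be automated with a simple cron entry. A sketch, assuming the consoles are managed by a systemd unit called conserver.service (the filename and timing are illustrative):

```
# /etc/cron.d/console-restart (illustrative): restart the console
# service early on Monday mornings to work around the problem above.
0 6 * * 1  root  /usr/bin/systemctl restart conserver.service
```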
We build and run our own version of Kerberos on the KDCs. This is
to avoid unpleasant upstream upgrades surprising us, as has happened
in the past. We had some issues in building packages for SL7,
specifically the difficulty in building i686 versions of
krb5-libs. It is likely that we will remove this from
the KDCs in the future.
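Building the i686 packages on an x86_64 host is typically done by rebuilding the source RPM with an explicit target. A sketch (the source RPM filename is a placeholder):

```shell
# Rebuild the krb5 source RPM for i686 on an x86_64 builder;
# setarch makes the build scripts see an i686 personality.
setarch i686 rpmbuild --rebuild --target i686 krb5-1.x-1.src.rpm
```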
As well as the kerberos packages themselves, we also had to build
krb5-strength and its dependencies for the KDC master.
At the time of the upgrade we were running a version of
krb5-strength with our own custom patches (to add
flexibility to specifications of character classes in the password
policy). Most of these patches are now in the upstream release, so we
should upgrade accordingly.
Another issue which affected the upgrade was the build-time decision made in Kerberos about the default location of the credentials cache. This was partly caused by an upstream, systemd-related, bug (which has since been fixed). It is something we need to be aware of for future upgrades.
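The default in question can be pinned explicitly in krb5.conf rather than left to the build-time choice. A minimal fragment (the FILE: setting shown is one possible choice, not necessarily the one we deployed):

```
[libdefaults]
    # Pin the credentials cache location explicitly, rather than
    # relying on the build-time default (which on SL7 may be the
    # kernel keyring, KEYRING:persistent:%{uid}).
    default_ccache_name = FILE:/tmp/krb5cc_%{uid}
```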
We took advantage of the upgrade to SL7 to revamp, and simplify, the structure of the kerberos headers.
As we were planning the upgrade of the master KDC service (to a
newly purchased machine), the existing KDC
developed a disk fault. This brought our plans forward somewhat, so
we converted one of the existing SL7 slaves to be
the KDC master and then installed the new machine
(maytals) as a slave. It is possible that we may reverse
these roles, but there is little need to do so.
As always, our TEST.INF.ED.AC.UK realm was invaluable in the upgrade process.
The sixkts server upgrade to SL7 went smoothly - we
had already built and tested
sixkts on a test SL7 machine.
The kx509 service has a very small number of users (6
in the last 3 months, 3 of them being COs). Porting the software from
SL5 to SL6 was non-trivial (see the "KX509" section of our
SL6 report) so there was a possibility we would cease to run this
service if similar problems arose this time. Fortunately the process
was smooth, so
kx509 lives on.
The most significant work in upgrading the LDAP servers to SL7 was a complete revamp of the openldap headers. These were originally written when all DICE machines ran their own LDAP servers and as such contained a lot of assumptions. In particular we needed to separate, as much as possible, client and server side configuration. As the structure of these headers was quite complex and intertwined, we had to be careful that we didn't break any existing machines (clients and servers) when making changes. We also had a requirement that machines should continue to be able to run (and use) their own local LDAP server (see 2.2. in the OpenLDAP: DICE client configuration project). The work on headers was tracked separately.
The upgrade to SL7 also involved replacing the hardware for the
LDAP master and site slaves. We were reminded that exam preparation
depends on <live/ldap-server-hostnames.h> being kept
up to date.
We needed to consider IPv6 as part of this work. It generally just worked - ldapsearch prefers IPv6 over IPv4 where available, although sssd does not seem to use it. We should keep an eye on this situation, both from a client perspective and for server-side ACLs.
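Which address family actually gets used can be checked by querying the same server over each explicitly. A sketch, with placeholder addresses and base DN:

```shell
# Query the same LDAP server over IPv4 and IPv6 explicitly
# (addresses and base DN are placeholders).
ldapsearch -x -H ldap://192.0.2.10 -b dc=example,dc=org -s base
ldapsearch -x -H 'ldap://[2001:db8::10]' -b dc=example,dc=org -s base
```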
On upgrading the master to SL7 we discovered a bug in
cyrus-sasl which prevented some large maps being written
- this particularly affects Prometheus's population of Capabilities
and Netgroups maps. We have patched cyrus-sasl
to fix this. A (long overdue) new version should be released shortly. Prior to recent development we had concerns
over how actively maintained this software was.
Three "autofs" LDAP servers, which were set up as SL6, will be turned off when they are no longer required (by a significant number of machines), rather than upgraded, and their content will be merged into our main LDAP tree. The reason for having these is described elsewhere, and the timing for this final step is dependent on progress made by other Units in eliminating SL6, or at least accepting the risk of replication failures.
The merging of the
ou=AutofsMaps branch will be
combined with a wider revision of our LDAP tree structure, mainly in
order to remove parts that are no longer used.
The upgrade of Prometheus to SL7 was tracked separately.
Upgrading the Nagios service is always a large job: there are a lot of dependencies (not least, the 'translators' associated with every component), and the (complex!) framework code invariably needs amendment to cope with system changes. (One issue on this occasion concerned changes to our overall LDAP configuration, and the associated knock-on effects of that.)
In addition, the Nagios software itself went from v3.x to v4.x between SL6 and SL7; handling that required various changes.
The upgrade of the loghost actually coincided with its scheduled hardware replacement. The new machine was installed directly as SL7, and logging services then run in parallel on the old and new machines for a while until we were confident that they were behaving similarly on SL7 as on SL6. Finally, we stopped logging to the old SL6 server and turned it off (SL7RT#441).
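Running the old and new loghosts in parallel only requires clients to forward to both. Assuming rsyslog on the clients, a fragment like this would do it (hostnames and port are placeholders):

```
# Forward everything to both the old and the new loghost over TCP
# while the two are being compared (hostnames are illustrative).
*.*  @@old-loghost.example.org:514
*.*  @@new-loghost.example.org:514
```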
On behalf of the User Support Unit, we arranged the port to SL7 of the Informatics Forum touchscreens system. The opportunity was taken to better modularize the design; specifically, to allow different web browsers to be easily dropped in. The system now uses Chrome, which has better/easier support for 'kiosk' mode than does Firefox.
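Chrome's kiosk support amounts to a couple of command-line flags, which is part of what makes it easy to drop in. A minimal invocation (the URL is a placeholder) looks like:

```shell
# Start Chrome full-screen with browser chrome suppressed;
# --noerrdialogs stops error pop-ups from breaking the display.
google-chrome --kiosk --noerrdialogs --incognito http://touchscreen.example.org/
```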
The project was started in January 2016 and the final server was upgraded on 24th October 2017. The actual effort logged against this project was nearly 27 FTE weeks in total, or around 1/4 FTE averaged over the whole lifetime of the project.
For reference, the initial SL7 project took rather more than 7 FTE weeks of
effort, as a result of DNS and iptables issues, the Prometheus project took
around 6 weeks,
and the nut project 3 weeks (though not all of the latter effort
was for SL7).
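As a quick sanity check on the averaged figure: the project ran for roughly 94 elapsed weeks (January 2016 to late October 2017), so 27 FTE weeks of logged effort corresponds to:

```shell
# ~27 FTE weeks of effort over ~94 elapsed weeks is just under 0.3 FTE,
# consistent with the "around 1/4 FTE" figure quoted above.
awk 'BEGIN { printf "average FTE: %.2f\n", 27 / 94 }'
```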
Unless explicitly stated otherwise, all material is copyright The University of Edinburgh