This work is progressing; the expected end date is 29.8.14. Once it's finished we'll get moving again on the additional rack which was postponed last F/Y because of the building work.
For some time we have been aware that, as things stand, we can't put a second DHCP server on wire T (AT self-managed): it has dynamically-leased addresses on it, and multiple servers don't share lease state at all well. The original intention had been to phase in wire U (129.215.3/24) as a separate wire for dynamically-allocated addresses, but an analysis of the arpwatch output shows that this would be wasteful: only 80 dynamic addresses have been in use in the last three months or so, against only 19 static addresses (and many of those are network infrastructure things).
Instead, the plan is now to split wire R (199) into two /25 subnets: one for static addresses and one for dynamic addresses. These have been set up in advance, so that Support can arrange to move ports and users one by one. The static-address half has been given multiple DHCP servers for robustness (gatti as well as otaka), while the dynamic-address half will have just one (otaka), as wire T has now. In the switch configurations the static-address wire is labelled "R" and the dynamic-address wire "ATDHCP".
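In ISC dhcpd terms the split might look roughly like this (a sketch only: the 129.215.199 numbering, gateway addresses and pool range here are illustrative placeholders, not our actual configuration):

```
# Static-address half ("R"): fixed-address host entries only, no "range"
# statement, so there is no lease state and both otaka and gatti can
# serve it independently.
subnet 129.215.199.0 netmask 255.255.255.128 {
    option routers 129.215.199.1;       # placeholder gateway
    # host entries with fixed-address go here
}

# Dynamic-address half ("ATDHCP"): a leased pool, served by otaka only,
# since dynamic lease state can't usefully be shared between servers.
subnet 129.215.199.128 netmask 255.255.255.128 {
    option routers 129.215.199.129;     # placeholder gateway
    range 129.215.199.140 129.215.199.250;
}
```

The point of the split is visible in the config: only the half with a "range" statement accumulates lease state, so only it is restricted to a single server.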
When they're all migrated, we'll reclaim T for OpenVPN use (see below) alongside U.
(We do run multiple DHCP servers on other subnets where there are only statically-allocated addresses, and hence no lease state needing to be shared, so that those subnets don't have a single point of failure.)
Following Toby's experiments with OpenVPN and iOS, some changes in our OpenVPN configuration will be required so that we can make use of an in-line generated tls-auth key rather than our current freeform key. As these will be incompatible with existing user configurations, we'll have to set up parallel endpoints using the new configuration. We'll then give users some time to migrate to the new endpoints, before finally phasing out the current endpoints and reclaiming the address space. (Note that this will be on existing hardware, as several instances of the daemon can run in parallel using different configuration.)
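The change is roughly the difference between pointing at a key file and embedding the key in the profile itself (a sketch; the path and surrounding directives are illustrative, and the key body is elided):

```
# Old style: freeform static key kept in an external file.
#   tls-auth /etc/openvpn/ta.key 1
#
# New style: a key generated with "openvpn --genkey --secret ta.key",
# carried in-line in the configuration, with the direction given
# separately.  In-line keys travel better as a single profile file.
key-direction 1
<tls-auth>
-----BEGIN OpenVPN Static key V1-----
[key material elided]
-----END OpenVPN Static key V1-----
</tls-auth>
```

Because a client with the old external-file configuration can't authenticate against an endpoint expecting the new key, the two styles have to run side by side during the migration, hence the parallel endpoints.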
We'll also take the chance to rebalance our address space allocation so as to reduce churn, as the address-pool data for both the AT and Forum endpoints show more than 124 individual password-authenticated users for each. The suggestion is to make each of these a /24 (T and U for AT and the Forum respectively), while at the same time reducing the x509-authenticated space to /26 subnets (taken out of the top half of E; the bottom half will be used for external users, firewalled but not trusted).
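The sizing arithmetic behind that is simple enough to check (the subnet literal below is just an example; any /24, /25 or /26 behaves the same):

```python
import ipaddress

def usable_hosts(cidr: str) -> int:
    """Usable host addresses in a subnet (network and broadcast excluded)."""
    return ipaddress.ip_network(cidr).num_addresses - 2

# A /25 pool offers only 126 usable addresses, uncomfortably close to the
# >124 password-authenticated users seen per endpoint; a /24 gives real
# headroom, and a /26 is still ample for the smaller x509 population.
print(usable_hosts("129.215.3.0/25"))  # 126
print(usable_hosts("129.215.3.0/24"))  # 254
print(usable_hosts("129.215.3.0/26"))  # 62
```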
The DEV endpoint has already been converted to the new configuration, and the DR endpoint temporarily changed in an ad hoc way as only COs should be affected. Once this week's <stable> release goes out we'll be able to convert DR "properly", and set up the new-style Forum and AT-x509 configurations. The AT-pass configuration will have to wait for Wire T to be released from its current use (see above).
There'll be a blog article on all this in due course.
Unfortunately the changes for the above threw up a couple of buglets in the iptables scripts. These have now been fixed; apologies for any bogus nagios reports they may have caused. (The emails which might have alerted us sooner were, unfortunately, delayed by those same problems!)
tycho had an occurrence of the DL180 bonding problem with an interface on its add-on network card on Tuesday 26th, reported by nagios at 07:16. ifdown/ifup fixed it, but not immediately: the link didn't come back up, as reported in /proc/net/bonding/bond0, until a few minutes after the ifup had been done. There's nothing obvious in the syslog from around the time, and the switch didn't see the link bounce then either. (There are a couple of link downs-and-ups from a little later, corresponding with the ifdown/ifup.)
Interestingly, though, the link did bounce earlier (at 06:38:52), and the corresponding entry in the syslog for that says:
2014-08-26T06:38:44.925603+01:00 tycho kernel: WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x26b/0x280() (Tainted: P --------------- )
2014-08-26T06:38:44.925615+01:00 tycho kernel: Hardware name: ProLiant DL180 G6
2014-08-26T06:38:44.963312+01:00 tycho kernel: NETDEV WATCHDOG: eth2 (bnx2): transmit queue 0 timed out
...
2014-08-26T06:38:52.030266+01:00 tycho kernel: bnx2 0000:05:00.0: eth2: NIC Copper Link is Down
2014-08-26T06:38:52.120215+01:00 tycho kernel: bonding: bond0: link status definitely down for interface eth2, disabling it
So it appears that the link actually went down at 06:38, and there are link-down and link-up trap entries from the switch for then. It just wasn't reported to us by nagios until 07:16:18. See here for Ian's previous explanation.
(Comments: 1. That note refers to fibre channel, but the nagios effects as they relate to ethernet bonding are the same. 2. Since that note was written, the nagios configuration has been changed so that all alerts are now sent via both email and Jabber. The alert 'latency' (via either email or Jabber) in cases like this is therefore now about 30 minutes. -- Ian D.)
The full syslog extract can be found here, and an entry has been added to the BondingProblems page.
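For reference, a quick way of spotting a downed slave in that file is something like the following (a sketch assuming the usual /proc/net/bonding layout, where each "Slave Interface:" line is followed by that slave's own "MII Status:" line; not a script we actually run):

```shell
# check_bond: print any bonding slave whose "MII Status" is not "up".
# Skips the bond device's own MII Status line, which appears before the
# first "Slave Interface:" entry.
check_bond() {
    awk '
        /^Slave Interface:/ { slave = $3 }
        /^MII Status:/ && slave != "" {
            if ($3 != "up") printf "%s: %s\n", slave, $3
            slave = ""
        }
    ' "${1:-/proc/net/bonding/bond0}"
}
```

Run against the live file it prints nothing when all slaves are up, which makes it easy to wire into an ad hoc check; in the incident above it would have reported eth2 as down.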
Please contact us with any comments or corrections.
Unless explicitly stated otherwise, all material is copyright The University of Edinburgh