Temperatures have remained reasonably stable since we had some machines moved between B.01 and B.Z14, as reported last time. We will continue to monitor, and adjust the room loadings as required.
We intend to run tests to measure the actual maximum power requirements (and, therefore, maximum heat production) of all recently-installed GPU servers (RT#98522). Armed with that information, we will liaise with Estates regarding useful changes or upgrades which we might make to the cooling systems in our server rooms.
In parallel, E&B will be making some water flow-rate measurements, likely later this week. We expect these to be non-intrusive (other than to lift some floor tiles), but will be on hand to monitor.
As a result of the continuing influx of power-hungry GPU servers, and the machine moves noted above that we have had to make to accommodate them, the Forum server rooms are now a seriously unpleasant, and indeed probably seriously unhealthy, place to work. The noise surveys which were done a while ago are now, of course, basically useless, and we're arranging to have these re-done. We'll then be able to use the data to make informed recommendations.
As those of us who were CSCS-certified for Bayes will recall, it's the employer's responsibility under H&S legislation to provide appropriate PPE. In the server rooms, ear-protection devices are provided for your use.
As noted in a recent meeting, we seem to be suffering a recurrence of junk etc. in our server rooms. Please:
Specifically to the RAT unit: please take action to cull your empty cardboard boxes in both the Forum and AT server rooms, as well as all the stuff on the shelving in the Forum server room.
The AT electrical tests slipped a bit from the schedule, though there were just a handful of issues on "our" floors:
The slippage was partly due to access issues, but mainly due to the poor state of the existing documentation and the discovery of circuits and boards that weren't expected but had to be tested. From that point of view alone, the EICR would appear to have been a Very Good Idea.
As a result the testing of the basement server room is expected to happen later this week. An "at risk" sys-announce message was sent out on Monday.
Stop press: an updated programme was issued this morning, and can be found here.
A first pass through all of these has been completed. There will be a need to revisit a couple of them to fix up some details which weren't done first time through.
The "new" structure of the wire and related "live" headers is as follows:
resolv.conf files, and into the DHCP servers' subnet blocks. They have two sets of addresses, conditionally included, to allow for easy switching when nameservers are down for some reason, which is the main reason for adopting this structure rather than wiring the addresses in elsewhere.
We do change sites' files independently of each other, so it does make sense to split them apart to minimise rebuilds. However, generally if we're changing router or nameserver for one wire at a site then we'll be changing it for (almost) all wires, so there's no point in any further breakdown, such as by managed/self-managed.
ipaddrs_*.h files are now obsolete and will be removed in due course to minimise confusion.
It would make sense to add a recurrent deferred action to re-review these files annually.
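As an illustration only of the "two sets of addresses, conditionally included" idea described above, here is a minimal Python sketch. It is not how the live headers work: the function name, the `use_fallback` switch and the addresses (taken from the documentation ranges) are all assumptions made for the example.

```python
from typing import Sequence

# The glibc resolver reads at most three "nameserver" lines.
MAXNS = 3


def render_resolv_conf(
    normal: Sequence[str],
    fallback: Sequence[str],
    use_fallback: bool = False,
    search: Sequence[str] = (),
) -> str:
    """Render resolv.conf text from one of two nameserver address sets.

    Hypothetical helper: 'normal' and 'fallback' stand in for the two
    conditionally-included address sets; flipping 'use_fallback' is the
    single switch that moves every client to the other set.
    """
    servers = list(fallback if use_fallback else normal)
    if not servers:
        raise ValueError("selected nameserver set is empty")
    lines = []
    if search:
        lines.append("search " + " ".join(search))
    # Anything beyond MAXNS entries would be ignored by the resolver anyway.
    lines.extend(f"nameserver {addr}" for addr in servers[:MAXNS])
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Example addresses only (RFC 5737 documentation ranges).
    normal_set = ["192.0.2.1", "192.0.2.2"]
    fallback_set = ["198.51.100.1"]
    print(render_resolv_conf(normal_set, fallback_set, use_fallback=False))
    print(render_resolv_conf(normal_set, fallback_set, use_fallback=True))
```

The point of the sketch is that the addresses live in one place and the choice between the two sets is a single flag, rather than being wired in wherever they are consumed.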
We have an outline description document which is intended to summarise the way the new EdLAN is expected to look, and various "thinking" documents linked from this index page. Comments and questions are still invited.
Changes since last time: some comments noting that we currently assume that our JCMB DR site's off-site routing happens at KB, and that we need to maintain this arrangement or explicitly change it.
In particular, the "options" document now contains two recommendations:
We re-ran the routing tests involving removing the wire-A interface from one of our AT test routers. It does look as though the effects on the default routes that we noted last time are due to the inter-EdLAN-router path costs, which appear to be higher than our own inter-edge-router costs, though there is the nagging suspicion that the way IS inject the default route may also be having some effect. We have asked IS what they currently use (and are trying to work it out based on the OSPF behaviour) and what they propose for the new EdLAN.
If this is the cause, then we should be able to fix the problem by adjusting our own site base costs. Unfortunately this will have to be done in a number of steps, as the spread between the lowest and highest has to be less than 600 (as things are currently set up on our core routers), and even then might be at the mercy of EdLAN changes.
The alternative would be to hit all of our edge routers simultaneously (or at least as nearly so as lcfg allows), removing the wire-A (aka E160) peerings from all of the AT routers and the AT2 (aka E42) peerings from all of the Forum and KB routers. Although appearing more dramatic and potentially glitch-provoking, it would actually be less error-prone to implement as it would just involve putting around three blocks of resources in the common routing definition file, rather than making a series of changes both to lcfg headers and to the switches' (manual) configurations.
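To show why the stepwise option needs several passes, here is a minimal Python sketch of planning a sequence of base-cost changes so that the spread between the lowest and highest cost stays below 600 at every intermediate point. The site names and cost values are invented for the example, and the greedy one-site-at-a-time strategy is an assumption, not a description of how we would actually sequence the changes.

```python
from typing import Dict, List


def spread(costs: Dict[str, int]) -> int:
    """Difference between the highest and lowest base cost."""
    return max(costs.values()) - min(costs.values())


def plan_steps(
    current: Dict[str, int], target: Dict[str, int], limit: int = 600
) -> List[Dict[str, int]]:
    """Return successive cost tables, changing one site per step.

    Every table in the result (including the first and last) has a
    spread strictly below 'limit'. Greedy sketch: each site is moved as
    far towards its target as the other sites' current costs allow.
    """
    if current.keys() != target.keys():
        raise ValueError("current and target must cover the same sites")
    if spread(current) >= limit or spread(target) >= limit:
        raise ValueError("start and end tables must themselves be within the limit")

    state = dict(current)
    states = [dict(state)]
    while state != target:
        progressed = False
        for site in sorted(state):
            others = [v for s, v in state.items() if s != site]
            new = target[site]
            if others:
                # Clamp so that max - min stays at most limit - 1.
                new = max(new, max(others) - (limit - 1))
                new = min(new, min(others) + (limit - 1))
            if new != state[site]:
                state[site] = new
                states.append(dict(state))
                progressed = True
        if not progressed:
            raise RuntimeError("no single-site change makes progress")
    return states


if __name__ == "__main__":
    # Hypothetical costs, for illustration only.
    now = {"AT": 100, "Forum": 300, "KB": 500}
    wanted = {"AT": 1500, "Forum": 1700, "KB": 1900}
    for step, table in enumerate(plan_steps(now, wanted)):
        print(step, table, "spread:", spread(table))
```

Even in this toy case each site has to be touched more than once, which is the error-prone part of the stepwise approach compared with a single change to the common routing definitions.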
(In the slightly less short term we need to come up with a plan for our JCMB DR site's routing.)
A brainstorming session on the potential ways we might connect up to the new EdLAN has been put down as a topic for discussion. Alternatively, we might prefer to schedule this for a separate meeting of its own.
We previously reported that we had had to enable dhcp-snooping for AT36 only for the ATL8 switches, as trying to enable it globally for the VLAN caused problems for the AV kit. One of the commercial tenants subsequently also had problems, so we have turned it off for ATL8 too now. This unfortunately leaves us vulnerable again to rogue DHCP servers on the wire, which was the reason we had enabled it in the first place. It appears that the switches don't like something which IS's DHCP servers are sending. We have reported this to IS, as it's likely that dhcp-snooping would be one of the things to be enabled on the new EdLAN.
As mentioned previously, we have been adding support for the letsencrypt 'dns-01' challenge to the x509 component. Some information has been added to the X509LetsEncrypt wiki page.
We have also added support to lcfg-x509 for obtaining letsencrypt certificates with subject alternative names.
The letsencrypt documentation in the man page lcfg-x509(8) has been improved.
lcfg-x509-0.1.7-1 went through the stable release a couple of weeks ago. We'd appreciate any feedback.
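For anyone unfamiliar with the 'dns-01' challenge mentioned above: the ACME server expects a TXT record under `_acme-challenge.` for the name being validated, whose value is the unpadded base64url encoding of the SHA-256 digest of the key authorization (the challenge token, a dot, and the account key thumbprint). The Python sketch below only illustrates that calculation as defined by the ACME specification; it makes no claim about how lcfg-x509 implements it, and the function names and sample inputs are invented.

```python
import base64
import hashlib


def dns01_txt_value(token: str, account_thumbprint: str) -> str:
    """TXT record value for a dns-01 challenge.

    'token' comes from the ACME server's challenge object and
    'account_thumbprint' is the base64url JWK thumbprint of the account
    key; both are supplied by the caller in this sketch.
    """
    key_authorization = f"{token}.{account_thumbprint}"
    digest = hashlib.sha256(key_authorization.encode("ascii")).digest()
    # base64url without the trailing '=' padding.
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def dns01_record_name(domain: str) -> str:
    """Fully-qualified name of the TXT record for 'domain'.

    A wildcard name such as '*.example.org' is validated at the base
    domain, so the leading '*.' is dropped.
    """
    base = domain.rstrip(".")
    if base.startswith("*."):
        base = base[2:]
    return f"_acme-challenge.{base}."


if __name__ == "__main__":
    # Made-up token and thumbprint, purely to show the shape of the result.
    print(dns01_record_name("www.example.org"),
          "IN TXT",
          '"' + dns01_txt_value("example-token", "example-thumbprint") + '"')
```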
(Prompted partly by Dave G's October 23rd response to Graeme, as forwarded to the cos list by Alastair...)
Since we turned on IPv6 for the managed wires there has been a steady background trickle of folk asking when it'll be available for the self-managed wires too. We have held off so far for a couple of reasons:
Given that there are now ISPs who offer first-class IPv6 alongside their (NATted) IPv4 service, it seems that the first of these might not be an issue in practice. We are therefore minded to enable IPv6 on the "DHCP" subnet, in the first instance to see how things turn out. All being well, we would then extend it to the SM164 and SM197 subnets, after warning the punters to check their access controls.
(The "DHCP" subnet is routed by our Linux routers. As the routing has to be symmetric for the benefit of the iptables connection-tracking, we would arrange to send RAs only from the machine which was designated as the wire router at the time. The SMxxx wires are (currently) routed by our switches, which then speak OSPF to the edge routers, so ECMP on the wire wouldn't be a problem there.)
Blog articles and emails would be produced, of course.
Reminder: the documents we produced as part of the IPv6 project are here.
Does anyone still make use of our "env" sites (Forum, AT, KB)? Going by the access logs, it doesn't look as though there's any real use made of them at all. Would anyone mind if we closed the firewall holes? The full netmon sites would of course still be available over the VPN, or to those with registered static IP addresses (IPv4 or IPv6).
Reminder: because we host a RIPE atlas anchor and a probe, we are clocking up lots of credits. To date we haven't done much at all with them. It might be interesting to do so, such as to measure accessibility of our sites as seen by the other probes around the world. Suggestions welcome.
And if anyone would like some credits, to try the system out, please give us a shout!
A new version of prometheus (0.99.46-1) - both client and server - has been released. Please let us know if you experience any problems which may be related to this release.
inf-unit-report.html,v 1.66 2019/11/13 09:48:46 gdmr Exp
Please contact us with any comments or corrections.
Unless explicitly stated otherwise, all material is copyright The University of Edinburgh