The Dice Project

Operational Meeting Infrastructure Unit Report
13th November 2019

  1. Forum server room cooling

    Temperatures have remained reasonably stable since we had some machines moved between B.01 and B.Z14, as reported last time. We will continue to monitor, and adjust the room loadings as required.

    We intend to run tests to measure the actual maximum power requirements (and, therefore, maximum heat production) of all recently-installed GPU servers (RT#98522). Armed with that information, we will liaise with Estates regarding useful changes or upgrades which we might make to the cooling systems in our server rooms.
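
    As a rough illustration of why the power figures matter: essentially all of the electrical power a server draws ends up as heat in the room, so a measured peak power converts directly into a heat load. The sketch below sums some per-server peak readings and expresses the total in kW and in BTU/h, a unit in which cooling capacity is often quoted. The host names and wattages are made up for the example and are not measurements.

```python
# 1 watt is approximately 3.412 BTU per hour.
BTU_PER_HOUR_PER_WATT = 3.412142


def heat_load(peak_watts_by_server):
    """Return (total kW, total BTU/h) for a mapping of server name -> peak watts."""
    total_watts = sum(peak_watts_by_server.values())
    return total_watts / 1000.0, total_watts * BTU_PER_HOUR_PER_WATT


# Hypothetical readings, for illustration only.
example_readings = {
    "gpu-server-1": 2400.0,
    "gpu-server-2": 2150.0,
    "gpu-server-3": 1900.0,
}

kilowatts, btu_per_hour = heat_load(example_readings)
print(f"{kilowatts:.2f} kW is roughly {btu_per_hour:.0f} BTU/h of heat")
```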

    In parallel, E&B will be making some water flow-rate measurements, likely later this week. We expect these to be non-intrusive (other than to lift some floor tiles), but will be on hand to monitor.

  2. Forum server room noise

    As a result of the continuing influx of power-hungry GPU servers, and the machine moves noted above that we have had to make to accommodate them, the Forum server rooms are now a seriously unpleasant (and indeed probably seriously unhealthy) place to work. The noise surveys which were done a while ago are now, of course, basically useless, and we're arranging to have them re-done. We'll then be able to use the data to make informed recommendations.

    As those of us who were CSCS-certified for Bayes will recall, it's the employer's responsibility under H&S legislation to provide appropriate PPE. In the server rooms, ear-protection devices are provided for your use.

  3. Junk etc. in our server rooms

    As noted in a recent meeting, we seem to be suffering a recurrence of junk etc. in our server rooms. Please:

    Specifically to the RAT unit: please take action to cull your empty cardboard boxes in both the Forum and AT server rooms, as well as all the stuff on the shelving in the Forum server room.

  4. AT electrical testing

    The AT electrical tests slipped a bit from the schedule, though there were just a handful of issues on "our" floors:

    The slippage was partly due to access issues, but mainly due to the poor state of the existing documentation and the discovery of circuits and boards that weren't expected but had to be tested. From that point of view alone, the EICR would appear to have been a Very Good Idea.

    As a result, the testing of the basement server room is expected to happen later this week. An "at risk" sys-announce message was sent out on Monday.

    Stop press: an updated programme was issued this morning, and can be found here.

  5. Wire and ipaddrs headers

    A first pass through all of these has been completed. There will be a need to revisit a couple of them to fix up some details which weren't done first time through.

    The "new" structure of the wire and related "live" headers is as follows:

    The wire headers continue to contain (only, ideally) those resources which have to be set in a particular way for machines on the subnet to operate properly. In practice, for most machines only the netmask for non-/24 subnets really matters. These headers then conditionally include one or both of the following:
    The site-specific nameserver_ files contain the addresses of nameservers which would be reasonable to configure into machines' resolv.conf files, and into the DHCP servers' subnet blocks. They have two sets of addresses, conditionally included, to allow for easy switching when nameservers are down for some reason. That is the main reason for adopting this structure rather than wiring the addresses in elsewhere.
    The site-specific router_ files contain the addresses of routers which would be reasonable to configure into machines' profiles where DHCP or RA/RDISC aren't available for some reason. They're also used to configure the DHCP servers' subnet blocks. Again, there are two sets of addresses, conditionally enabled to allow for easy switching as required.

    We do change sites' nameserver_ and router_ files independently of each other, so it does make sense to split them apart to minimise rebuilds. However, generally if we're changing router or nameserver for one wire at a site then we'll be changing it for (almost) all wires, so there's no point in any further breakdown, such as by managed/self-managed.

    The ipaddrs_*.h files are now obsolete and will be removed in due course to minimise confusion.

    It would make sense to add a recurrent deferred action to re-review these files annually.

  6. EdLAN

    We have an outline description document which is intended to summarise the way the new EdLAN is expected to look, and various "thinking" documents linked from this index page. Comments and questions are still invited.

    Changes since last time: some comments noting that we currently assume that our JCMB DR site's off-site routing happens at KB, and that we need to maintain this arrangement or explicitly change it.

    In particular, the "options" document now contains two recommendations:

    1. That we connect to the new EdLAN in "ISP" mode, at least initially, as we simply don't have time to do the necessary development work for any other way to be practicable; and
    2. That we continue to test and evaluate the "colourless ports" model and the EdLAN management tools, and develop against them, with a view to a potential move to the "buy-in" model at some future date.

    We re-ran the routing tests involving removing the wire-A interface from one of our AT test routers. It does look as though the effects on the default routes that we noted last time are due to the inter-EdLAN-router path costs, which appear to be higher than our own inter-edge-router costs, though there is the nagging suspicion that the way IS inject the default route may also be having some effect. We have asked IS what they currently use (and are trying to work it out based on the OSPF behaviour) and what they propose for the new EdLAN.

    If this is the cause, then we should be able to fix the problem by adjusting our own site base costs. Unfortunately this will have to be done in a number of steps, as the spread between the lowest and highest has to be less than 600 (as things are currently set up on our core routers), and even then might be at the mercy of EdLAN changes.
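
    To illustrate why this takes several steps, here is a minimal sketch of a greedy planner which changes one site's base cost at a time while keeping the spread between the lowest and highest cost strictly below the limit after every change. The site names and cost values are hypothetical, and the function is only an illustration of the constraint, not a description of how the changes would actually be rolled out.

```python
def plan_cost_steps(current, target, limit=600):
    """Return a list of (site, new_cost) changes leading from current to target.

    After every single change, max(costs) - min(costs) stays strictly below limit.
    Both mappings must have the same sites and use integer costs.
    """
    for name, costs in (("current", current), ("target", target)):
        if max(costs.values()) - min(costs.values()) >= limit:
            raise ValueError(f"{name} costs already break the spread limit")

    costs = dict(current)
    steps = []
    while costs != target:
        progressed = False
        for site in sorted(costs):
            if costs[site] == target[site]:
                continue
            # With a single site there is nothing to constrain against.
            others = [c for s, c in costs.items() if s != site] or [target[site]]
            lowest_allowed = max(others) - limit + 1
            highest_allowed = min(others) + limit - 1
            # Move as far towards the target as the spread limit allows.
            new_cost = min(max(target[site], lowest_allowed), highest_allowed)
            if new_cost != costs[site]:
                costs[site] = new_cost
                steps.append((site, new_cost))
                progressed = True
        if not progressed:
            raise RuntimeError("no single-site change makes progress")
    return steps


# Hypothetical base costs, for illustration only.
example_current = {"AT": 100, "Forum": 200, "KB": 300}
example_target = {"AT": 1100, "Forum": 1200, "KB": 1300}
for site, cost in plan_cost_steps(example_current, example_target):
    print(f"set {site} base cost to {cost}")
```

    Moving every site by 1000 in this example needs five separate changes rather than three, because no site can jump straight to its target without the spread reaching the limit.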

    The alternative would be to hit all of our edge routers simultaneously (or at least as nearly so as lcfg allows), removing the wire-A (aka E160) peerings from all of the AT routers and the AT2 (aka E42) peerings from all of the Forum and KB routers. Although appearing more dramatic and potentially glitch-provoking, it would actually be less error-prone to implement as it would just involve putting #ifdef tests around three blocks of resources in the common routing definition file, rather than making a series of changes both to lcfg headers and to the switches' (manual) configurations.

    (In the slightly less short term we need to come up with a plan for our JCMB DR site's routing.)

    A brainstorming discussion on the potential ways we might connect up to the new EdLAN has been put down as a topic for discussion. We might alternatively prefer to schedule this for its own separate meeting.

  7. dhcp-snooping on AT36

    We previously reported that we had had to enable dhcp-snooping for AT36 only for the ATL8 switches, as trying to enable it globally for the VLAN caused problems for the AV kit. One of the commercial tenants subsequently also had problems, so we have turned it off for ATL8 too now. This unfortunately leaves us vulnerable again to rogue DHCP servers on the wire, which was the reason we had enabled it in the first place. It appears that the switches don't like something which IS's DHCP servers are sending. We have reported this to IS, as it's likely that dhcp-snooping would be one of the things to be enabled on the new EdLAN.

  8. lcfg-x509/letsencrypt changes

    As mentioned previously, we have been adding support for the letsencrypt 'dns-01' challenge to the x509 component. Some information has been added to the X509LetsEncrypt wiki page.
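
    For anyone unfamiliar with the 'dns-01' challenge, the following sketch shows how the TXT record that has to be published is derived, following the ACME specification (RFC 8555). It is not taken from the x509 component; the function name and the token and thumbprint values are placeholders for illustration.

```python
import base64
import hashlib


def dns01_txt_record(domain, token, account_key_thumbprint):
    """Return (record name, TXT value) for an ACME dns-01 challenge.

    The value is the unpadded base64url encoding of the SHA-256 digest of the
    key authorization, which is "<token>.<account key thumbprint>".
    """
    key_authorization = f"{token}.{account_key_thumbprint}"
    digest = hashlib.sha256(key_authorization.encode("ascii")).digest()
    value = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return f"_acme-challenge.{domain}", value


# Placeholder inputs: a real token comes from the ACME server, and the
# thumbprint is computed from the account key.
name, value = dns01_txt_record("www.example.org", "example-token", "example-thumbprint")
print(f'{name}. IN TXT "{value}"')
```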

    We have also added support to lcfg-x509 for obtaining letsencrypt certificates with subject alternative names.

    The letsencrypt documentation in the man page lcfg-x509(8) has been improved.

    lcfg-x509-0.1.7-1 went through the stable release a couple of weeks ago. We'd appreciate any feedback.

  9. IPv6 for the self-managed wires

    (Prompted partly by Dave G's October 23rd response to Graeme, as forwarded to the cos list by Alastair...)

    Since we turned on IPv6 for the managed wires there has been a steady background trickle of folk asking when it'll be available for the self-managed wires too. We have held off so far for a couple of reasons:

    1. We don't yet have DHCPv6 in place, and following suggestions from Sam that some devices might not follow the RA settings, we've been wary about things not working properly.
    2. We don't want to open IPv6 firewall holes to machines on SM164 and SM197 which might not be ready for them.

    Given that there are now ISPs who offer first-class IPv6 alongside their (NATted) IPv4 service, it seems that the first of these might not be an issue in practice. We are therefore minded to enable IPv6 on the "DHCP" subnet, in the first instance to see how things turn out. All being well, we would then extend it to the SM164 and SM197 subnets, after warning the punters to check their access controls.

    (The "DHCP" subnet is routed by our Linux routers. As the routing has to be symmetric for the benefit of the iptables connection-tracking, we would arrange to send RAs only from the machine which was designated as the wire router at the time. The SMxxx wires are (currently) routed by our switches, which then speak OSPF to the edge routers, so ECMP on the wire wouldn't be a problem there.)

    Blog articles and emails would be produced, of course.

    Reminder: the documents we produced as part of the IPv6 project are here.

  10. "env" sites

    Does anyone still make use of our "env" sites (Forum, AT, KB)? Going by the access logs, it doesn't look as though there's any real use made of them at all. Would anyone mind if we closed the firewall holes? The full netmon sites would of course still be available over the VPN, or to those with registered static IP addresses (IPv4 or IPv6).

  11. RIPE atlas

    Reminder: because we host a RIPE atlas anchor and a probe, we are clocking up lots of credits. To date we haven't done much at all with them. It might be interesting to do so, such as to measure accessibility of our sites as seen by the other probes around the world. Suggestions welcome.

    And if anyone would like some credits, to try the system out, please give us a shout!
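
    As a starting point for suggestions, here is a sketch of the request body for a one-off ping measurement from probes worldwide. The field names are as we understand the RIPE Atlas v2 measurement-creation API and should be checked against its documentation before use; the target host name is a placeholder. The code only builds and prints the JSON. Actually submitting it would need an API key and would spend credits.

```python
import json


def one_off_ping_request(target, probes_requested=50, address_family=4):
    """Build the body for a one-off ping measurement from worldwide probes."""
    return {
        "definitions": [
            {
                "type": "ping",
                "af": address_family,  # 4 for IPv4, 6 for IPv6
                "target": target,
                "description": f"One-off ping to {target}",
            }
        ],
        # "WW" asks for probes selected from anywhere in the world.
        "probes": [{"type": "area", "value": "WW", "requested": probes_requested}],
        "is_oneoff": True,
    }


# Placeholder target, for illustration only.
print(json.dumps(one_off_ping_request("www.example.org"), indent=2))
```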

  12. Prometheus update

    A new version of prometheus (0.99.46-1) - both client and server - has been released. Please let us know if you experience any problems which may be related to this release.

inf-unit-report.html,v 1.66 2019/11/13 09:48:46 gdmr Exp


Unless explicitly stated otherwise, all material is copyright The University of Edinburgh