The Dice Project


Operational Meeting: Infrastructure Unit Report -- 23rd November 2016

  1. AT basement power

    We were apparently too quick in assuming that the AT basement power issues were fixed. Paul Hutton writes: "Apologies for the mass mailing. During the recent work on the backup electrical power system at the Appleton Tower datacentre, a further problem was identified with the electrical switching components.

    "This fault means that, while everything will behave correctly in the event of a power failure, there remains a problem when power is restored. This would leave the datacentre running on battery power until the switchover back to mains could be carried out manually.

    "Estates have identified the faulty component and ordered a replacement. However, replacing it will require a complete power down of the datacentre, which is likely to last around 4 hours.

    "The intention is to carry out the work on a Sunday, between now and Christmas. At this stage, it would be helpful to know of any weekends where your division or department considers that the work could not take place.

    "I'll update with more information when it becomes available and, if it's helpful, am happy to arrange a meeting to discuss arrangements once more detail can be provided."

    Alastair has set up a doodle poll for availability to cover the outage.

  2. Core switch firmware

    A new version of the problematic 54xx firmware has appeared on HPE's site, but according to the release notes it doesn't appear to fix our issue.

  3. KB power

    There was a power cut at KB on Wednesday 16th. When power came back the rack0 UPS had some battery problems. As we didn't have a change-date logged for it, a new one has been delivered, swapped in and "recalibrated".

    Duncan Herd writes: "As you are all very aware, power was lost across most of the site yesterday afternoon. The issue that caused this event is now known -- it was related to the Combined Heat and Power (CHP) plant that we have on campus.

    "I have been asked to collate any issues or concerns that you have following yesterday's power outage. Undoubtedly there are some issues that would have been impossible to do anything about, but there may be some issues that could be avoided if the power goes again at some point in the future."

    Comments welcome! (We have already noted the lack of UPS provision for the new CSR PoP switch stack.)

    Following up on that, Duncan Herd also writes: "As you will be aware there was a major power fault last week. The cause of this fault has been identified as an issue with the Combined Heat and Power (CHP) plant. It is now PLANNED to bring the CHP plant back on stream on THURSDAY 24th November at 6am. There should be no issue with bringing this back on stream but it is advisable that we plan for a 'just in case'. Building managers have already been made aware of the plan. In the unlikely event of the power going off a team will be on campus to immediately switch it back on."

    Note that our own local UPS provision is really only intended to allow machines to shut down cleanly: on-battery run-time is only of the order of 10-15 minutes. The UPSes are set up to signal low-battery a few minutes before the battery is expected to run out, and will not restore power until they have recharged "enough". Specifically, they're now all set to: low-battery 5 minutes, minimum capacity after shutdown 15%. (This wasn't the case at the time of the power cut, as the UPSes had been recycled from previous use elsewhere and hadn't had their parameters adjusted.)

    We have also now fixed up the live/jcmb-server-room.h header to allow for direct monitoring of rack UPSes (as opposed to the indirect monitoring used in the Forum, where there are just too many machines to allow them all to poll the UPSes directly). To use this, #define one of JCMB_RACK0, JCMB_RACK1, JCMB_RACK2 or JCMB_RACK3 as appropriate, and then include the live header.

  4. Sequencing power-ons

    A reminder that the "switched" power bars have the ability to set turn-on delays individually for their outlets. By default outlets are set to turn on randomly between 60 and 80 seconds after power is restored. However, each "outlet" line can take an optional fourth parameter, which can take one of the following values:

    (It's also possible to set outlets to be normally-off but to turn on when power is restored. See fpdu/s2A.outlets for an example. There probably aren't many applications for this...)

  5. JCMB water leak

    There was another water leak upstairs from the JCMB CSR on Tuesday 15th, but this time it doesn't appear to have come through.

    Donald Grigor writes: "As follow up and clarification post yesterday's water leak alert I'd like to confirm the following: The water leak was not in the room with the autoclave but somewhat west of that location and very unlikely to involve the CSR at all. The leak was attended to quickly by E&B and arrested before significant damage had occurred. Subsequent permanent repairs will be conducted in a more orderly fashion and it is not likely to result in any water escape that will affect the CSR. Separate visual checks by both myself and Stuart Taylor of Computational Biology have confirmed there has (to date) been no ingress of water from this incident. Many thanks."

    Ian also checked and found everything clear.

  6. JCMB CSR PoP switches

    The new CSR PoP switches will be brought into service on Tuesday 29th. The necessary components for our own connection are being procured, and once they arrive we'll arrange to move our own links across. As noted last time, the intention is to connect one 10GbaseSR bridged link direct from cs0 (with a hot-spare port in cs1) to the new PoP stack, and two or three routed 1000baseT links from the network servers, also to the new PoP stack. We'll arrange that the links don't all terminate on the same PoP switch, for resilience. This will provide continued routing of .216 in the event that any one of our own switches or the PoP switches goes down, without requiring someone to go out to KB, as at present, to move the hot-spare link across to our other switch. (In practice, the only time we have actually lost connectivity is when rebooting a switch for a firmware upgrade, but as we have all the necessary components for the additional routed links it makes sense to add them for our own convenience.)

  7. Server reboots

    norrington (AT network services) and rattle (Forum network services) were the two remaining user-visible machines still to be rebooted for their kernel updates. They were done on Tuesday 15th and Wednesday 16th respectively at 13:00.

    norrington was afflicted with the console-output problem, whereby hitting return would result in a few more characters of rsyslog output being printed rather than the expected login or password prompt. The attempt to reboot proceeded very slowly, with nothing much apparently happening after stopping the first component (ntp). However, using om from a connected shell to stop individual components appeared to work normally. Another reboot command from that shell resulted in an immediate reboot, though not a clean one as fsck did recover a couple of journal files and cleared an orphaned inode. Thereafter things did appear to proceed as normal, with the usual second reboot for initramfs and then the machine coming up fully.

    rattle rebooted cleanly.

  8. Central firewall

    We previously reported a few issues which were traced back to the University's central firewall. See RT#77441 and RT#77317 for details. IS have now responded. The email Alastair forwarded to cos on Thursday 17th contains the full details. In summary:

  9. wire_*.h headers

    A pass has been made through all of these on paper to identify the necessary changes, based on the list from last time. These will be implemented over the next few weeks, bearing in mind that changes to some will provoke large profile rebuilds.

  10. addrwatch

    addrwatch data are now being incorporated by the address-search tool, as used by the various netmon address search boxes and the nightly DHCP no-lease reports.

    One interesting point this has brought out is that while we do have mechanisms in place to prevent machines which are connected to a "wrong" VLAN from obtaining an IPv4 address, the same is not true for IPv6 addresses. As we are including prefix and DNS information in the RAs on our DICE wires, machines which connect to those will be able to construct IPv6 addresses which will work perfectly well for many purposes. If we consider this to be something which needs to be fixed, the options would seem to be enforcing DHCPv6 and corresponding ND restrictions, in a similar manner to the ARP protection we apply to some IPv4 subnets; or else 802.1X everywhere. For further study.
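
    As an illustration of the point above, here is a minimal sketch of how a host could construct a working address purely from an advertised prefix. It assumes the classic modified-EUI-64 interface identifier; many current systems use privacy or stable-privacy identifiers instead, but the principle is the same. The function name, the prefix (taken from the 2001:db8::/32 documentation range) and the MAC address are all hypothetical and are not taken from our configuration.

```python
import ipaddress


def slaac_eui64_address(prefix: str, mac: str) -> ipaddress.IPv6Address:
    """Combine a /64 prefix from an RA with a modified-EUI-64 interface ID."""
    network = ipaddress.IPv6Network(prefix)
    if network.prefixlen != 64:
        raise ValueError("EUI-64 interface IDs need a /64 prefix")

    octets = bytearray(int(part, 16) for part in mac.split(":"))
    if len(octets) != 6:
        raise ValueError("expected a 48-bit MAC address")

    # Flip the universal/local bit, then insert ff:fe in the middle
    # to stretch the 48-bit MAC to a 64-bit interface identifier.
    octets[0] ^= 0x02
    interface_id = bytes(octets[:3]) + b"\xff\xfe" + bytes(octets[3:])

    return network.network_address + int.from_bytes(interface_id, "big")


# Hypothetical example values (documentation prefix, made-up MAC).
print(slaac_eui64_address("2001:db8:0:1::/64", "00:16:3e:12:34:56"))
```

    Nothing in this calculation involves DHCP, which is why the existing IPv4 lease-based controls have no effect on it.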

  11. OpenVPN

    Version 2.4 has moved from alpha to beta1. There are sufficiently many useful new things in this version that we'll be rolling it out to the DEV (and perhaps DR) endpoints soon. In particular, IPv6 support is now pretty much up to the same level as IPv4.

  12. SL7 iptables

    There's currently a quirk in the SL7 iptables setup whereby the very first Configure() that's normally done on SL6 by a separate boot-time script doesn't get done, so the rules don't get generated. The workaround is just to do the

       om iptables configure
    
    by hand. Once the rules are actually there, the scripts will do the right thing.

  13. Console servers

    As mentioned in our report to the previous meeting we've definitely got a memory leak somewhere in our console server setup.

    Looking at ifconsoles.inf, the leak is connected with the serial consoles for KVM guests: the associated virsh processes (or subprocesses spawned by those) appear to be responsible.

    In total, we are leaking memory at a peak rate of about 0.25GB/hr, which is 6GB/day, or 42GB/week - so we have configured total swap space of 48GB on ifconsoles.inf. That, along with the weekly restart of conserver, should keep things working.

    If we can find a better fix, we'll implement it.
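
    The swap sizing above can be checked with a few lines of arithmetic. This is only a sketch: it assumes the leak stays at the quoted peak rate and that all leaked memory has to fit in swap, ignoring physical RAM. The variable and function names are illustrative.

```python
LEAK_RATE_GB_PER_HOUR = 0.25  # peak rate quoted in the report
SWAP_GB = 48                  # swap configured on ifconsoles.inf
RESTART_INTERVAL_DAYS = 7     # conserver is restarted weekly


def leaked_gb(rate_gb_per_hour: float, days: float) -> float:
    """Memory leaked over the given number of days at a constant rate."""
    return rate_gb_per_hour * 24 * days


per_day = leaked_gb(LEAK_RATE_GB_PER_HOUR, 1)
per_week = leaked_gb(LEAK_RATE_GB_PER_HOUR, RESTART_INTERVAL_DAYS)
headroom = SWAP_GB - per_week                     # spare swap at restart time
hours_to_fill = SWAP_GB / LEAK_RATE_GB_PER_HOUR   # if no restart happened

print(f"leak per day:      {per_day} GB")
print(f"leak per week:     {per_week} GB")
print(f"headroom at 48 GB: {headroom} GB")
print(f"time to fill swap: {hours_to_fill / 24} days")
```

    On these assumptions a missed weekly restart leaves about one further day before swap is exhausted.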

  14. SL 7 servers

    We have brought the following SL7 servers into service:



Unless explicitly stated otherwise, all material is copyright The University of Edinburgh