We were apparently too quick in assuming that the AT basement power issues were fixed. Paul Hutton writes: "Apologies for the mass mailing. During the recent work on the backup electrical power system at the Appleton Tower datacentre, a further problem was identified with the electrical switching components.
"This fault means that, while the everything will behave correctly in the event of a power failure, there remains a problem when power is restored. This would leave the datacentre running on battery power until the switchover back to mains could be carried out manually.
"Estates have identified the faulty component and ordered a replacement. However, replacing it will require a complete power down of the datacentre, which is likely to last around 4 hours.
"The intention is to carry out the work on a Sunday, between now and Christmas. At this stage, it would be helpful to know of any weekends where your division or department considers that the work could not take place.
"I'll update with more information when it becomes available and, if it's helpful, am happy to arrange a meeting to discuss arrangements once more detail can be provided."
Alastair has set up a doodle poll for availability to cover the outage.
A new version of the problematic 54xx firmware has appeared on HPE's site, but according to the release notes it doesn't appear to fix our issue.
There was a power cut at KB on Wednesday 16th. When power came back the rack0 UPS had some battery problems. As we didn't have a change-date logged for it, a new one has been delivered, swapped in and "recalibrated".
Duncan Herd writes: "As you are all very aware, power was lost across most of the site yesterday afternoon. The issue that caused this event is now known -- it was related to the Combined Heat and Power (CHP) plant that we have on campus.
"I have been asked to collate any issues or concerns that you have following yesterday's power outage. Undoubtedly there are some issues that would have been impossible to do anything about, but there may be some issues that could be avoided if the power goes again at some point in the future."
Comments welcome! (We have already noted the lack of UPS provision for the new CSR PoP switch stack.)
Following up on that, Duncan Herd also writes: "As you will be aware there was a major power fault last week. The cause of this fault has been identified as an issue with the Combined Heat and Power (CHP) plant. It is now PLANNED to bring the CHP plant back on stream on THURSDAY 24th November at 6am. There should be no issue with bringing this back on stream but it is advisable that we plan for a 'just in case'. Building managers have already been made aware of the plan. In the unlikely event of the power going off a team will be on campus to immediately switch it back on."
Note that our own local UPS provision is really only intended to allow machines to shut down cleanly. On-battery run-time is only on the order of 10-15 minutes. Note that the UPSes are set up to signal low-battery a few minutes before the battery is expected to run out, and will not restore power until they have recharged "enough". Specifically, they're now all set to: low-battery 5 minutes, minimum capacity after shutdown 15%. (This wasn't the case at the time of the power cut, as the UPSes had been recycled from previous use elsewhere and hadn't had their parameters adjusted.)
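We manage these settings through our own configuration, but purely as an illustration: if a rack UPS were instead managed via Network UPS Tools, the equivalent thresholds could be set with upsrw (the UPS name, credentials and host below are invented):

```
# Illustration only: NUT's upsrw, with a made-up UPS name and credentials.
# Signal low-battery at 5 minutes of runtime remaining:
upsrw -s battery.runtime.low=300 -u admin -p secret rackups@localhost
# Don't restore output until the battery has recharged to 15%:
upsrw -s battery.charge.restart=15 -u admin -p secret rackups@localhost
```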
We have also now fixed up the header to allow for direct monitoring of
rack UPSes (as opposed to the indirect monitoring used in the Forum, where
there are just too many machines to allow them all to poll the UPSes
directly). To use this, #define one of the macros as appropriate, and
then include the live header.
A reminder that the "switched" power bars have the ability to set turn-on delays individually for their outlets. By default outlets are set to turn on randomly between 60 and 80 seconds after power is restored. However, each "outlet" line can take an optional fourth parameter, which can take one of the following values:
middle" is the default: a random time from 60 to 80 seconds from power on
immediate" randomly selects either 0 or 1 second from power-on
last" randomly selects a time from 180 to 240 seconds after power-on
(It's also possible to set outlets to be normally-off but to turn on when power is restored. See fpdu/s2A.outlets for an example. There probably aren't many applications for this...)
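For reference, a hypothetical outlets fragment might look like this (the outlet numbers, names and overall field layout here are invented for illustration -- check an existing .outlets file, such as fpdu/s2A.outlets, for the real syntax):

```
# Illustration only -- field layout is made up; see a real .outlets file.
outlet 1 console-server on immediate   # up within 0-1s of power-on
outlet 2 switch-stack   on middle      # default: 60-80s stagger
outlet 3 compute-node   on last        # waits 180-240s for the rest
```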
There was another water leak upstairs from the JCMB CSR on Tuesday 15th, but this time it doesn't appear to have come through.
Donald Grigor writes: "As follow-up and clarification post yesterday's water leak alert I'd like to confirm the following: The water leak was not in the room with the autoclave but somewhat west of that location and very unlikely to involve the CSR at all. The leak was attended to quickly by E&B and arrested before significant damage had occurred. Subsequent permanent repairs will be conducted in a more orderly fashion and it is not likely to result in any water escape that will affect the CSR. Separate visual checks by both myself and Stuart Taylor of Computational Biology have confirmed there has (to date) been no ingress of water from this incident. Many thanks."
Ian also checked and found everything clear.
The new CSR PoP switches will be brought into service shortly.
The necessary components for our own connection are being
procured, and once they arrive we'll arrange to move our own links
across. As noted last time,
the intention is to connect one 10GbaseSR bridged link direct from
cs0 (with a hot-spare port in
cs1) to the new
PoP stack, and two or three routed 1000baseT links from the network
servers, also to the new PoP stack. We'll arrange that the links don't
all terminate on the same PoP switch, for resilience. This will provide
continued routing of .216 in the event that any one of our own switches
or the PoP switches goes down, without requiring someone to go out to KB,
as at present to move the hot-spare link across to our other switch.
(In practice, the only time we have actually lost connectivity is when
rebooting a switch for a firmware upgrade, but as we have all the necessary
components for the additional routed links it makes sense to add them for
our own convenience.)
norrington (AT network services) and rattle
(Forum network services) were the two remaining user-visible machines still
to be rebooted for their kernel updates. They were done on Tuesday 15th
and Wednesday 16th respectively, at 13:00.
norrington was afflicted with the console-output problem, whereby pressing
return would result in a few more characters of rsyslog output being
printed rather than the expected login or password prompt. The attempt to
reboot proceeded very slowly, with nothing much apparently happening after
stopping the first component (ntp). However, using om from a connected
shell to stop individual components appeared to work normally. Another
reboot command from that shell resulted in an immediate reboot, though not
a clean one, as the boot-time fsck did recover a couple of journal files
and cleared an orphaned inode.
Thereafter things did appear to proceed as normal, with the usual second
initramfs and then the machine coming up fully.
rattle rebooted cleanly.
We previously reported a few issues which were traced back to the University's central firewall. See RT#77441 and RT#77317 for details. IS have now responded. The email Alastair forwarded to cos on Thursday 17th contains the full details. In summary:
A pass has been made through all of these on paper to identify the necessary changes, based on the list from last time. These will be implemented over the next few weeks, bearing in mind that changes to some will provoke large profile rebuilds.
addrwatch data are now incorporated into the address-search tool, as used by the various netmon address search boxes and the nightly DHCP no-lease reports.
One interesting point this has brought out is that while we do have mechanisms in place to prevent machines which are connected to a "wrong" VLAN from obtaining an IPv4 address, the same is not true for IPv6 addresses. As we are including prefix and DNS information in the RAs on our DICE wires, machines which connect to those will be able to construct IPv6 addresses which will work perfectly well for many purposes. If we consider this to be something which needs to be fixed, the options would seem to be enforcing DHCPv6 and corresponding ND restrictions, in a similar manner to the ARP protection we apply to some IPv4 subnets; or else 802.1X everywhere. For further study.
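To illustrate the point, here is a short sketch of the standard EUI-64 construction a host performs with SLAAC: given an advertised prefix and its own MAC, it can build a working address with no server-side check at all. The prefix and MAC below are made up for illustration; this is not our actual tooling.

```python
import ipaddress

def slaac_address(prefix: str, mac: str) -> ipaddress.IPv6Address:
    """Derive the EUI-64 SLAAC address a host would build from an RA prefix."""
    octets = [int(b, 16) for b in mac.split(":")]
    octets[0] ^= 0x02                                # flip universal/local bit
    iid = bytes(octets[:3] + [0xFF, 0xFE] + octets[3:])  # insert ff:fe
    net = ipaddress.IPv6Network(prefix)
    return net[int.from_bytes(iid, "big")]           # prefix + interface ID

# Documentation prefix and an invented MAC:
print(slaac_address("2001:db8:100::/64", "00:11:22:33:44:55"))
# -> 2001:db8:100:0:211:22ff:fe33:4455
```

Any host on the wire receiving our RAs can do this, which is why the IPv4-style address controls don't help here.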
Version 2.4 has moved from alpha to beta1. There are sufficiently many useful new things in this version that we'll be rolling it out to the DEV (and perhaps DR) endpoints soon. In particular, IPv6 support is now pretty much up to the same level as IPv4.
There's currently a quirk in the SL7 iptables setup whereby the very first Configure() that's normally done on SL6 by a separate boot-time script doesn't get done, so the rules don't get generated. The workaround is just to run om iptables configure by hand. Once the rules are actually there, the scripts will do the right thing.
As mentioned in our report to the previous meeting, we've definitely got a
memory leak somewhere in our console server setup. On
ifconsoles.inf, the leak is connected with the serial
consoles for KVM guests: the associated virsh processes (or
subprocesses spawned by those) appear to be responsible.
In total, we are leaking memory at a peak rate of about 0.25GB/hr,
which is 6GB/day, or 42GB/week - so we have configured a correspondingly
large total swap space on ifconsoles.inf. That, along with the weekly
restart of conserver, should keep things working.
If we can find a better fix, we'll implement it.
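As a quick sanity check on the figures above, the daily and weekly totals follow directly from the peak hourly rate:

```python
leak_rate_gb_per_hr = 0.25
per_day = leak_rate_gb_per_hr * 24   # 6.0 GB/day
per_week = per_day * 7               # 42.0 GB/week
print(per_day, per_week)             # -> 6.0 42.0
```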
We have brought the following SL7 servers into service:
Please contact us with any comments or corrections.
Unless explicitly stated otherwise, all material is copyright The University of Edinburgh