E&B write: "I wanted to make you aware that there will be one further shutdown of the mains cooling from 7am on Thursday morning. As before, this is required to allow the new cooling pipework to be connected into the existing network.
"Arrangements are in place to run the standby chiller for the Appleton Tower datacentre to cover this period and therefore this email is for information only.
"The mains cooling should be reinstated later on in the day on Thursday."
Dave H is querying the details, particularly as the Forum would be affected. More as we know it...
There was a Forum power-down on Saturday April 14th. We shut down some systems (particularly those with master data) in advance, but left most of the network infrastructure to look after itself. This mostly happened according to plan, though we did find afterwards that some settings were wrong (now fixed).
Two switches failed to come back after the power-down: one in the main server room, and sr22 in the SMSR. The hot-spare was swapped in for the latter, on the basis that the former has a bonding partner (sr09). There then followed a lengthy discussion with HPE over whether they would actually replace two failed switches under lifetime warranty, which they eventually did "this time". It's now being followed up with our supplier.
Based on the UPS emails, and the nut and snmptrap logs, here's a rough timeline (all in, or converted to, BST):
crystal's UPS (in the 5A closet) goes on-battery
crystal's UPS runs down
rattle's UPS is recharged to 15% and turns on
knussen's UPS is recharged to 15% and turns on
dutoit's UPS is recharged to 15% and turns on
We speculated last time as to whether "the other" 3kVA UPS would recalibrate as a result of the power-down to match the one we tested. It didn't. It's presumably not due to battery age, as it actually has a more recently replaced battery. The age of the units is very similar (mid-2005), so it's probably not that either. Recharge times were all rather faster than the test too. A mystery...
The s04 Forum bar reports that it's getting near overload on bank 1 (the one nearest to the input cable). It appears that there is a group of heavy-load machines located low down in rack 2. Those machines' managers might want to liaise with each other to rearrange things to spread the load across both banks. If the bank breaker trips, the machines will all go off.
The Forum bars have a status page here (and there are corresponding pages for AT and JCMB linked from those sites' netmon pages). Loads are reported in deciAmps, because that's what the bars themselves use. For a bar with two banks (Forum, AT) the first load value is the total value, and the second and third are for bank 1 and bank 2. For bars without separate banks (JCMB) only the total load is reported.
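As a concrete illustration of that reading convention, here's a minimal Python sketch (the bar name, readings and 16A bank rating are all invented for the example) which converts the deciamp values and warns when a bank is getting close to its breaker rating:

    def check_bar(name, total_dA, bank1_dA, bank2_dA, bank_limit_A=16.0):
        """Readings arrive as (total, bank 1, bank 2), all in deciamps."""
        if abs(total_dA - (bank1_dA + bank2_dA)) > 5:
            print(f"{name}: bank readings don't sum to the total")
        for bank, load_dA in (("bank 1", bank1_dA), ("bank 2", bank2_dA)):
            load_A = load_dA / 10.0              # deciamps -> amps
            if load_A > 0.9 * bank_limit_A:
                print(f"{name} {bank}: {load_A:.1f}A of {bank_limit_A:.0f}A - near overload")

    check_bar("s04", total_dA=210, bank1_dA=150, bank2_dA=60)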
Test started on schedule at 08:05. "The UPS and generator behaved as expected..."
(11:12) "The 'Black Start' power testing at the Appleton Tower datacentre is complete. No problems were noted during the testing. Thanks to everyone for their help and cooperation."
It appears we didn't ever set class of service (CoS) for the phones VLAN in Appleton Tower. We have CoS=6 in the Forum, which prioritises voice traffic over (most) data traffic. We propose setting CoS=6 in AT too.
Note that this actually affects the Forum phones too, in principle, as traffic for them is bridged from EdLAN-AT via our AT core. Whether EdLAN has any CoS setting is another question; but at least if we have it set then within our own network there shouldn't be any level-related issues caused by our own traffic.
(CoS is carried in the VLAN tag field of the ethernet packet. It's therefore not possible for untagged ports to attempt to increase traffic priority. There might, however, be a case for setting CoS for private VLANs, as some of those are passed tagged to the end ports.)
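To make that concrete, here's a minimal scapy sketch (the VLAN number and addresses are made up for the example; scapy calls the PCP field "prio") showing where the CoS value sits in a tagged frame:

    from scapy.all import Ether, Dot1Q, IP, UDP

    # Build an 802.1Q-tagged frame with PCP (CoS) 6, as proposed for voice.
    frame = (Ether(src="02:00:00:00:00:01", dst="02:00:00:00:00:02")
             / Dot1Q(prio=6, vlan=200)     # prio = the 3-bit 802.1p field in the tag
             / IP(dst="192.0.2.10")
             / UDP(dport=5060))

    frame.show()    # the priority appears in the Dot1Q layer, not the payload

An untagged frame has no Dot1Q header at all, which is why hosts on untagged ports have nowhere to put a raised priority.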
The installation of charmoz (a Dell R330) revealed what seems to be a problem with the current iDRAC firmware - or, at least, a problem which manifests itself in the interface between that firmware and ipmitool. Namely: an attempt to use ipmitool to set the root password of the iDRAC to our canonical password fails - and, what's worse, it fails silently.
Watching the exchange on the wire, what's happening is that the handshaking/version-negotiation between ipmitool and the iDRAC results in a 16-character-maximum password (i.e. an IPMI v1.5 password) being agreed on, after which ipmitool therefore sends only the first 16 characters of our 20-character password.
Note that this problem occurs whether ipmitool is communicating with the iDRAC over the network, or locally (using the IPMI system interface).
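The 16-character limit comes straight from the protocol: IPMI v1.5 carries passwords in a fixed 16-byte field, padded with NULs, while v2.0 extended this to 20 bytes. Here's a rough sketch of that rule (ours, for illustration - not ipmitool's actual code):

    def ipmi_password_field(password: str, v2: bool) -> bytes:
        """Truncate or NUL-pad a password to the fixed IPMI field size."""
        size = 20 if v2 else 16
        return password.encode("ascii")[:size].ljust(size, b"\0")

    canonical = "abcdefghijklmnopqrst"               # stand-in 20-character password
    print(ipmi_password_field(canonical, v2=False))  # only the first 16 bytes survive
    print(ipmi_password_field(canonical, v2=True))   # all 20 bytes fit

So once a v1.5-style session has been negotiated, the truncation is silent and by design.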
The workaround - for now - is to explicitly set the iDRAC's root user's password using the BIOS - and then simply not to use our usual ipmitool-based method to change it.