core1 switch ended up in its top-half-only state and was rebooted to recover the lower-half cards. This was later traced to a non-optimal use of circuits and local UPSes, which has since been fixed.
sr04 failed its self-test. We configured a spare and swapped it in, with some help from Gilbert. (The failed switch is now being replaced by HP.)
s17.pdu) completely lost its configuration. It did the same thing during a previous power outage. The effect was that its outlets came up in an apparently random on/off state. We've now reconfigured it, but we suggest that it be replaced.
bw05.pdu) came up without its network interface. A paper-clip 'reset' had no effect, but fully power-cycling it brought it back. Obviously, fully power-cycling a power bar while it's in use would be very disruptive. We suggest treating this power bar as suspect, and withdrawing it from use when the Beowulf goes.
u03 in the self-managed server room tripped its 20A circuit-breaker in the circuit board. The working assumption is that, since the power-on of the individual sockets in these power bars isn't staged (and in fact can't be), the initial surge was too much. The 'normal' load on this power bar is modest: only 6A or so.
tycho's BIOS settings weren't quite right, and as a result it didn't automatically reboot when power was restored.
reeves also didn't automatically reboot when power was restored.
linnaeus got stuck at the GRUB menu when rebooting. As it has a serial console this was probably just bad luck.
mckinley appeared to get stuck at the 'Configuring memory' message at the very earliest stage of the machine powering on. Cycling the power fixed it.
blatiere isn't on any of the local UPSes. Perhaps it should be.
hickox; since moved to
abbado) had gone down, and wasn't helped by the unreasonably long timeout values used by the cgi-bin script. There doesn't appear to be any way to tune this in NUT itself, unfortunately, so turning down
tcp_syn_retries on the netInf machines (on which the script runs to generate the status page) might well be necessary.
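To get a feel for how much lowering tcp_syn_retries would shorten the wait on a dead host, here is a rough sketch of the arithmetic. It assumes the usual Linux behaviour: the initial SYN and each retransmission wait one retransmission timeout, which doubles after every attempt. The 1-second initial timeout and 120-second cap are assumptions about the running kernel (older kernels started at 3 seconds), and the function name is purely illustrative.

```python
def syn_connect_timeout(syn_retries, initial_rto=1.0, max_rto=120.0):
    """Approximate seconds before connect() gives up on a silent host.

    Assumption: the first SYN plus `syn_retries` retransmissions each wait
    one RTO, and the RTO doubles after every attempt, capped at `max_rto`.
    """
    total = 0.0
    rto = initial_rto
    for _ in range(syn_retries + 1):  # initial SYN + the retransmissions
        total += rto
        rto = min(rto * 2, max_rto)
    return total


if __name__ == "__main__":
    # Compare candidate settings of net.ipv4.tcp_syn_retries.
    for retries in range(1, 7):
        print(retries, syn_connect_timeout(retries))
```

Under these assumptions a setting of 6 gives roughly two minutes per unreachable host, while a setting of 2 gives a few seconds. The real figure depends on the kernel in use, so it is worth checking the current value with `sysctl net.ipv4.tcp_syn_retries` before changing anything.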
We could do with another round of reboots for switch firmware upgrades. Pending include:
core2 in the Forum and
atc1 in AT have been done. The other two will have some user-visible effects while they're down. There are FPGA and boot-ROM upgrades included, and the whole process takes around 4 minutes. HP classify this particular upgrade as "Recommended - Refers to important product or system information. Suggested improvements should be applied at your earliest convenience. There is potential for sub-optimal performance, lock-ups or unwanted shutdowns. This will protect your system from a serious, but not unrecoverable failure."
cs0 in JCMB. This is essentially the same version as for the Forum and AT core.
HP have tracked down our multicast-router-discovery bug (RT#62354) and there'll be a fix for it in the next release of the 2610 firmware. Meantime we have a pre-release version for testing, which we'll try out on some likely switches and then roll out if it seems to work.
There is apparently power work proposed for AT. Alan Reid writes: "Due to essential cabling works associated with the redevelopment of 50 George Square, which requires a connection to the AT essential services board, a complete power shutdown of Appleton Tower is required.
"This will affect the power to the server rooms, however the generator will provide power for the duration of the works.
"The generator will be tested fully beforehand but the switchover mechanism can't be tested until the power is switched off so if it fails to switch over for any reason, the work will be abandoned and rearranged for a later date (but not much later!)
"The risk to services can be considered low but there is of course a risk.
"This work must be done as soon as possible, therefore the proposed date and time is - Saturday 23rd November, 08.00-12.00.
"Please let me know by 16.00 on Friday 15th (I'm away Wed & Thur) if there are any objections to this.
"Apologies for the short notice but this has just been flagged to us."
Following on from the previous batch of emergency upgrades, IS have announced the upgrade of the AT router for 07:00 on Tuesday 19th. This will affect bridged traffic (Wire-B aka Transit, VoIP phones, wireless). Routed traffic will fail over.
IS have now more or less got the charging model for 10Gbps EdLAN links in place, and are nearly ready to upgrade us. The most likely approach will be to upgrade our AT link from 2x1Gbps to 1x10Gbps, adjust our Forum router VLAN assignments so that most external traffic goes through the now-faster AT link, and keep the (1Gbps) OC link mainly as a backup.
We could also upgrade our OC link. It's less obvious that this would be useful, as at the moment we don't have the (Linux) routing capacity to drive it much faster than 1Gbps. If we did want to go faster we would have to add 10GbaseT to the Forum core (though that's perhaps something we should be thinking about doing anyway), add 2 10GbaseT ports to (at least) the main external router, and upgrade the core3 switch to something 10G-capable. And pay the one-off EdLAN second-link upgrade fee.
We do have the switch capacity at JCMB to go to 10Gbps, so the decision as to whether to upgrade that one is "just" down to money. We are seeing regular discards on that link though (graph).
EdLAN network diagrams are here.
Mahesh's group have asked for .11a to be turned off for the Forum upper floors again. (RT#63035)
Following on from last time...
ac.uk is signed, but ed.ac.uk isn't yet. Ideally it would be, so that our own zones would then be part of the chain from the root. There might still be some advantage in signing our own zones beforehand, though, both to gain experience and to secure them internally by deploying local trust anchors.
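The chain-of-trust point can be illustrated with a toy model. This is only a sketch: it treats every label as a zone cut, assumes a signed zone always has the matching DS record in its signed parent, and uses a hypothetical child zone, example.ed.ac.uk, to stand in for one of our own. The function names are made up for the illustration.

```python
def parent(zone):
    """Return the parent of `zone`, or None for the root (".")."""
    if zone == ".":
        return None
    _, _, rest = zone.partition(".")
    return rest or "."


def trust_anchor_for(zone, signed, local_anchors=frozenset()):
    """Return the anchor that can validate `zone`, or None if the chain breaks.

    Walk up from the zone: every zone on the way must be signed, and the
    walk stops at the first locally configured anchor or at the root.
    """
    current = zone
    while current is not None:
        if current not in signed:
            return None  # unsigned link: nothing below it validates from here
        if current in local_anchors or current == ".":
            return current
        current = parent(current)
    return None


if __name__ == "__main__":
    # Hypothetical state: our zone is signed, but ed.ac.uk is not.
    signed = {".", "uk", "ac.uk", "example.ed.ac.uk"}
    print(trust_anchor_for("example.ed.ac.uk", signed))
    print(trust_anchor_for("example.ed.ac.uk", signed, {"example.ed.ac.uk"}))
    print(trust_anchor_for("example.ed.ac.uk", signed | {"ed.ac.uk"}))
```

In this model the zone cannot be validated from the root while ed.ac.uk is unsigned, but a resolver configured with a local trust anchor for the zone can still validate it. Once ed.ac.uk is signed, the root anchor alone is enough.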
We're now hosting a "ping7" machine in the Forum server room for IS (RT#64424). Status pages can be found here and here.
The new cosign server
mcintyre is now in
berlin will be retired this week - probably
early on Thursday morning.
Please contact us with any comments or corrections.
Unless explicitly stated otherwise, all material is copyright The University of Edinburgh