We did better with this power cut. After the previous one, we'd added the necessary UPS monitoring header and #define, so this time our servers seem to have shut down before the last of the battery drained. That avoided salvaging and fscks.
Another disk failure, this time in one of the research SAN boxes. It went out of contract in December, and they are still deciding on whether to renew, or replace. In the meantime we've replaced the disk with one of our decommissioned 1TB spares.
During the SL7 upgrades of servers, RW volumes were moved around so we could do the upgrades without affecting users. During that time the usual RW -> Offsite RO mappings, recorded on the wiki AFSPartitions page, got out of sync. This page is also the source of information for the server info pages at https://groups.inf.ed.ac.uk/cos/disk_script/servers.html.
All those mappings have now been fixed, and the wiki page, and hence the server info pages, are now correct again. So if support are looking for places to create group space, or move users, then these pages are now accurate.
We had an issue with the server huldra, which we thought would need a reboot to fix. Before rebooting, I like to make sure the serial console is working. This led me to discover the reason for the problem (the messages on the serial console), and so I was able to fix it without the immediate need for a reboot.
However, as I'm sure has been mentioned before, and I know some of us have experienced similar problems, huldra's serial console is showing historical log information, at the time going back to Jan 18th. It's logging the error pretty much every second, e.g.
2017-01-25T05:49:34.430515+00:00 huldra kernel: blk_update_request: I/O error, dev sdap, sector 0
Sending any char to the console prints another 16 chars of logging info. Normally this isn't too bad, and pressing RETURN a few times empties the last of some buffer and you get the login: prompt. In this case about 1,296,000 seconds (15 days) had passed between the error being introduced and being fixed, so there are about that many log messages. Each message is about 97 chars, which needs about 6 chars sent to display it (16 chars at a time). So I reckon I need to send about 7.8 million characters (RETURNs) to the serial console before I get my login prompt back!
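The estimate above can be checked with a few lines of arithmetic. This is only a rough sketch: it assumes exactly one message per second, ignores any line-ending characters, and the variable names are made up for illustration.

```python
import math

# Figures taken from the text above.
ERROR_SECONDS = 1_296_000    # time between the error appearing and being fixed
MESSAGE_CHARS = 97           # length of the example log line
CHARS_PER_KEYPRESS = 16      # chars the console releases per char sent

# Assumption: "pretty much every second" is treated as exactly one per second.
MESSAGES_PER_SECOND = 1

messages = ERROR_SECONDS * MESSAGES_PER_SECOND
days = ERROR_SECONDS / 86_400

# Back-of-envelope version: call it 6 key presses per message.
presses_rounded = messages * round(MESSAGE_CHARS / CHARS_PER_KEYPRESS)

# Slightly stricter version: treat the backlog as one continuous stream
# of chars, drained 16 at a time.
presses_stream = math.ceil(messages * MESSAGE_CHARS / CHARS_PER_KEYPRESS)

print(f"{days:.0f} days of backlog, {messages:,} messages")
print(f"key presses (6 per message): {presses_rounded:,}")
print(f"key presses (continuous stream): {presses_stream:,}")
```

Either way the answer comes out just under 8 million key presses, so the order of magnitude holds whichever way the rounding goes.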
ipmitool mc reset cold does not flush the buffer. And I seem to recall that if it were to reboot, then the process would stall until it had echoed out all those chars (16 at a time per key press). I believe only a cold power off clears the buffer, implying it is some hardware buffer in the Dell that's storing this output?
For now I have a script running away sending RETURN chars to the console. It's been running since the weekend, and has got the log up to Jan 25th. So perhaps by the end of the week I'll have a login: prompt again.
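The script itself isn't reproduced here. As a minimal sketch of the idea, assuming the console can be reached as a writable byte stream (for example the stdin of whatever console client is in use), something like the following would do. The function name, batch size and delay are hypothetical, and the dry run writes to an in-memory buffer rather than a real console.

```python
import io
import time


def send_returns(stream, total, batch=64, delay=0.0):
    """Write `total` carriage returns to `stream` in batches.

    `stream` is any binary file-like object with write() and flush().
    Returns the number of carriage returns written.
    """
    sent = 0
    while sent < total:
        n = min(batch, total - sent)
        stream.write(b"\r" * n)
        stream.flush()
        sent += n
        if delay:
            # Pause between batches so the console has time to echo its backlog.
            time.sleep(delay)
    return sent


# Dry run against an in-memory buffer standing in for the console.
fake_console = io.BytesIO()
sent = send_returns(fake_console, total=1000, batch=64)
print(sent, len(fake_console.getvalue()))
```

Sending in batches with an optional delay is just a guess at what keeps the console from being swamped; the right values would need to be found by experiment.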
If I had the time, I might try switching to a physical serial console (rather than IPMI) to see if that makes a difference.
Is anyone keeping a secret fix for this to themselves?
We seem to have 95 machines with the "services-unit.h" header. 40 of them are SL7, so 55 to go.
Unit meets are linked from the ServicesUnitMeetings wiki page.
We have been meeting, but we've not been so good at getting the minutes/notes published recently.
Unless explicitly stated otherwise, all material is copyright The University of Edinburgh