Chris
Stephen has sent minor corrections to Craig.
Actions completed:
It was reported, however, that the BP server room has now been cleared of machines. The technicians have been asked to move remaining racks. Units should now be focusing on planning the move of servers from FH.
Actions added:
Ensure that shutting down the Beowulf cluster is more manageable.
Ensure that netmon is more readily available.
Ensure that all machines in the server room have their corresponding entries in the fpdu/sxx.outlets maps labelled correctly.
Decide on an appropriate procedure for shutting down self-managed servers.
Investigate best use of temperature monitoring.
Define the criticality of machines.
Text messages to be sent to willing COs and policy on subsequent action to be decided.
Reports from units.
More details on the upgrade will be sent out when available, but Stephen commented that we will need to plan when and how to roll it out.
Stephen pointed out that this issue could limit what we can do from home.
This has been introduced to spread the load on the cache servers, e.g. for the MATLAB upgrade this Friday.
Tim asked that we try the new query tools as per instructions in the report.
Later this week, we will be sending out an e-mail inviting users for an informal chat over coffee. We hope that this will continue on a regular basis. We will also announce "CO clinics", to be held approximately every 2 weeks (with the day of the week and time rotating), which are aimed at giving users the opportunity to meet COs and have their technical questions answered. The exact format has not been agreed and will largely depend on the uptake of these sessions and the type of questions that are asked. We would aim to have a representative from each unit present.
It was generally agreed that there is little more that we could do to prepare for any future power outage and that the outage in December had been managed well. The next power outage will be for a longer period of time (approximately 1 day) and it is quite possible that it will be at a weekend. We have no more information as to when the next power shutdown will be.
Alastair went through a number of observations:
We cannot currently consider AT as a backup as it uses the same cooling system.
The backup cooling hadn't fired up but tests show it to be OK.
We could perhaps investigate using the backup cooling as the main source over holiday periods.
There is an extractor (switch outside the server room door) which we should test to see how effective it is.
Shutting down the Beowulf cluster was problematic.
Using netmon.inf from outwith the University is unreliable.
Some machines in the server room do not have their corresponding entries in the fpdu/sxx.outlets maps labelled correctly.
We cannot cleanly shut down machines in the self-managed server room. We need to come up with a suitable solution: should we hold passwords for these machines, to be used in an emergency?
We could define the criticality of machines and use a live header to shut down machines on the basis of this criticality. Tim noted that the criticality of a machine may change depending on various factors (day of week, semester times etc.).
Machines have temperature monitoring. We should be using this. We could perhaps set the temperature at which a machine would be shut down at a lower level for less critical machines.
Only Alastair currently receives text messages about temperature issues. It was suggested that more people might also want to receive such texts. How exactly this would work needs some more thought, but as long as it was simple to opt in and opt out, most people agreed that this would be a good idea. A suggestion was made that as soon as a person received such a message, they would join the chat room where people could decide on the best plan of action.
It is not necessarily easy to work out storage array dependencies. It was suggested that a wiki could be generated automatically to help with this.
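As a rough illustration of the criticality and temperature observations above, the sketch below lowers the shutdown temperature for less critical machines and lets criticality vary with day of week and semester. The criticality levels, the temperature values, the `teaching` flag and all function names are assumptions made purely for illustration. None of them were agreed at the meeting, and this is not an existing tool.

```python
from dataclasses import dataclass
from datetime import datetime

# Hypothetical criticality levels: 1 = most critical, 3 = least critical.
# Assumed shutdown temperatures in degrees C; real values would need agreeing.
SHUTDOWN_TEMP_C = {1: 45.0, 2: 40.0, 3: 35.0}


@dataclass
class Machine:
    name: str
    base_criticality: int   # 1 (most critical) to 3 (least critical)
    teaching: bool = False  # hypothetical flag: used for teaching in semester


def effective_criticality(machine, now, in_semester):
    """Adjust criticality by day of week and semester (assumed rule)."""
    level = machine.base_criticality
    is_weekday = now.weekday() < 5  # Monday is 0, Friday is 4
    if machine.teaching and in_semester and is_weekday:
        level = max(1, level - 1)   # treat as one step more critical
    return level


def machines_to_shut_down(machines, room_temp_c, now, in_semester):
    """Return names of machines whose threshold is reached, least critical first."""
    due = []
    for m in machines:
        level = effective_criticality(m, now, in_semester)
        if room_temp_c >= SHUTDOWN_TEMP_C[level]:
            due.append((level, m.name))
    # Least critical (highest level number) first, then by name.
    due.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in due]


if __name__ == "__main__":
    machines = [
        Machine("web", 1),
        Machine("lab", 3, teaching=True),
        Machine("scratch", 3),
    ]
    weekday = datetime(2010, 1, 27, 10, 0)  # a Wednesday
    print(machines_to_shut_down(machines, 36.0, weekday, in_semester=True))
```

With these assumed thresholds, a room temperature of 36 degrees on a weekday in semester would list only the `scratch` machine. The same temperature at a weekend would also list `lab`, since its teaching role no longer raises its criticality.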
AOCB
The next Operational meeting will be on Wednesday 27th January in IF-4.31, chaired by Alison Downie.
Please contact us with any comments or corrections.
Unless explicitly stated otherwise, all material is copyright The University of Edinburgh