The minutes for 10th June were brought up for approval. Ian stated that the part concerning the reconfiguration of the GPU machine PSUs, in particular the statement that the meeting had agreed that this should be done, did not match his recollection of the meeting. None of the other attendees could confirm events either way, so the convener stated that, since this very topic was to be discussed again at the end of this meeting, he would wait until that discussion had taken place and then amend the minutes if necessary.
There are still a few outstanding. Both Toby and Chris pointed out that many of these concerned things which are tricky to check out from home and that there might therefore be legitimate reasons for delaying checking these pages.
Actions added or revived:
Blog articles discussed:
Alison believes she has written this, but may not have published it.
Report from Computing Executive Group
Reports from Units
Now scheduled for 23/6/2020.
Chris mentioned that he has copied this information into the MPU XRDP documentation.
Chris mentioned that there will be more KVM server reboots coming up but that they should cause minimal disruption.
Neil asked if there was any update on the XRDP certificate issue. Chris replied that this was already on the agenda for the next MPU meeting.
Now sitting in the Physics storeroom.
It was clarified that 'onsite' RT tickets did not need to go into a special queue.
Topics for discussion
Iain reiterated his plans for changing the configuration of some ILCC GPU servers so that the PSUs are in non-redundant mode, thus sharing the load on the PDUs across all the PSUs. Iain has raised this possibility with ILCC and they are happy for Iain to go ahead, particularly as the loss of one PDU means that their file servers will go down anyway.
Ian reminded the meeting that there are actually two power banks per PDU and it is the individual banks which are susceptible to overloading. Currently on PDU S33, there are 7 connections in total, 5 on bank 1 and 2 on bank 2. This needs to be balanced out. Ian also pointed out that the rack this PDU was attached to was full and that if all the machines in the rack were to start using the same PDU, it would be overwhelmed in any case. Perhaps we are installing too many of these machines per rack?
Ian also observed that, from his reading of the T630 documentation, this hardware has two redundant PSU options: redundant with a hot spare, or redundant with no hot spare. He recommended the latter option. Iain said that he hadn't been able to find that option in the machine's web interface, but Ian assured him that it was there.
Ian observed that it is good that we are finally getting some clarity on what the power requirements are for GPU servers. At the moment, the best we can probably do is to make sure that power connections for these machines are equally distributed between PDUs and that the active PSUs of machines (as much as that can be determined) are equally distributed across individual banks of the PDUs. Unfortunately, until we have physical access to the server room, it will be difficult to check what the current situation is, though Iain did suggest a couple of ways in which this might be done remotely. He also observed that, in his experience, the active PSU of these machines seemed to swap over in an apparently random fashion.
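As an illustration only of the kind of balancing described above, the following Python sketch spreads machines across PDU banks using a simple greedy rule: place the machine with the largest draw on whichever bank currently carries the least load. The function names, the bank labels and the per-machine current figures are all hypothetical and were not discussed at the meeting; real figures would have to come from the PDUs or the machines themselves.

```python
from typing import Dict, List


def balance_across_banks(draws_amps: Dict[str, float],
                         banks: List[str]) -> Dict[str, List[str]]:
    """Greedy assignment: largest draw first, onto the least-loaded bank.

    draws_amps maps a (hypothetical) machine name to its estimated
    current draw in amps. Ties are broken by name so the result is
    deterministic.
    """
    assignment: Dict[str, List[str]] = {bank: [] for bank in banks}
    load: Dict[str, float] = {bank: 0.0 for bank in banks}
    ordered = sorted(draws_amps.items(), key=lambda kv: (-kv[1], kv[0]))
    for machine, amps in ordered:
        target = min(banks, key=lambda b: (load[b], b))
        assignment[target].append(machine)
        load[target] += amps
    return assignment


def bank_loads(assignment: Dict[str, List[str]],
               draws_amps: Dict[str, float]) -> Dict[str, float]:
    """Total estimated current per bank for a given assignment."""
    return {bank: sum(draws_amps[m] for m in machines)
            for bank, machines in assignment.items()}


if __name__ == "__main__":
    # Seven connections, as on PDU S33; the 2.0 A figure is an assumption.
    draws = {f"gpu{i}": 2.0 for i in range(1, 8)}
    result = balance_across_banks(draws, ["bank1", "bank2"])
    print(result)
    print(bank_loads(result, draws))
```

With seven equal loads and two banks this gives a 4/3 split rather than the 5/2 split reported for S33. A greedy rule like this does not guarantee the best possible split for unequal loads, and it ignores the observation that the active PSU can swap over unpredictably.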
Neil pointed out that it would be important to label the racks so that people working on the racks would know if a machine's PSU could be disconnected without bringing the whole machine down. Iain said that this was already done for those racks which contained GPU machines with non-redundant power supplies. He also pointed out that his investigations had shown that the current pulled by all Dell servers, not just GPU servers, was surprisingly high and that this issue should not be considered to be relevant only to GPU racks. George mentioned that he was taking the question of redundant power in the server rooms to CEG for discussion. Graham, Iain, Ian and Neil all agreed that redundant power in the non-GPU racks was very important, Ian pointing out the occasional necessity to replace PDUs. Though the final decision will rest with CEG, there seemed to be some agreement that power redundancy was something that we might not be able to offer on specific GPU racks. Looking ahead, Neil wondered if we might be able to use different coloured power cords to indicate whether a power supply was redundant or not.
Iain will make the proposed changes to the GPU machines' PSU configurations and report back.
The next meeting will be on Wednesday 24th June 2020 online at 10am.
Unless explicitly stated otherwise, all material is copyright The University of Edinburgh