The AT comms UPS #1 turned itself off at self-test time on Tuesday 17th March. This is very reminiscent of the behaviour of one of the Forum comms UPSes in December 2012, which resulted in us having to replace the UPS. However, it didn't turn itself off at its subsequent self-test, on March 31st. The next one is due on Tuesday 14th ...
ipmi-sensors
As mentioned in our report for the previous meeting, our machine gatti (a Dell PE R320) started correctly reporting a failed PSU via nagios after its UPS supply had temporarily switched off - but it then incorrectly kept reporting the same fault after the UPS power supply had returned. We've now seen the same behaviour on another machine, namely the User Support machine hyde, a Dell PE R420.
Our conclusion is that the current power supply check being done by the hwmon script via ipmi-sensors is, unfortunately, not reliable.
We suspect that this problem might be fixed by firmware upgrades to the BMC - but we haven't tried that yet. It may also be that Dell-specific utilities (rather than standard IPMI tools) would get better results when interrogating the BMC. Or maybe something else is wrong, and/or misconfigured, somewhere!
Summary: this isn't just an Inf Unit problem; it potentially affects all of the School's servers.
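For the record, the sort of check in question looks roughly like the sketch below. This is an illustration only, not the actual hwmon code: it assumes FreeIPMI's ipmi-sensors with its default pipe-separated output, and the 'healthy' event strings and nagios exit codes are assumptions which will vary by BMC and firmware - which is exactly where the unreliability seems to lie.

```python
#!/usr/bin/env python3
# Illustrative sketch of a PSU check using FreeIPMI's ipmi-sensors.
# Output layout and event strings vary between BMCs and firmware versions;
# the real hwmon check may well differ.
import subprocess
import sys

# Assumed 'healthy' event strings - a failed or unpowered PSU typically
# reports something like 'Failure detected' or 'Power Supply AC lost'.
HEALTHY = {"'Presence detected'", "'Fully Redundant'", "N/A"}

def psu_events():
    """Return (sensor_name, event_text) pairs for Power Supply sensors."""
    out = subprocess.run(
        ["ipmi-sensors", "--sensor-types=Power_Supply"],
        capture_output=True, text=True, check=True,
    ).stdout
    events = []
    for line in out.splitlines()[1:]:            # skip the header row
        fields = [f.strip() for f in line.split("|")]
        if len(fields) >= 6:
            events.append((fields[1], fields[5]))  # Name, Event columns
    return events

def main():
    bad = [(name, event) for name, event in psu_events() if event not in HEALTHY]
    if bad:
        print("CRITICAL: " + "; ".join(f"{n}: {e}" for n, e in bad))
        sys.exit(2)                              # nagios CRITICAL
    print("OK: all power supply sensors nominal")
    sys.exit(0)

if __name__ == "__main__":
    main()
```

A check of this shape is only as trustworthy as the BMC's sensor data, which is why a stale 'failure' event persisting after power returns defeats it.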
The "old-style" OpenVPN endpoints were turned off on Tuesday 31st at around 07:30 (RT#71252). There had previously been two blog articles, two sys-announce messages and one round of personal reminder emails. There had still been 19 users (including a couple of COs!) from the weekend to the turn-off time. So far there don't appear to have been any RT tickets as a result.
The /23 subnet released by this has been recycled for ATLABS use.
Another rack is being installed in the SMSR, as we're aware of quite a few U's worth of kit in the pipeline. After this one, there's no viable space for any more racks, nor is there power under the floor for any more. So, once the remaining space has filled up, we'll have to start turfing old machines out.
The cluster has been removed from the two racks by the door, and those racks have been renamed 'Rack 15' and 'Rack 16' to match our existing pattern. For the moment, we'll use those racks as a place to put some self-managed rack-mount kit, and as a temporary place to park old machines (RT#71308). This should allow us to get the remaining 'main' rack (Rack 14) into use.
Tardis will be decanting to Rack 15, with access strictly by chaperone. We've 'upgraded' the switch which serves Racks 15 and 16 and the shelving, so that we can turn on a bit more protection than was possible with the ancient 2824 and 2650 switches previously in use.
The two AT-basement server rack switches were replaced on Thursday 26th and Friday 27th March. All seemed to go smoothly, with no reports of vital things going off the air due to bonding issues. Thanks to all involved!
Automatic processing of the Prometheus lifecycle (that is, sending email to accounts entering their grace period, and disabling accounts which have exceeded it) will go live once some issues with database accounts flip-flopping have been addressed.
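By way of illustration, a minimal sketch of the grace-period logic described above; the class, field and function names here (Account, grace_started, GRACE_DAYS, and the notify/disable hooks) are hypothetical placeholders, not the actual Prometheus implementation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

GRACE_DAYS = 90          # hypothetical grace-period length

@dataclass
class Account:           # hypothetical stand-in for a Prometheus account record
    uid: str
    entitled: bool
    grace_started: Optional[datetime] = None
    disabled: bool = False

def send_grace_warning(acct: Account) -> None:
    print(f"mail {acct.uid}: your account has entered its grace period")

def disable_account(acct: Account) -> None:
    acct.disabled = True
    print(f"disable {acct.uid}: grace period exceeded")

def process_account(acct: Account, today: Optional[datetime] = None) -> None:
    """One lifecycle pass: mail on entering grace, disable once grace expires."""
    today = today or datetime.utcnow()
    if acct.entitled:
        acct.grace_started = None            # entitlement (re)appeared: leave grace
        return
    if acct.grace_started is None:
        acct.grace_started = today           # entering grace: warn the user
        send_grace_warning(acct)
    elif today - acct.grace_started > timedelta(days=GRACE_DAYS):
        disable_account(acct)                # grace exhausted: disable
```

An account whose upstream database record flip-flops would repeatedly leave and re-enter grace under logic of this shape, generating spurious warning mail - presumably the sort of issue that needs sorting out before this goes live.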
Note that the Prometheus documentation has been given a revamp/update - see https://wiki.inf.ed.ac.uk/DICE/PrometheusOverview.
In February 2014, we were advised that ESISS scans would come from addresses in the two ranges 62.69.82.0/28 and 46.17.59.192/27, but that advice also mentioned that the range of source addresses 'might expand in the future.'
Last week, we noticed what appeared to be an ESISS scan coming from the address 158.255.227.56.
We are trying to clarify this, but so far have no definite information.
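For what it's worth, the address seen last week does fall outside both of the ranges we were given in 2014, as a quick check with Python's ipaddress module confirms:

```python
from ipaddress import ip_address, ip_network

# The two ranges advised in February 2014, and the address seen last week.
advertised = [ip_network("62.69.82.0/28"), ip_network("46.17.59.192/27")]
seen = ip_address("158.255.227.56")

print(any(seen in net for net in advertised))   # False - not in either range
```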
Please contact us with any comments or corrections.
Unless explicitly stated otherwise, all material is copyright The University of Edinburgh.