The Dice Project

Operational Meeting Infrastructure Unit Report
27th March 2019

  1. Forum server room UPSes

    As noted in the chatroom last week, the Forum server room UPSes are now running at well over 50% on L1, and close to that on L2 and L3. If we are treating the current UPS system as 1+1 resilient, then this overload is clearly unsustainable, since a single unit could not carry the whole load on its own. An email was therefore sent to all research staff on Monday, to the effect that we cannot accept any new machines unless some corresponding load is turned off first.

    PLEASE DO NOT TURN ON ANY NEW MACHINES in the Forum server rooms without first consulting with us. If necessary HoC or HoS will be called on to arbitrate.

    (On the other hand if, as might have been the case when they were installed, the intention was to provide a higher load capacity then the current load is within spec. However, given the historic unreliability of these UPSes, the worry here is that should one unit develop a fault again and take itself off-line, then the other unit would see a large overload and take itself off-line too, or worse.)
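
    The concern can be put in numbers. The following is a minimal sketch, assuming that the two units share the load equally (so that the survivor of a failure sees double its current load) and that L1 to L3 are the three phases. The percentages are illustrative placeholders, not measured values.

```python
def surviving_unit_load(per_unit_load_pct):
    """Load (as % of one unit's rating) on the surviving unit of a 1+1 pair.

    Assumes the two units currently share the load equally, so the
    survivor picks up twice what it carries now.
    """
    return {phase: 2 * pct for phase, pct in per_unit_load_pct.items()}


def overloaded_phases(per_unit_load_pct, limit_pct=100.0):
    """Phases on which a lone unit would exceed its rating."""
    after_failure = surviving_unit_load(per_unit_load_pct)
    return sorted(p for p, pct in after_failure.items() if pct > limit_pct)


# Illustrative figures only: "well over 50%" on L1, "close to that" on L2/L3.
example = {"L1": 55.0, "L2": 48.0, "L3": 47.0}
print(surviving_unit_load(example))
print(overloaded_phases(example))
```

    Anything above 50% per unit on any one phase means that a single unit cannot carry that phase alone, which is why new load has to be offset by turning something else off.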

    In the medium term, the replacement UPS system that's expected to be installed before July should solve this problem. Note, however, that during the UPS replacement process, if we are unable to carry the server power load on only one of the two UPSes which constitute the current server room UPS system, we will be forced either to reduce overall server load until that does become possible or, otherwise, to run all of our servers with no UPS protection at all.

    In any event, the related problem of physical space for our server population will remain, of course.

    Our (ongoing) inventory of self-managed machines is here.

  2. Wilkie

    Reminder: the date for moving out of Wilkie is fast approaching. If you still have any kit there you will have to arrange to move it all out by 5th April. Recent enhancements to the netmon index pages should make it a bit easier to see what's what. We have stopped nagios monitoring for the switches over there. Ports on "our" VLANs have been turned off.

    Note that the Infrastructure Unit can provide access to the Wilkie IT closets only; we do not have keys to individual rooms.

  3. Appleton Tower datacentre work

    The UPS battery replacement and relocation work happened last Thursday, as scheduled, and seemed to go off without incident. The UPS came back on line not long after mid-day, based on the email it sent us. The nut logs on gatti, which polls it every couple of minutes or so, say that it was off-line shortly before 09:26:48 and back on-line shortly before 12:18:48.
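
    For the record, the length of the outage implied by those two log timestamps can be worked out as below. This is only a sketch: because the UPS is polled every couple of minutes, the result is accurate to within a polling interval or so at each end.

```python
from datetime import datetime


def outage_duration(off_line, on_line, fmt="%H:%M:%S"):
    """Elapsed time between two same-day timestamps given as HH:MM:SS."""
    return datetime.strptime(on_line, fmt) - datetime.strptime(off_line, fmt)


# Timestamps as reported by the nut logs on gatti.
print(outage_duration("09:26:48", "12:18:48"))  # 2:52:00
```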

  4. Generator tests

    Paul Hutton writes:
    "IT Infrastructure, in conjunction with Estates, is planning to carry out "Black Start" testing of the JCMB and Appleton Tower datacentres. This testing will allow us to check that the standby power systems at both datacentres are functioning as expected.

    "The test will involve failing each datacentre's power onto the backup power system, then checking that the UPS, backup power switching infrastructure and generator are performing as we would expect. It is planned to run the datacentre on generator power for three hours, with Estates using this time to monitor the performance of the backup generator, allowing them to check fuel consumption along with other metrics such as oil pressure. In the week before the test, Estates will have checked that the standby generators are functioning correctly and have sufficient fuel for the duration of the test.

    "On the day of the test, staff from IS, Estates and the generator supplier will be onsite to monitor the test, and back out if required.

    "Tests are conducted on a 6-monthly timetable, scheduled in the Spring teaching vacation, and at the end of teaching block 1 in Semester 1.

    "The dates proposed for the next series of tests are

    Both tests will start at 8am, and conclude by midday.

    "I'd be grateful if you could provide the contact details of staff from your team or division who should be updated during the course of the test.

    "Lastly, can you please let me know as soon as possible if these dates present a major "show stopping" problem for your department or division? I'm aware that any testing of this type can be extremely inconvenient for staff, but we're very limited in the times of year when it can take place. IT Infrastructure will post the required alerts in the run up to the tests."

  5. JCMB power work

    Please see this email concerning impending power work at JCMB in June. In summary, there will be two sets of interruptions to the CSR power while the building is transferred to a generator farm to allow for the substation transformers to be replaced; and there will be an additional interruption to revert the CSR power to its normal feed and to test the generator.

    These will take place on:

    We expect our UPSes to provide power to one bar in each of our main CSR racks. All other bars will drop out. The restoration of power to outlets on these will be delayed until such time as we expect the UPS batteries to have recharged.

  6. iptables logging

    For some time now we have configured iptables on most of our edge routers so that packets which reach the very end of the rules are simply dropped, but have kept the DEV OpenVPN endpoint set up to use our global default, which is to log first and then drop everything. The idea here was that it was sometimes useful to get an idea of what was scanning us, as traffic to the unused OpenVPN subnet addresses would indicate clearly that whatever it was couldn't have been initiated at our end.

    (These log messages also go to the console, which is reasonable on a machine with a local display but can overwhelm a serial console, even one running at the fastest speed.)

    Unfortunately it appears that this policy is now unsustainable. The kernel logs from the DEV endpoint are just getting too large, and disc usage on the loghost copernicus is creeping up. Therefore, the default for the "netinf" machines is now to drop everything silently. We can easily override this, but currently don't.

    The global default remains as it has always been: log and then drop.
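
    The difference between the two policies comes down to whether a LOG rule precedes the final DROP. The sketch below generates the tail of a rule set for each case. It is not how our configuration is actually produced: the function name, the chain name and the log prefix are all hypothetical, and only the `-j LOG`, `--log-prefix` and `-j DROP` syntax is standard iptables.

```python
def final_rules(chain, log_before_drop, log_prefix="end-of-rules: "):
    """Return the rules to append at the very end of a chain.

    chain and log_prefix are placeholders chosen for illustration.
    """
    rules = []
    if log_before_drop:
        # Global default: record the packet in the kernel log first ...
        rules.append(f'-A {chain} -j LOG --log-prefix "{log_prefix}"')
    # ... and in either case discard it.
    rules.append(f"-A {chain} -j DROP")
    return rules


print(final_rules("INPUT", log_before_drop=True))   # log, then drop
print(final_rules("INPUT", log_before_drop=False))  # drop silently
```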

  7. snort

    snort is now running again on the Forum and AT "external routers" and OpenVPN endpoints. The order of the sections produced by the "reporting" tool has been shuffled slightly to bring more locally-relevant ones nearer the top.

    In addition to the specfile issues mentioned last time, it turns out that there was also a patch required for the pulledpork tool that we use to fetch the most recent rules. The minor version of snort is now a two-digit number, and apparently quite a few snort-related tools were not prepared for that!
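
    As an illustration of how a two-digit version component can trip up a tool, the sketch below shows one common failure mode: comparing version strings as text rather than as numbers. The version strings are made up, and we are not claiming that this is the specific bug which the pulledpork patch addressed.

```python
def parse_version(text):
    """Split a dotted version string into a tuple of integers."""
    return tuple(int(part) for part in text.split("."))


old, new = "1.9", "1.10"  # made-up versions, for illustration only

# Compared as strings, "1.10" sorts before "1.9" because "1" < "9".
print(new > old)  # False

# Compared component by component as integers, the order is correct.
print(parse_version(new) > parse_version(old))  # True
```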

    As a further complication, the heuristic that we use to determine when an RPM has installed a new configuration file didn't work for the current case where the RPMs were uninstalled and then reinstalled (as opposed to being upgraded). An even more horrible additional check has now been added to catch this case. (As we noted in the IDS project's final report, snort is not nearly as configurable generally as it at first appears if the whole toolchain is taken into account. As a result, the component has to generate its configurations into the "standard" places, while having to take into account that they might be splatted by an RPM at any time; and it can't just generate a new configuration completely from scratch because the "sample" configurations distributed with the packages contain a lot which is very much more than just an example of what to do, with content which changes from version to version.)

  8. SLAAC for server wires

    Forward SLAAC-style addresses for the two "external" wires E42 and E160 were enabled last week. In practice there were no additions as a result of this, as everything on the wires had non-SLAAC addresses anyway. Enabling it for B (64) threw up a makeDNSv6 oddity which will need a little more investigation, so it has been left off for that one for now. Wires R (AT self-managed) and B (transit) are the only two IPv6-enabled subnets which now do not have forward SLAAC-style addresses generated by default.

    As things stand it is not possible to set up an IPv6-only host to have SLAAC-style DNS entries. The MAC address which is used is taken ultimately from the dhclient.mac resource via a spanning map; but setting that resource also instructs the dhcpd component to generate an IPv4 entry, which it can't do and will grumble about by email if there is no IPv4 DNS entry. There would seem to be a couple of possible approaches when the time comes when this is a requirement, as opposed to something that would be nice to have but can be worked around using static addresses for a few cases:

    1. Duplicate (most of) the dhclient.mac resources into a different set of resources, and pass those to the DNS master using a different spanning map. This would allow additional non-IPv4 hosts to be added, but would increase the DNS master's already large profile even more.
    2. Add a new dhclient resource to instruct the dhcpd component not to generate specific IPv4 entries, even though there are MAC addresses for them. (Something like this, perhaps in a more general form, might turn out to be necessary anyway when we come to look at DHCPv6.)

    Post-meeting addition: If you don't mind setting the IPv6 address on your machine then you can use any (valid) IID part. If you really want to have your machine auto-configure itself SLAAC-style, you should let it do so and then copy the IID that it has used into dns/inf6.
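
    For reference, the classic way of deriving a SLAAC-style IID from a MAC address is the modified EUI-64 construction: flip the universal/local bit of the first octet and insert ff:fe in the middle. The sketch below assumes that construction and a /64 prefix; the function names are ours, and the prefix and MAC address are documentation examples only. Hosts configured to use privacy or stable-opaque identifiers will not produce this IID, which is why copying the IID that the machine has actually chosen is the safer route.

```python
import ipaddress


def eui64_iid(mac: str) -> int:
    """Modified EUI-64 interface identifier for a 48-bit MAC address."""
    octets = bytes(int(part, 16) for part in mac.replace("-", ":").split(":"))
    if len(octets) != 6:
        raise ValueError(f"expected a 48-bit MAC address, got {mac!r}")
    # Flip the universal/local bit and insert ff:fe between the two halves.
    eui = bytes([octets[0] ^ 0x02]) + octets[1:3] + b"\xff\xfe" + octets[3:]
    return int.from_bytes(eui, "big")


def slaac_address(prefix: str, mac: str) -> ipaddress.IPv6Address:
    """Combine a /64 prefix with the EUI-64 IID derived from mac."""
    net = ipaddress.IPv6Network(prefix)
    if net.prefixlen != 64:
        raise ValueError("SLAAC-style addresses need a /64 prefix")
    return ipaddress.IPv6Address(int(net.network_address) | eui64_iid(mac))


# Example values only (documentation prefix, made-up MAC address).
print(slaac_address("2001:db8:0:1::/64", "00:11:22:33:44:55"))
# 2001:db8:0:1:211:22ff:fe33:4455
```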

  9. Nagios and IPv6

    Most of our servers are now running as fully-fledged dual-network-stack machines. The consequence is that users of any services offered by these machines might now legitimately be connecting via either IPv4 or IPv6 addresses - and we expect this to be the position for many years to come.

    At the previous Operational Meeting, we were asked to report on the implications of this fact for our Nagios service, the background being that it was noticed that web servers locatable via IPv6 addresses were apparently not being monitored by Nagios.

    For our conclusions on this, please see our auxiliary Nagios and IPv6 report.

    Summary: We collectively have quite a bit of work to do in order to get Nagios monitoring of IPv6 services operational; it certainly won't just "start by itself". It is ironic that we have only started thinking about all of this having been prompted by the apparent lack of monitoring of IPv6 web servers: the translator for the lcfg-apacheconf component is likely to be the most complicated case that we'll have to deal with.

    Related question: Many of our web servers currently have multiple IPv4 addresses in order to implement Apache vhosts, but all currently appear to have only one IPv6 address. What is the intention with respect to IPv6 addressing and vhosts? The answer has obvious implications for the Nagios monitoring discussed here.

  10. DNS query analysis

    It turned out to be useful to have a tool which can analyse BIND querylog files, so there's now one in /usr/lib/net/netman-scripts (on those machines which carry the RPM, including all of the network servers). Amusingly, while testing it on linnaeus aka dns0.inf aka dns.dcs, one run had as its three highest hits AAAA queries for IS's nameservers, and had queries for the "dcs" versions of our nameservers' names more popular than the "inf" versions.

    The tool also tallies queries using IPv4 and those using IPv6. It's roughly 2:1 for IPv4 as against IPv6.
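
    The kind of tally described above can be sketched in a few lines. This is not the netman-scripts tool itself, just an illustration: the regular expression assumes the usual BIND 9 querylog layout ("client <address>#<port> ... query: <name> <class> <type> ..."), which varies a little between BIND versions, and the sample lines are invented.

```python
import ipaddress
import re
from collections import Counter

# Assumed querylog layout; the "@0x..." token only appears in newer BINDs.
QUERY_RE = re.compile(
    r"client (?:@0x[0-9a-fA-F]+ )?(?P<addr>[0-9A-Fa-f:.]+)#\d+ "
    r".*?query: (?P<name>\S+) (?P<cls>\S+) (?P<qtype>\S+)"
)


def tally(lines):
    """Count queries by (type, name) and by the client's address family."""
    by_query, by_family = Counter(), Counter()
    for line in lines:
        match = QUERY_RE.search(line)
        if not match:
            continue
        try:
            version = ipaddress.ip_address(match["addr"]).version
        except ValueError:
            continue  # not a parseable client address; skip the line
        by_query[(match["qtype"], match["name"].lower())] += 1
        by_family[f"IPv{version}"] += 1
    return by_query, by_family


sample = [  # invented lines using documentation addresses
    "27-Mar-2019 09:00:00.001 client 192.0.2.10#40000 (ns.example.org): "
    "query: ns.example.org IN AAAA + (192.0.2.53)",
    "27-Mar-2019 09:00:00.002 client 2001:db8::10#40001 (ns.example.org): "
    "query: ns.example.org IN AAAA + (2001:db8::53)",
    "27-Mar-2019 09:00:00.003 client 192.0.2.11#40002 (www.example.org): "
    "query: www.example.org IN A + (192.0.2.53)",
]
queries, families = tally(sample)
print(queries.most_common(3))
print(dict(families))
```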

  11. Some history

    We mentioned in the chatroom recently (??) that it used to be the case that Solaris boxes liked to use the same MAC address for all their interfaces, but our switches weren't keen on this. The lcfg "object" which fixed this up was rescued from an old Sun a few years ago before it was scrapped, and then lost again; but it has recently resurfaced! It's here for your edification...

  12. Switch firmware

    Since none of the ATL7 guinea-pigs have reported any bad effects from last week's switch firmware upgrade, we'll roll it out more widely. We'll start with the rest of the AT switches, and once it has run there for a while we'll move on to the Forum and Bayes switches.


Please contact us with any comments or corrections.
Unless explicitly stated otherwise, all material is copyright The University of Edinburgh