Apologies for absence
Alison, Archie and Craig had sent their apologies.
Minutes of the last meeting.
These were accepted.
Report from Computing Executive Group
This item was not taken.
Reports from units.
Infrastructure.
Toby reported that the KDC and LDAP slave servers, authportal and the lcfg2ldap machines had in fact all been upgraded to FC5 before the Christmas holiday. He will be pushing out the latest LDAP upgrade (release 2.3.32) to all machines next week. He commented on his investigations into why slapd was apparently taking up so much memory; it appears that this is due to a large amount of memory (approximately 300MB) being reserved for BDB. Changing this will require a modified openldap component because changing the resource alone is not enough: the daemon also needs to be restarted. The replacement of basilisk by a new FC5 server is now constrained by Simon's availability; it is likely to take place in February.
George reported that the FC5 routers were still up, confirming their increased stability after the introduction of a modified kernel. An even newer special kernel has now also been installed on them and they have stayed up for at least two days, which is again reassuring. He will now embark on the process of upgrading the other routers to FC5. He has moved the routers out of the development release into the testing release and will shortly move them into the stable release. This has partly been made possible by allowing test kernels to be specified for a profile by including the test-kernel.h header from the live headers area.
Iain plans to upgrade exeter.inf, one of the JCMB console servers, to FC5 tomorrow. If all goes well during this second attempt then he might be able to upgrade the other console servers very soon.
The network has been basically stable recently. George has installed new firmware in the core switches and the SRIF switches. EdLAN has had a few problems, especially the part affected by the Hugh Robson Building router.
When one of solti's RAID disks failed recently (it has since been replaced) and the router was shut down, it became apparent that some of the MDP machines were seriously affected. It is not clear why this was and it will need to be investigated. The shutdown also affected tftp transfers during PXE installs, apparently because the dhcp servers were still sending out the address of the router that was down.
George has upgraded the firmware on most of the wireless access points and has booted this new firmware on all of them except for those in Buccleuch Place.
Managed Platform.
Stephen reported that there has been some intensive work on merging all the Solaris support back into the standard headers managed via svn. The Solaris server profiles are still being compiled from old headers managed via rfe. A test profile is being built for each of the Solaris hosts using the new headers and this is then compared with the existing profile. Once the differences between these pairs of profiles are insignificant for all hosts then the Solaris profiles will be built from the standard headers rather than the old rfe managed ones.
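The comparison step described above could be sketched roughly as follows. This is only an illustration in Python, not the actual LCFG tooling: profiles are modelled as flat dictionaries of resource names to values, and the function names, the `ignore` set and the sample resource names are all hypothetical.

```python
def profile_diff(old, new, ignore=frozenset()):
    """Return the resources that differ between two profiles.

    Profiles are modelled as flat dicts mapping resource names to string
    values (an assumption for illustration only). Keys listed in `ignore`
    are treated as insignificant and skipped.
    """
    diffs = {}
    for key in sorted(set(old) | set(new)):
        if key in ignore:
            continue
        old_val = old.get(key)  # None if the resource is absent
        new_val = new.get(key)
        if old_val != new_val:
            diffs[key] = (old_val, new_val)
    return diffs


def ready_to_switch(host_profiles, ignore=frozenset()):
    """True only when no host has a significant difference between its pair."""
    return all(
        not profile_diff(old, new, ignore)
        for old, new in host_profiles.values()
    )


# Hypothetical sample data: one host, old (rfe) versus new (svn) headers.
old_profile = {"openssh.port": "22", "profile.release": "rfe", "auth.users": "a b"}
new_profile = {"openssh.port": "22", "profile.release": "svn", "auth.users": "a b"}

print(profile_diff(old_profile, new_profile))
print(ready_to_switch({"host1": (old_profile, new_profile)},
                      ignore={"profile.release"}))
```

What counts as an insignificant difference is a judgement for the unit; here it is expressed simply as a set of resource names to skip.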
It is also planned to merge the FC5-64 and FC6 headers. This has been done at the LCFG level and eventually it will include the DICE level. In view of all this merging of headers it is important not to use default values via the C preprocessor macros as these will almost certainly be wrong; if anything is architecture specific then it should be defined architecture by architecture.
As mentioned above the test kernels can now be used on a host by including the live/test-kernel.h header.
There is now only one machine that still needs to be rebooted to pick up the glibc change.
The diydice server, dresden.inf, will be down for about half an hour this Thursday morning.
Research and Teaching.
Tim reported that all the beowulf clusters were now running FC5 as was the beowulf LCFG server. In doing these upgrades the only issue that Iain became aware of was that the submission of MPI jobs to GridEngine broke; this appears to have been caused by a change in behaviour of sshd between FC3 and FC5.
The Informatics database server is to be upgraded to FC5 and Ingres 2006 Community Edition tomorrow, which just leaves a couple of simics servers, copacabana and ipanema, which should be upgraded fairly soon.
Tim mentioned a problem he encountered when upgrading dendrite last week caused by inaccurate DNS addresses of one of the required services. It is conceivable that this was caused by a slightly out of date install CD.
One other big problem that was hit when clients were upgraded to use the most recent glibc under FC5 was that condor broke. It was necessary to install the more recent release of condor and then to stop and restart condor on each of the condor cluster machines. In order to reduce the risk of such an occurrence again the unit will be introducing a condor test pool which will be running the test release of the DICE distribution. Iain has enabled flocking to allow jobs to be migrated between condor pools. He will also disable the submission of jobs from condor cluster nodes apart from the master node, and access to the latter will be restricted to registered condor users.
They will be moving hawthorn.inf out to the Bush estate to connect it via fibre channel to the switch there allowing access to and control of part of the 150TB SRIF (Science Research Investment Fund) SAN (Storage Area Network). One of the Suns attached to the SAN will provide NFS access and hawthorn will be used eventually for GPFS (General Parallel File System) access.
Java 1.6 is now available on all clients (via alternatives), but the default remains version 1.5; one of the School's courses required this newer version.
The first stage of the work by Rosemary for the School RAE submission is now finished.
Services.
Another disk has failed in one of the SATABeasts at JCMB; its replacement has already arrived and will be installed by the end of the week. A cold spare will be ordered for this type of disk.
The mail server nutty was upgraded on Sunday. It took longer to do than Neil anticipated but this was partly to do with him being especially careful to make sure that all was well. The upgrade appears to have gone well.
There are still 9 servers to upgrade, including the 5 print servers. The unit will initially attempt to get LPRng working under FC5.
The unit has received an RT ticket requesting external svn access. Technically the Services Unit can't support this. They might look at using svn with http access as a possible future project (which would address this kind of problem).
The host drumelzier, which hosts the HealthAgents web site, was upgraded to FC5, but Neil was asked to help when they had difficulty in getting their web site running again; it is based on zope and plone. It will be suggested that the site be brought back up on an FC3 server until the FC5 problems can be solved.
User Support.
Ken reported that of the 49 servers managed by the unit all but bu.inf (fc3.login) had been upgraded to FC5. Morna would be checking who was still using bu and why.
All of the machines that needed to be rebooted, either to pick up the most recent glibc or to make sure that the buggy kernel module for the video card on GX260s was not loaded, have now been rebooted apart from one server tammy (another server, alina, is requesting a reboot but Roger believes that this is unnecessary because it was triggered by a modification to fstab and the partition has already been manually mounted).
Since the last Operational Meeting the User Support Unit had handled 215 new RT tickets (equivalent to about 22 per working day) and resolved 60% of them. There had been a total of 204 tickets (including both new and existing tickets) resolved over the same period.
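The figures above can be cross-checked with a little arithmetic. The sketch below uses only the numbers quoted in the paragraph; the variable names are illustrative, and because the 60% figure is rounded, the derived counts are approximate.

```python
# Figures quoted in the minutes.
new_tickets = 215
per_working_day = 22            # "about 22 per working day"
resolved_fraction_of_new = 0.60
total_resolved = 204            # new and existing tickets combined

# Implied length of the reporting period, in working days.
working_days = new_tickets / per_working_day

# Approximate number of the new tickets that were resolved.
new_resolved = round(new_tickets * resolved_fraction_of_new)

# The remaining resolutions must have been tickets opened before the period.
older_resolved = total_resolved - new_resolved

print(f"working days: about {working_days:.1f}")
print(f"new tickets resolved: about {new_resolved}")
print(f"older tickets resolved: about {older_resolved}")
```

On these figures the period covers roughly ten working days, with about 129 of the new tickets resolved and about 75 resolutions coming from tickets that already existed.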
The rack-mount server dolly.inf has been hosting the devproj site since last Tuesday. Ken initially overlooked including the ipfilter.h header file in the devproj.h header file, which prevented external access to the devproj server, but this was fixed on Monday night. This prompted some discussion between Neil and Stephen about how live headers should be used.
Morna reported some of her initial impressions from interviewing members of the administrative, teaching and research staff in preparation for the design of the User Support/Services Units survey of users regarding publishing and discussion media. She was struck by the number of people who appear to be using externally provided services instead of services provided by us. As an example she mentioned the DTC, which needs a content management system for its web site and has been considering going to an external provider that rents out space for £50 per year. There are members of staff who are running wikis on external provider sites. There are legal and security issues involved but it is not yet clear how these are perceived by the people using these external providers. One downside of staff going to external providers is the impact on 'branding': it doesn't look good if the impression is given that we can't or won't provide these services ourselves.
Neil asked whether the unit was aware of any performance problems with RT. It was agreed that there was some evidence for such a problem although there was no consensus as to what the symptoms were.
Stephen asked about computer accounts of former computing staff. These will be locked if they are not already.
AOCB
There was none.
Unless explicitly stated otherwise, all material is copyright The University of Edinburgh.