Condor is a system for harvesting spare CPU cycles from standard desktop machines by scheduling jobs onto machines that are idle. There are three classes of machine in the condor universe: master nodes, submit nodes and execute nodes. The master node assigns jobs to execute nodes based on which execute nodes are idle, the priority of the job and the resource requirements of the job. Submit nodes accept jobs from the user and pass information about each job to the master node for scheduling; they then contact the nominated execute host and pass the job on to it. Submit nodes also maintain a shadow process which communicates with the running job and handles tasks such as checkpointing. Execute hosts run the jobs that have been passed to them by the submit host.
An individual computer can act as any combination of the three classes of machine, but there can only be one master node per condor pool. Two or more pools can be flocked together, which allows jobs queued on a busy pool to migrate and run on execute hosts in another pool.
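As a rough illustration of how a master node might pair jobs with idle execute nodes, the following Python sketch matches queued jobs to machines by priority and a single resource requirement. It is a toy model and not Condor's own matchmaking algorithm: the Job and ExecuteNode classes, the memory-only requirement, the "larger number is more urgent" priority and the first-fit policy are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ExecuteNode:
    name: str        # hypothetical host name
    idle: bool       # True when nobody is using the machine
    memory_mb: int   # memory the node can offer a job


@dataclass
class Job:
    job_id: int
    priority: int    # assumption: a larger number means more urgent
    memory_mb: int   # the only resource requirement modelled here


def assign_jobs(jobs, nodes):
    """Return {job_id: node name}, giving each idle node at most one job."""
    free = [node for node in nodes if node.idle]
    assignments = {}
    # Consider the most urgent jobs first.
    for job in sorted(jobs, key=lambda j: j.priority, reverse=True):
        for node in free:
            if node.memory_mb >= job.memory_mb:
                assignments[job.job_id] = node.name
                free.remove(node)  # safe: we stop iterating straight away
                break
    return assignments
```

Jobs that find no suitable idle node are simply left out of the result; in a real pool they would stay queued, or could run elsewhere if the pools are flocked.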
The department has two condor pools, atpool and forumpool, with corresponding masters focke.inf.ed.ac.uk and wulf.inf.ed.ac.uk. The two pools are flocked to provide a certain amount of redundancy in case one of the master nodes falls over. Currently any authorised user can submit jobs to any host; however, we may impose restrictions or priorities on certain machines depending on where the funding for the hardware comes from.
We are starting to roll out condor in the labs. Condor is enabled in the <dice/studentlabs-....h > headers by including the appropriate pool header as detailed below.
Because it is difficult to integrate condor and DICE authorisation mechanisms, the lab machines are not enabled as submit hosts. Users will not be able to submit jobs from lab machines, and any undergraduates given access to the system will have to submit jobs from the cluster head nodes (focke and wulf).
The condor component compares the sysinfo.location entry with the list of rooms in condor.quietlabs; if the location matches any of the rooms, the component stops any condor processes running on the host. We should not be running condor in any room designated as "quiet".
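The quiet-room check can be pictured with a short Python sketch. This is not the component's actual code: it assumes that condor.quietlabs holds a whitespace-separated list of room names and that a room matches only when the names are identical, and the room names in the usage line are made up.

```python
def should_stop_condor(location, quietlabs):
    """Return True if the host's location is one of the quiet rooms.

    location  -- value of sysinfo.location for this host
    quietlabs -- value of condor.quietlabs (assumed space-separated)
    """
    quiet_rooms = set(quietlabs.split())
    return location.strip() in quiet_rooms


# Hypothetical room names, for illustration only.
print(should_stop_condor("AT-5.05", "AT-5.05 FH-1.B"))
```

Comparing whole names rather than prefixes means a location of "AT-5" would not match a quiet room called "AT-5.05" under these assumptions.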
Please add any additional information you think may be useful.
Forum based hosts should have
#include <dice/options/condor-forumpool.h>
added to their profile, whilst hosts at other sites should have
#include <dice/options/condor-atpool.h>
added to theirs. All hosts will pick up
#include <live/condor.h>
from these pool header files. Once the profile has been updated on the client, run om updaterpms run followed by om condor start. If all has gone well, within a minute or two the host will have registered itself with the master, and running condor_status should produce a listing of all the hosts in that pool. It may take longer (up to half an hour) for the new node to appear in the list of available condor nodes.
Condor's built-in access control mechanisms don't mesh well with ours, and it is more flexible to use the auth.users mechanism to allow access to the submit hosts. Currently the default access based on roles is as follows:
Condor can be configured on a host-by-host basis to only allow jobs to run at certain times of the day. This is controlled by the condor.wheesht, condor.startwheesht and condor.stopwheesht resources.
NB condor.startwheesht and condor.stopwheesht are specified in seconds after midnight.
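Since the start and stop values are given in seconds after midnight, a small helper can save some mental arithmetic. The Python sketch below converts a 24-hour clock time into a value suitable for condor.startwheesht or condor.stopwheesht; the function name and the "HH:MM" input format are assumptions for illustration and are not part of the condor component.

```python
def seconds_after_midnight(hhmm):
    """Convert a 24-hour 'HH:MM' string to seconds after midnight."""
    hours, minutes = (int(part) for part in hhmm.split(":"))
    if not (0 <= hours < 24 and 0 <= minutes < 60):
        raise ValueError(f"not a valid time of day: {hhmm!r}")
    return hours * 3600 + minutes * 60


print(seconds_after_midnight("08:30"))  # 30600
print(seconds_after_midnight("18:00"))  # 64800
```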
There are very few housekeeping issues; generally the pools can be left to get on with things themselves. Occasionally you may have to log onto one of the pool masters and remove or reschedule jobs. If a DICE box is removed from the pool, the cluster will drop it from its database after 5 minutes or so. If it returns, it will automatically be added back to the list of available machines.
Useful commands include:
Condor currently has issues with LDAP. If the PC hosting the job runs low on memory, it is likely that the slapd database will become corrupted and uid lookups will fail. The main symptom is that a user is unable to log into the DICE box in question. The main solution is to log onto the machine as root and restart the ldap component, forcing it to repopulate the LDAP database (see man lcfg-openldap for details).
Please contact us with any comments or corrections.
Unless explicitly stated otherwise, all material is copyright The University of Edinburgh