The Dice Project


Informatics Condor Pools

Overview

Condor is a system for harvesting spare CPU cycles from standard desktop machines by scheduling jobs onto machines that are idle. There are three classes of machine in the Condor universe: master nodes, submit nodes and execute nodes. The master node assigns jobs to execute nodes based on which execute nodes are idle, the priority of the job and the resource requirements of the job. Submit nodes accept jobs from the user and pass information about each job to the master node for scheduling; they then contact the nominated execute host and pass the job on to it. Submit nodes also maintain a shadow process, which communicates with the running job and handles certain tasks such as checkpointing. Execute hosts run the jobs that have been passed to them by the submit host. An individual computer can act as any combination of the three classes of machine, but there can only be one master node per Condor pool. Two or more pools can be flocked together, which allows jobs queued on a busy pool to migrate to, and run on, execute hosts in another pool.
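
As a rough illustration of the scheduling idea described above, the following toy sketch places queued jobs on idle machines according to job priority and memory requirement. It is not Condor's actual matchmaking algorithm. The Machine and Job classes, their fields, the match_jobs function and the "higher number wins" priority rule are all hypothetical, invented for this example.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    name: str
    idle: bool
    memory_mb: int


@dataclass
class Job:
    job_id: str
    priority: int   # assumption: a higher value is scheduled first
    memory_mb: int  # memory the job needs


def match_jobs(jobs, machines):
    """Return {job_id: machine name} for every job that can be placed."""
    free = [m for m in machines if m.idle]
    assignments = {}
    # Consider the highest priority jobs first.
    for job in sorted(jobs, key=lambda j: j.priority, reverse=True):
        # Only idle machines with enough memory are candidates.
        candidates = [m for m in free if m.memory_mb >= job.memory_mb]
        if not candidates:
            continue  # nothing suitable, so the job stays queued
        # Use the smallest machine that fits, leaving larger ones free.
        chosen = min(candidates, key=lambda m: m.memory_mb)
        assignments[job.job_id] = chosen.name
        free.remove(chosen)
    return assignments
```

A job that finds no suitable idle machine is simply left out of the result, which corresponds to it remaining in the queue.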

The Department's Condor pools

The department has two Condor pools, atpool and forumpool, with corresponding masters focke.inf.ed.ac.uk and wulf.inf.ed.ac.uk. The two pools are flocked to provide some redundancy in case one of the master nodes fails. Currently any authorised user can submit jobs to any host; however, we may impose restrictions or priorities on certain machines depending on where the funding for the hardware comes from.

Condor and Student labs

We are starting to roll out Condor in the labs. Condor is enabled in the <dice/studentlabs-....h > headers by including the appropriate pool header, as detailed below.

Because it is difficult to integrate Condor with the DICE authorisation mechanisms, the lab machines are not enabled as submit hosts. Users cannot submit jobs from lab machines, and any undergraduates given access to the system must submit jobs from the cluster head nodes (focke and wulf).

Quiet labs

The condor component compares the sysinfo.location entry with the list of rooms in condor.quietlabs; if the location matches any of the rooms, the component stops any Condor processes that are running. We should not be running Condor in any room designated as "quiet".
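
The check described above can be sketched as follows. This is only an illustration of the rule, not the component's real code: it assumes condor.quietlabs holds a whitespace-separated list of room names and that a match means an exact string match, neither of which is confirmed here. The function name is_quiet_lab and the room names in the usage note are hypothetical.

```python
def is_quiet_lab(location, quietlabs):
    """Return True if location matches a room listed in quietlabs.

    Assumptions (not confirmed by the component itself):
      - quietlabs is a whitespace-separated list of room names
      - a match is an exact, case-sensitive string comparison
    """
    rooms = quietlabs.split()
    return location.strip() in rooms
```

For example, with the made-up value "room-a room-b" in condor.quietlabs, a host whose sysinfo.location is "room-b" would be treated as being in a quiet lab and its Condor processes would be stopped.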

Admin Information

Please add any additional information you think may be useful.

Adding hosts

Forum-based hosts should have

#include <dice/options/condor-forumpool.h>
added to their profile, whilst hosts at other sites should have
#include <dice/options/condor-atpool.h>
added to theirs. All hosts will pick up
#include <live/condor.h>
from these pool header files. Once the profile has been updated on the client, run om updaterpms run followed by om condor start. If all has gone well, the host will register itself with the master within a minute or two, and running condor_status should produce a listing of all the hosts in that pool. It may take longer (up to half an hour) for the new node to appear in the list of available Condor nodes.
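
The procedure above can be summarised as a small checklist generator. This sketch only builds the list of steps as text; it does not edit any profile or run any command. The function names pool_include and enrolment_steps, and the in_forum flag, are hypothetical helpers introduced for illustration.

```python
FORUM_HEADER = "#include <dice/options/condor-forumpool.h>"
AT_HEADER = "#include <dice/options/condor-atpool.h>"


def pool_include(in_forum):
    """Return the pool header line for a host, based on its site."""
    return FORUM_HEADER if in_forum else AT_HEADER


def enrolment_steps(in_forum):
    """Return the steps for adding a host to its pool, in order."""
    return [
        "add to profile: " + pool_include(in_forum),
        "om updaterpms run",   # install the packages
        "om condor start",     # start the condor component
        "condor_status",       # check the host has registered
    ]


if __name__ == "__main__":
    for step in enrolment_steps(in_forum=True):
        print(step)
```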

User Access

Condor's built-in access control mechanisms don't mesh well with ours, and it's more flexible to use the auth.users mechanism to allow access to the submit hosts. Currently the default access based on roles is as follows:

Academic Staff
Access to all School machines
Research Staff
Access to all School machines
PhD Students
Access to all School machines
MSc Students
No access to School machines by default (access requires an application supported by the supervisor or course leader)
Undergraduate students
No access to School machines by default (access requires an application supported by the supervisor or course leader)

Access can be granted by editing the live/condor-masternode.h header file and updating the auth.users resource.

Limiting the hours condor is active

Condor can be configured on a host-by-host basis to allow jobs to run only at certain times of the day. This is controlled by the condor.wheesht, condor.startwheesht and condor.stopwheesht resources.

NB condor.startwheesht and condor.stopwheesht are specified in seconds after midnight.
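
Since these two resources take a number of seconds rather than a clock time, a small conversion helper can avoid arithmetic slips. The function name seconds_after_midnight and the HH:MM input format are choices made for this sketch, not part of the condor component.

```python
def seconds_after_midnight(hhmm):
    """Convert a 24-hour 'HH:MM' time to seconds after midnight."""
    hours_text, minutes_text = hhmm.split(":")
    hours, minutes = int(hours_text), int(minutes_text)
    if not (0 <= hours < 24 and 0 <= minutes < 60):
        raise ValueError("not a valid 24-hour time: " + hhmm)
    return hours * 3600 + minutes * 60


if __name__ == "__main__":
    print(seconds_after_midnight("08:30"))  # 30600
    print(seconds_after_midnight("18:00"))  # 64800
```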

Support Issues

Housekeeping

There are very few housekeeping issues; generally the pools can be left to get on with things themselves. Occasionally you may have to log onto one of the pool masters and remove or reschedule jobs. If a DICE box is removed from the pool, the cluster will drop it from its database after 5 minutes or so; if it returns, it will automatically be added back to the list of available machines.

Useful commands include:

condor_status
Used to monitor and query the Condor pool. By default this displays a list of the hosts currently in the pool, the platform they are running and information about their current state.
condor_q
Displays information about jobs in the queue. NB each Condor node has a separate queue, so condor_q displays only the jobs submitted from the local host. To display information about all the queues in the pool, use the -global option, which displays the queue of each host in turn. condor_q -better-analyze can be very effective in working out why jobs can't get a host to run on, or why they're running on the host they are.
condor_vacate
Causes Condor to checkpoint any running jobs on a set of machines and forces the jobs to vacate those machines. Each job remains in the queue of the host it was submitted from.
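
If these commands need to be called from a script, a thin wrapper along the following lines may be useful. It is a minimal sketch: the helper name run_if_installed is hypothetical, and the wrapper returns None rather than failing when the Condor tools are not on the PATH (for example on a host outside the pool). Only the command names and options already mentioned above are used.

```python
import shutil
import subprocess

# Command lines taken from the descriptions above.
POOL_STATUS = ["condor_status"]
GLOBAL_QUEUE = ["condor_q", "-global"]
ANALYZE_QUEUE = ["condor_q", "-better-analyze"]


def run_if_installed(argv):
    """Run argv and return its standard output as text.

    Returns None when the program named in argv[0] is not installed.
    """
    if shutil.which(argv[0]) is None:
        return None
    result = subprocess.run(argv, capture_output=True, text=True, check=False)
    return result.stdout


if __name__ == "__main__":
    output = run_if_installed(GLOBAL_QUEUE)
    print(output if output is not None else "condor_q not found on this host")
```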

Memory Limitations

Condor currently has issues with LDAP: when the PC hosting the job runs low on memory, it is likely that the slapd database will become corrupted and uid lookups will fail. The main symptom is that users are unable to log into the DICE box in question. The usual fix is to log onto the machine as root and restart the ldap component, forcing it to repopulate the LDAP database (see man lcfg-openldap for details).


Please contact us with any comments or corrections.
Unless explicitly stated otherwise, all material is copyright The University of Edinburgh