High-Level Overview of the Condor Configuration on Nemo
Highlights of the current Condor configuration on the Nemo cluster include:
- Higher-memory jobs: For jobs needing more than 1004 MB of memory,
we have configured nodes 2-60 so that only a single Condor slot (previously
known as a virtual machine) is available on each node. This is accomplished by setting
NUM_CPUS = 1
in the local Condor configuration file of each of these nodes. The
single slot then advertises 2008 MB of memory in its ClassAd.
Additionally, in the /etc/init.d/condor file used to start the
condor_master process and all of its children (including user jobs), we include the following command for nodes 2-60:
ulimit -v 2056192
This limits jobs started by Condor to 2056192 KB (2008 MB) of virtual memory,
since 2056192 = 2008 * 1024.
Because only a single Condor slot is configured on these nodes, one of
the two cores on each CPU is not being utilized efficiently.
Users who want their jobs to run only on these nodes must add the following
line to their Condor submit file:
requirements = Memory > 1004
- Standard jobs: Jobs requiring no more than 1004 MB of memory can use any of
nodes 2-780. In particular, nodes 61-780 are configured with two identical
Condor slots per CPU (one per core), each
advertising 1004 MB of memory. This is accomplished by setting the following (we override
the default explicitly so that it can be changed more easily later if desired):
VIRTUAL_MACHINE_TYPE_1 = cpu=1, mem=1004
VIRTUAL_MACHINE_TYPE_2 = cpu=1, mem=1004
NUM_VIRTUAL_MACHINES_TYPE_1 = 1
NUM_VIRTUAL_MACHINES_TYPE_2 = 1
Additionally, in the /etc/init.d/condor file used to start
the condor_master process and all of its children (including user jobs),
we include the following command for nodes 61-780:
ulimit -v 1048576
This limits jobs started by Condor to 1048576 KB (1024 MB) of virtual memory.
Note that 1048576 > 1004 * 1024 = 1028096; the extra 20480 KB (20 MB) of
headroom tolerates jobs that go slightly, but not far, over the 1004 MB limit.
- Non-Condor node for heavy serial work: Node s0001 is not part
of the Condor pool. It is available to users who need to perform
intensive serial tasks (such as heavy compiling). Running such tasks on
s0001 rather than on the head nodes keeps the head nodes responsive, so that
other users can still do simple tasks like logging in and running condor_submit.
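The memory arithmetic for the two node groups, and the effect of the requirements = Memory > 1004 line, can be checked with a small Python sketch. It is purely illustrative: the NodeClass type and the matching_classes helper are hypothetical names, not part of Condor or of the Nemo setup, and the matching function models only that single numeric comparison, not full ClassAd matchmaking.

```python
# Illustrative sketch only: re-derives the memory numbers described above.
# NodeClass and matching_classes are hypothetical helpers, not Condor APIs,
# and matching_classes models only the single `Memory > N` comparison.
from dataclasses import dataclass

KB_PER_MB = 1024  # `ulimit -v` takes its argument in KB (1024-byte units)


@dataclass
class NodeClass:
    name: str
    first_node: int
    last_node: int
    slots_per_node: int
    slot_memory_mb: int  # Memory advertised in each slot's ClassAd
    ulimit_v_kb: int     # value passed to `ulimit -v` in /etc/init.d/condor

    def headroom_kb(self) -> int:
        """KB the ulimit allows beyond the advertised slot memory."""
        return self.ulimit_v_kb - self.slot_memory_mb * KB_PER_MB


NODE_CLASSES = [
    NodeClass("high-memory", 2, 60, 1, 2008, 2056192),
    NodeClass("standard", 61, 780, 2, 1004, 1048576),
]


def matching_classes(min_memory_mb: int) -> list[str]:
    """Node classes whose slots satisfy `requirements = Memory > min_memory_mb`."""
    return [nc.name for nc in NODE_CLASSES if nc.slot_memory_mb > min_memory_mb]


if __name__ == "__main__":
    for nc in NODE_CLASSES:
        print(f"{nc.name}: nodes {nc.first_node}-{nc.last_node}, "
              f"{nc.slots_per_node} slot(s) per node, "
              f"headroom {nc.headroom_kb()} KB")
    print("Memory > 1004 matches:", matching_classes(1004))
```

Running it reports 0 KB of headroom for the high-memory nodes (the ulimit equals the advertised 2008 MB exactly) and 20480 KB for the standard nodes, and shows that only the high-memory slots satisfy Memory > 1004, because the standard slots advertise exactly 1004 MB.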