CGT Data Cluster
General
CGT's Data Cluster consists of 16 compute nodes and one head node (dc-user2.isi.edu).
Jobs are managed by LSF and should be submitted through the Globus gatekeeper
running on the head node. The compute nodes have the following characteristics:
- RedHat 9.0 (same as dc-user2)
- Dual Intel PIII 550MHz CPUs
- 1.5 GB RAM
- Gigabit network connection
- 85 GB of /scratch space
- Access to Action home directories and other Action NFS file systems
Monitoring
A GRIS is running on the standard port on dc-user2. An LSF GRAM reporter
is installed and provides information on jobs in the queueing system.
Ganglia metrics (such as CPU, memory and network utilization) are also
available at http://dc-master.isi.edu/ganglia/
Accessing the Cluster
dc-user2 is a general purpose machine where jobs can be compiled and tested.
The head node, dc-user2.isi.edu, is accessible through ssh, but access is
restricted to CGT. Like any ISI host, dc-user2 only accepts ssh public key
authentication. ISI accounts are not enabled on dc-user2 by default, so if
you need access, send an email to cgt-support@isi.edu.
GridFTP
For now, each node is running GridFTP. This will probably not be the case
in the future, because the GridFTP daemon and its bandwidth usage may
interfere with the job that has been scheduled on the node. Please do not
write jobs that assume GridFTP is running on the compute nodes.
Queues for Running Jobs
The following table describes the available queues:
| low | Low priority jobs. Use this queue for jobs that will run for a longer time, for example overnight. |
| normal | Normal priority jobs. Use this queue for most jobs. This is the default queue. |
| high | High priority jobs. Access is restricted. |
| exclusive | Jobs that need exclusive access to the node, as opposed to being scheduled for maximum throughput. This queue could be useful for running benchmark applications. |
Use dc-user2.isi.edu/jobmanager-lsf as a resource string to submit
jobs to the cluster.
Condor-G example job:
# always use the globus universe with condor-g
universe = globus
# use the full path to the executable
executable = /nfs/asd/rynge/jobs/myjob.sh
# do not transfer the executable - if this is true
# the executable will be transferred from the current
# directory
transfer_executable = false
# this specifies where the job should be submitted
# and what jobmanager to use
globusscheduler = dc-user2.isi.edu/jobmanager-lsf
# additional globus rsl
globusrsl = (jobtype=multiple)(count=4)(queue=normal)
# specify the output files
output = job.$(cluster).out
error = job.$(cluster).err
log = job.$(cluster).log
# now queue it
queue
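To send the same kind of job to a different queue or with a different process count, the submit file can be generated instead of edited by hand. The following is only a sketch: write_submit_file is a hypothetical helper (not part of Condor-G or Globus), and myjob.sub is an assumed file name. The submit file keys themselves are the ones used in the example above.

```shell
#!/bin/sh
# write_submit_file is a hypothetical helper for this sketch.
# Usage: write_submit_file EXECUTABLE QUEUE COUNT
# It prints a Condor-G submit description on standard output.
write_submit_file() {
    # \$ keeps $(cluster) literal so that Condor, not the shell, expands it.
    cat <<EOF
universe = globus
executable = $1
transfer_executable = false
globusscheduler = dc-user2.isi.edu/jobmanager-lsf
globusrsl = (jobtype=multiple)(count=$3)(queue=$2)
output = job.\$(cluster).out
error = job.\$(cluster).err
log = job.\$(cluster).log
queue
EOF
}

# Two processes in the low priority queue; myjob.sub is an assumed name.
write_submit_file /nfs/asd/rynge/jobs/myjob.sh low 2 > myjob.sub

# On a host with Condor-G installed, the file would then be submitted with:
#   condor_submit myjob.sub
```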
Available software
Software common to the whole cluster is installed in /cluster. By sourcing
the setup.{sh|csh} files, you will get the right environment for that piece
of software. There are also 'default' symlinks to the most current version.
For example, to use Globus with a Bourne shell, run:
. /cluster/globus/default/setup.sh
or, for csh:
source /cluster/globus/default/setup.csh
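In job scripts it can be useful to fail early when a package is missing rather than run with a half-configured environment. The sketch below assumes that every package follows the /cluster/<package>/default/setup.sh layout shown for Globus; CLUSTER_ROOT and load_cluster_env are hypothetical names introduced here for illustration.

```shell
#!/bin/sh
# CLUSTER_ROOT and load_cluster_env are hypothetical names for this sketch.
# The layout <root>/<package>/default/setup.sh is assumed from the Globus
# example; other packages may differ.
CLUSTER_ROOT=${CLUSTER_ROOT:-/cluster}

load_cluster_env() {
    setup="$CLUSTER_ROOT/$1/default/setup.sh"
    if [ -r "$setup" ]; then
        # Source the file so its variables land in the current shell.
        . "$setup"
    else
        echo "no setup.sh found for '$1' under $CLUSTER_ROOT" >&2
        return 1
    fi
}

# Example call (only succeeds on a machine that mounts /cluster):
#   load_cluster_env globus || exit 1
```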
What CAs are trusted?
We trust most of the common ones. Feel free to suggest one if it is
missing. The trusted CAs are:
- Globus Alliance
- USC
- UK eScience
- NPACI
- NCSA
- DOE
- TACC
How do I get into the grid-mapfile?
Send an email with your certificate subject to cgt-support@isi.edu.
How do I make jobs run on specific nodes in the cluster?
We have added a 'host_selection' RSL parameter that maps to the LSF
bsub -m host selection option.
The value is a list of hostnames separated
by spaces.
For example:
globusrun -o -r dc-user2.isi.edu/jobmanager-lsf \
'&(queue=exclusive)(count=2) \
(host_selection="dc-n12 dc-n13") \
(executable=/bin/hostname)'
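When the node list changes from run to run, the RSL string can be assembled by a small shell function. This is a sketch only: build_rsl is a hypothetical helper, not a Globus command. It uses the queue, count, host_selection and executable attributes from the example above and nothing else.

```shell
#!/bin/sh
# build_rsl is a hypothetical helper, not part of Globus.
# Usage: build_rsl QUEUE COUNT EXECUTABLE [HOST...]
build_rsl() {
    queue=$1
    count=$2
    exe=$3
    shift 3
    rsl="&(queue=$queue)(count=$count)"
    if [ "$#" -gt 0 ]; then
        # "$*" joins the remaining arguments with spaces, which is the
        # separator host_selection expects.
        rsl="$rsl(host_selection=\"$*\")"
    fi
    printf '%s\n' "$rsl(executable=$exe)"
}

# Prints the RSL for two processes pinned to dc-n12 and dc-n13.
build_rsl exclusive 2 /bin/hostname dc-n12 dc-n13

# On the cluster the result would be passed to globusrun, for example:
#   globusrun -o -r dc-user2.isi.edu/jobmanager-lsf \
#     "$(build_rsl exclusive 2 /bin/hostname dc-n12 dc-n13)"
```

Leaving out the host arguments omits host_selection entirely, so LSF is free to pick the nodes.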
Why do short jobs get so poor performance?
If a job only runs for a couple of minutes, Globus and scheduling
overhead make up a large part of its total run time.
Try to group smaller tasks into one longer job.
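One way to group tasks is a wrapper script that runs them one after another inside a single submitted job, so the Globus and scheduling overhead is paid once. The following is a minimal sketch: run_batch is a hypothetical helper, and the echo commands are placeholders for real tasks.

```shell
#!/bin/sh
# run_batch is a hypothetical helper: it runs each argument as one shell
# command, in order, and reports how many of them failed.
run_batch() {
    failed=0
    for task in "$@"; do
        if ! sh -c "$task"; then
            echo "task failed: $task" >&2
            failed=$((failed + 1))
        fi
    done
    echo "$# tasks run, $failed failed"
    # Return non-zero if any task failed, so the job's exit status shows it.
    [ "$failed" -eq 0 ]
}

# Placeholder tasks; replace them with the real short commands.
run_batch "echo task 1" "echo task 2" "echo task 3"
```

A script like this would be used as the executable of one job in place of submitting each task on its own. A failing task does not stop the remaining ones, which is a design choice that may not suit tasks that depend on each other.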
Can I have SSH access to the compute nodes?
No!
I want a Globus gatekeeper on each node. How do I do that?
Here is an example of how to run personal gatekeepers on the cluster.
To start the gatekeepers, run:
globusrun -o -r dc-user2.isi.edu/jobmanager-lsf \
-f /nfs/asd/rynge/jobs/personal-gatekeeper/job.rsl
When submitting jobs to the gatekeepers, you must specify the subject
you are expecting from the resource. For a personal gatekeeper this is
the subject of your user certificate, not that of the host certificate.
For example:
globusrun -a -r \
dc-n2.isi.edu::'/C=US/O=NPACI/OU=SDSC/CN=Mats Rynge/USERID=ux454281'
See job.sh and job.rsl in /nfs/asd/rynge/jobs/personal-gatekeeper/ for more information.
The only argument to job.sh is the number of
seconds you want to leave the gatekeepers running.