Sungrid Use

Basic Commands

Our machines are set up as a grid managed by a batch queue system. You must run jobs on one of the sungrid machines (use qconf -sh to see them). You can check your job status with:

qstat

You can delete jobs with:

qdel

where you specify either a job ID or "-u yourid" to delete all of your jobs.

To submit jobs, you use qsub as shown later.

Common Options

-S /bin/sh

When submitting a script (that is, you did not specify -b y), this selects the interpreter used to run it.

-m es

Mail the user when the job ends or is suspended.

-M user@soe.ucsc.edu

The email address to send mail to.

-cwd

Run the job from the current working directory.

-o soc_fpu.log

The file that receives the job's standard output.

-e soc_fpu.err

The file that receives the job's standard error.

-N "fpu SOC P&R"

A name for the job.

-b y

Specify that the executable is a binary and not a script.

-V

Export the environment variables of the submitting shell to the job, so that both see the same environment.

Submitting Jobs

A submitted job sources your ".profile" rather than your ".bashrc" for configuration, because it runs in a "non-interactive" shell.

You can submit jobs in one of two ways:

Command Line

qsub <options> -b y <binary> -- <options for the binary>

Example:

qsub -S /bin/sh -m es -M myid@soe.ucsc.edu -cwd -b y -N "myid_hard_area_FP" \
-e hard_results/n10.err -o hard_results/n10.log ../simanneal/simanneal  -- -i ../benchmarks/hard/n10 -o hard_results/n10

Sometimes, I have found that the "--" isn't needed... I'm not sure why.

Script

Instead of specifying all the options on the command line, you can specify them in a script like this:

#!/bin/sh
#$ -S /bin/sh
#$ -m es
#$ -M myid@soe.ucsc.edu
#$ -cwd
#$ -o hard_results/n10.log
#$ -e hard_results/n10.err
#$ -N "myid_hard_area_FP"
../simanneal/simanneal -i ../benchmarks/hard/n10 -o hard_results/n10 
exit 0
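
When the same program has to run over several inputs, a small wrapper can emit one such script per input. The following is only a minimal sketch under stated assumptions: the function name make_job_script and the generated job name are hypothetical, and it reuses the directory layout and mail domain from the example above.

```shell
#!/bin/sh
# Hypothetical helper: print a job script for one benchmark to stdout.
# $1 = benchmark name (e.g. n10), $2 = user id used for mail and job name.
make_job_script() {
    bench=$1
    user=$2
    # Unquoted here-document: ${bench} and ${user} are expanded, while \$
    # keeps the literal "$" of the "#$" option prefix.
    cat <<EOF
#!/bin/sh
#\$ -S /bin/sh
#\$ -m es
#\$ -M ${user}@soe.ucsc.edu
#\$ -cwd
#\$ -o hard_results/${bench}.log
#\$ -e hard_results/${bench}.err
#\$ -N "${user}_hard_${bench}"
../simanneal/simanneal -i ../benchmarks/hard/${bench} -o hard_results/${bench}
exit 0
EOF
}

# Print the script for the n10 benchmark so it can be inspected.
make_job_script n10 myid

# Usage sketch (n30 and n50 are placeholder benchmark names):
#   for b in n10 n30 n50; do
#       make_job_script "$b" myid > "job_$b.sh" && qsub "job_$b.sh"
#   done
```

Printing the script to stdout keeps the helper free of side effects; redirect it to a file only once the output looks right.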

Restricting to the same machines

Often for papers, you will want to restrict your simulations to identical machines (e.g., to compare run times). For this, we have a hostgroup containing all of the mada servers (mada1-7). You can submit to these machines with:

qsub -q '*@@mada' <other arguments>

You can also specify other requirements, such as available memory, with the "-l" option.
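
As an illustration, a hostgroup restriction and a resource request can be combined in one submission. This is only a sketch: it assumes the cluster exposes the standard Grid Engine mem_free resource, the 4G value is arbitrary, and the helper name mada_qsub_cmd is hypothetical. The function prints the command instead of running it, so it can be checked before use.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) a qsub command restricted to the
# mada hostgroup with a minimum free-memory request.
# $1 = memory request (e.g. 4G), remaining arguments = other qsub arguments.
mada_qsub_cmd() {
    mem=$1
    shift
    # The queue pattern is quoted so the shell does not expand the "*".
    # mem_free is assumed to be configured as a requestable resource.
    printf "qsub -q '*@@mada' -l mem_free=%s %s\n" "$mem" "$*"
}

# Dry run: show the command that would be submitted.
mada_qsub_cmd 4G -cwd -b y ../simanneal/simanneal
```

Which resource names are actually requestable depends on the cluster configuration, so check with the administrators before relying on a particular "-l" request.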

To list all submit hosts:

qconf -ss

To list all cluster queues:

qconf -sql

Administration

Sometimes the machines will get into an error state (E):

qstat -f

will show

----------------------------------------------------------------------------
all.q@mada3.cse.ucsc.edu       BIP   0/4       0.00     lx24-amd64    E

You can diagnose the error by typing:

qstat -explain E
----------------------------------------------------------------------------
all.q@mada3.cse.ucsc.edu       BIP   0/4       0.00     lx24-amd64    E
       queue all.q marked QERROR as result of job 18526's failure at host mada3.cse.ucsc.edu

Then you can clear the error state of a single queue instance with:

qmod -cq all.q@mada3.cse.ucsc.edu

or all queues with:

qmod -c '*'
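
To spot queue instances in the error state without reading the full listing, the qstat -f output can be filtered. This is a minimal sketch, assuming the state flags appear in the sixth column as in the listing above; the function name error_queues is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read "qstat -f" output on stdin and print the names of
# queue instances whose state column (assumed to be field 6) contains "E".
error_queues() {
    # Queue-instance lines are recognised by the "@" in their first field;
    # separator lines and job lines are skipped.
    awk '$1 ~ /@/ && $6 ~ /E/ { print $1 }'
}

# Demonstration on the sample line from this page. On the cluster one would
# run: qstat -f | error_queues
error_queues <<'EOF'
----------------------------------------------------------------------------
all.q@mada3.cse.ucsc.edu       BIP   0/4       0.00     lx24-amd64    E
EOF
```

Each name printed this way can be passed to qmod -cq once the cause of the failure has been checked with qstat -explain E.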