Sungrid Use
Latest revision as of 06:50, 12 July 2012
Basic Commands
Our machines are set up as a grid managed by a batch queue system. You must run the job on one of the sungrid machines (run qconf -sh to see them). You can check your job status with:
qstat
You can delete jobs with:
qdel
where you give either a specific job ID or "-u yourid" to delete all of your jobs.
To submit jobs, you use qsub as shown later.
Common Options
-S /bin/sh
When you submit a script (that is, you did not specify -b y), this specifies which interpreter to use.
-m es
Mail the user when the job ends or is suspended.
-M user@soe.ucsc.edu
The email address to send the notifications to.
-cwd
Run the job from the current working directory.
-o soc_fpu.log
The file that receives the job's standard output.
-e soc_fpu.err
The file that receives the job's standard error.
-N "fpu SOC P&R"
A name for the job.
-b y
Specify that the executable is a binary and not a script.
-V
Export all environment variables of the submitting shell to the job, so that the job and the shell see the same environment.
Submitting Jobs
When you submit a job, it runs in a "non-interactive" shell, so it sources your ".profile" instead of your ".bashrc" for configuration.
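If your settings live in ".bashrc", one common workaround is to have ".profile" pull that file in. The following is only a minimal sketch: the helper name source_if_present is hypothetical, and it assumes your ".bashrc" contains nothing that /bin/sh cannot parse.

```shell
# Hypothetical helper for ~/.profile: source a file only if it exists,
# so that a missing file does not produce an error in the job.
source_if_present() {
    if [ -f "$1" ]; then
        . "$1"
    fi
}

# In ~/.profile you would then call (not run here):
#   source_if_present "$HOME/.bashrc"
```

With this in place, batch jobs and interactive shells should pick up the same settings, provided the file is readable on the execution host.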
You can submit jobs in one of two ways:
Command Line
qsub <options> -b y <binary> -- <options for the binary>
Example:
qsub -S /bin/sh -m es -M myid@soe.ucsc.edu -cwd -b y -N "myid_hard_area_FP" \
    -e hard_results/n10.err -o hard_results/n10.log ../simanneal/simanneal -- -i ../benchmarks/hard/n10 -o hard_results/n10
Sometimes, I have found that the "--" isn't needed... I'm not sure why.
Script
Instead of specifying all the options on the command line, you can specify them in a script like this:
#!/bin/sh
#$ -S /bin/sh
#$ -m es
#$ -M myid
#$ -cwd
#$ -o hard_results/n10.log
#$ -e hard_results/n10.err
#$ -N "myid_hard_area_FP"
../simanneal/simanneal -i ../benchmarks/hard/n10 -o hard_results/n10
exit 0
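When the same job has to run on several benchmarks, it can help to generate one script per benchmark rather than editing the file by hand. The sketch below is one possible way to do that. The function name make_job_script and the per-benchmark job name suffix are assumptions made for illustration; the paths and options simply mirror the script above.

```shell
# make_job_script NAME: print a job script for benchmark NAME on stdout.
# Hypothetical helper; only the n10 benchmark appears in the example above.
make_job_script() {
    name=$1
    # "\$" keeps the literal "#$" prefix that qsub looks for.
    cat <<EOF
#!/bin/sh
#\$ -S /bin/sh
#\$ -cwd
#\$ -o hard_results/$name.log
#\$ -e hard_results/$name.err
#\$ -N "myid_hard_area_FP_$name"
../simanneal/simanneal -i ../benchmarks/hard/$name -o hard_results/$name
exit 0
EOF
}

# Example (not run here): write the script to a file, then submit it.
#   make_job_script n10 > n10.sh
#   qsub n10.sh
```

Because the options are embedded in the script, the submission itself needs no further arguments.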
Restricting to the same machines
Often for papers, you will want to restrict your simulations to identical machines (e.g., to compare run times). To make this possible, we have a hostgroup for all of the mada servers (mada1-7). You can submit to these machines with:
qsub -q '*@@mada' <other arguments>
You can also specify other requirements, such as available memory, with the "-l" option.
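As a rough illustration of combining the hostgroup with a resource request, the sketch below wraps qsub in a small function. The function name submit_mada and the DRY_RUN switch are hypothetical, and mem_free is used on the assumption that this resource is requestable on the cluster; check the local configuration before relying on it.

```shell
# Hypothetical wrapper: submit to the mada hostgroup with a memory request.
# Usage: submit_mada MEM <other qsub arguments>
# Setting DRY_RUN=1 prints the command instead of running it.
submit_mada() {
    mem=$1
    shift
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo qsub -q "*@@mada" -l "mem_free=$mem" "$@"
    else
        # The queue pattern is quoted so the shell does not expand the "*".
        qsub -q "*@@mada" -l "mem_free=$mem" "$@"
    fi
}

# Example (not run here):
#   submit_mada 2G n10.sh
```

Running with DRY_RUN=1 first is a cheap way to check the assembled command line before anything is queued.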
To list all submit hosts:
qconf -ss
To list the cluster queues:
qconf -sql
Administration
Sometimes a machine will get into an error state (E):
qstat -f
will show
----------------------------------------------------------------------------
all.q@mada3.cse.ucsc.edu BIP 0/4 0.00 lx24-amd64 E
You can diagnose the error by typing:
qstat -explain E
----------------------------------------------------------------------------
all.q@mada3.cse.ucsc.edu BIP 0/4 0.00 lx24-amd64 E
queue all.q marked QERROR as result of job 18526's failure at host mada3.cse.ucsc.edu
You can then force-clear the error state of a single queue instance with:
qmod -cq all.q@mada3.cse.ucsc.edu
or clear the error state of all queues with:
qmod -c '*'
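If you would rather clear only the queue instances that are actually in the error state, you can filter the qstat -f output first. The sketch below is one way to do that. The function name error_queues is hypothetical, and it assumes the column layout shown above, where the state is the sixth and last field and is absent for healthy queues.

```shell
# Print the names of queue instances whose state column contains "E".
# Reads `qstat -f` output on stdin. Header, separator and job lines are
# skipped because their first field does not contain "@".
error_queues() {
    awk 'NF >= 6 && $1 ~ /@/ && $NF ~ /E/ { print $1 }'
}

# Example (not run here): clear each queue instance in the error state.
#   qstat -f | error_queues | while read -r q; do qmod -cq "$q"; done
```

It is still worth checking the reason with qstat -explain E before clearing, since the underlying problem may come back with the next job.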