Submitting Jobs using PBS on Linux Clusters
On the Linux clusters, job submission is performed through PBS. You can find information about the following topics here.
- Submitting a Batch Job
- Submitting Multiple Dependent Jobs
- Interactive Parallel Sessions
- Useful PBS Commands
List of useful PBS directives and their meaning (a sample job-script header combining several of them follows this list):
- #PBS -q queuename: Submit job to the queuename queue.
- Allowed values for queuename: single, workq, checkpt.
- Depending on the cluster, additional allowed values are gpu, lasigma, mwfa, and bigmem.
- #PBS -A allocationname: Charge jobs to your allocation named allocationname.
- #PBS -l walltime=hh:mm:ss: Request resources to run job for hh hours, mm minutes and ss seconds.
- #PBS -l nodes=m:ppn=n: Request m nodes with n processors per node.
- #PBS -N jobname: Give your job the name jobname so it can be identified when monitoring it with the qstat command.
- #PBS -o filename.out: Write PBS standard output to file filename.out.
- #PBS -e filename.err: Write PBS standard error to file filename.err.
- #PBS -j oe: Combine PBS standard output and standard error into the same file. Note that with this directive you need only one of #PBS -o or #PBS -e, not both.
- #PBS -m status: Send an email when the job reaches status status. Allowed values for status are:
- a: when job aborts
- b: when job begins
- e: when job ends
- These values can be combined; for example, abe will send email when the job begins and again when it aborts or ends.
- #PBS -M your email address: The address to which email is sent when the -m directive above is triggered.
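As an illustration, a minimal job-script header combining several of these directives might look like the following sketch (the queue, allocation name, and email address are placeholders you must replace with your own values):
#!/bin/bash
#PBS -q checkpt
#PBS -A my_allocation_code
#PBS -l nodes=2:ppn=4
#PBS -l walltime=02:00:00
#PBS -N example_job
#PBS -o example_job.out
#PBS -j oe
#PBS -m abe
#PBS -M user@example.edu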
List of useful PBS environment variables and their meaning (a short usage sketch follows this list):
- PBS_O_WORKDIR: Directory where the qsub command was executed
- PBS_NODEFILE: Name of the file that contains a list of the HOSTS provided for the job
- PBS_JOBID: Job ID number given to this job
- PBS_QUEUE: Queue job is running in
- PBS_WALLTIME: Requested walltime, in seconds
- PBS_JOBNAME: Name of the job. This can be set using the -N option in the PBS script
- PBS_ENVIRONMENT: Indicates job type, PBS_BATCH or PBS_INTERACTIVE
- PBS_O_SHELL: value of the SHELL variable in the environment in which qsub was executed
- PBS_O_HOME: Home directory of the user running qsub
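As a quick sketch of how these variables are typically used inside a bash job script:
cd $PBS_O_WORKDIR                   # move to the directory where qsub was executed
echo "Job $PBS_JOBID ($PBS_JOBNAME) is running in queue $PBS_QUEUE"
NPROCS=$(wc -l < $PBS_NODEFILE)     # count the hosts assigned to the job
echo "Using $NPROCS processors for up to $PBS_WALLTIME seconds"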
Submitting a Batch Job
The current batch job manager on Dell Linux clusters is PBS. To send a batch job to PBS, users need to write a script that is readable by PBS to specify their needs. A PBS script is basically a shell script which contains embedded information for PBS. The PBS information takes the form of special comment lines which start with #PBS and continue with PBS-specific options.
Two example scripts, with comments, illustrate how this is done. To set the context, we'll assume the user name is myName, and the script file is named myJob.
1. A Serial Job Script (One Process)
To run a serial job with PBS, you might create a bash shell script named myJob with the following contents:
#!/bin/bash
#
# All PBS instructions must come at the beginning of the script, before
# any executable commands occur.
#
# Start by selecting the "single" queue, and providing an allocation code.
#
#PBS -q single
#PBS -A your_allocation_code
#
# To run a serial job, a single node with one process is required.
#
#PBS -l nodes=1:ppn=1
#
# We then indicate how long the job should be allowed to run in terms of
# wall-clock time. The job will be killed if it tries to run longer than this.
#
#PBS -l walltime=00:10:00
#
# Tell PBS the name of a file to write standard output to, and that standard
# error should be merged into standard output.
#
#PBS -o /scratch/myName/serial/output
#PBS -j oe
#
# Give the job a name so it can be found readily with qstat.
#
#PBS -N MySerialJob
#
# That is it for PBS instructions. The rest of the file is a shell script.
#
# PLEASE ADOPT THE EXECUTION SCHEME USED HERE IN YOUR OWN PBS SCRIPTS:
#
# 1. Copy the necessary files from your home directory to your scratch directory.
# 2. Execute in your scratch directory.
# 3. Copy any necessary files back to your home directory.
# Let's mark the time things get started with a date-time stamp.
date
# Set some handy environment variables.
export HOME_DIR=/home/myName/serial
export WORK_DIR=/scratch/myName/serial
# Make sure the WORK_DIR exists:
mkdir -p $WORK_DIR
# Copy files, jump to WORK_DIR, and execute a program called "demo"
cp $HOME_DIR/demo $WORK_DIR
cd $WORK_DIR
./demo
# Mark the time it finishes.
date
# And we're out'a here!
exit 0
Once the contents of myJob meet your requirements, the job can be submitted with the qsub command like so:
qsub myJob
2. A Parallel Job Script (Multiple Processes)
To run a parallel job, you would follow much the same process as in the previous example. This time your file myJob would contain:
#!/bin/bash
#
# Use "workq" as the job queue, and specify the allocation code.
#
#PBS -q workq
#PBS -A your_allocation_code
#
# Assuming you want to run 16 processes, and each node supports 4 processes,
# you need to ask for a total of 4 nodes. The number of processes per node
# will vary from machine to machine, so double-check that you have the right
# values before submitting the job.
#
#PBS -l nodes=4:ppn=4
#
# Set the maximum wall-clock time. In this case, 10 minutes.
#
#PBS -l walltime=00:10:00
#
# Specify the name of a file which will receive all standard output,
# and merge standard error with standard output.
#
#PBS -o /scratch/myName/parallel/output
#PBS -j oe
#
# Give the job a name so it can be easily tracked with qstat.
#
#PBS -N MyParJob
#
# That is it for PBS instructions. The rest of the file is a shell script.
#
# PLEASE ADOPT THE EXECUTION SCHEME USED HERE IN YOUR OWN PBS SCRIPTS:
#
# 1. Copy the necessary files from your home directory to your scratch directory.
# 2. Execute in your scratch directory.
# 3. Copy any necessary files back to your home directory.
# Let's mark the time things get started.
date
# Set some handy environment variables.
export HOME_DIR=/home/$USER/parallel
export WORK_DIR=/scratch/myName/parallel
# Set a variable that will be used to tell MPI how many processes will be run.
# This makes sure MPI gets the same information provided to PBS above.
export NPROCS=`wc -l $PBS_NODEFILE |gawk '//{print $1}'`
# Copy the files, jump to WORK_DIR, and execute! The program is named "hydro".
cp $HOME_DIR/hydro $WORK_DIR
cd $WORK_DIR
mpirun -machinefile $PBS_NODEFILE -np $NPROCS $WORK_DIR/hydro
# Mark the time processing ends.
date
# And we're out'a here!
exit 0
Once the file myJob contains all the information for the desired parallel run, it can be submitted with qsub, just as before:
qsub myJob
3. Shell Environment Variables
Users with more experience writing shell scripts can take advantage of additional shell environment variables which are set by PBS when the job begins to execute. Those interested are directed to the qsub man page for a list and descriptions.
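As a small illustration (a sketch only; consult the man page for the authoritative list), a bash job script can branch on the job type reported by PBS_ENVIRONMENT:
if [ "$PBS_ENVIRONMENT" = "PBS_INTERACTIVE" ]; then
    echo "Interactive job $PBS_JOBID running on $(hostname)"
else
    echo "Batch job $PBS_JOBID submitted from $PBS_O_HOST"
fi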
4. Last line issue in PBS job script
Due to a PBS scheduler issue, please always make sure there is a newline at the end of your job script; otherwise the last command line of the script may be ignored by the scheduler. For example, the line myjob.exe in the job script below will be ignored by the PBS scheduler.
#!/bin/bash
#PBS -l nodes=1:ppn=20
#PBS -l walltime=1:00:00
#PBS -q workq
#PBS -A allocation_name
myjob.exe(END_OF_FILE)
Instead, adding a newline at the end of the file resolves the issue:
#!/bin/bash
#PBS -l nodes=1:ppn=20
#PBS -l walltime=1:00:00
#PBS -q workq
#PBS -A allocation_name
myjob.exe
(END_OF_FILE)
Users may direct questions to sys-help@loni.org.
PBS Job Chains and Dependencies
Quite often, a single simulation requires multiple long runs which must be processed in sequence. One method for creating such a sequence of batch jobs is to have each job execute the "qsub" or "llsubmit" command to submit its successor. We strongly discourage such recursive, or "self-submitting," scripts, since for some jobs chaining is not reliable: when a job hits its time limit, the batch system kills it, and the command that would submit the subsequent job is never processed.
In PBS, you can use the "qsub -W depend=..." option to create dependencies between jobs.
qsub -W depend=afterok:<Job-ID> <QSUB SCRIPT>
Here, the batch script <QSUB SCRIPT> will not start until the job <Job-ID> has completed successfully. Useful options to "depend=..." are:
- afterok:<Job-ID> Job is scheduled if the Job <Job-ID> exits without errors or is successfully completed.
- afternotok:<Job-ID> Job is scheduled if the Job <Job-ID> exited with errors.
- afterany:<Job-ID> Job is scheduled if the Job <Job-ID> exits with or without errors.
One method to simplify this process is to write multiple batch scripts (job1.pbs, job2.pbs, job3.pbs, etc.) and submit them using the following script:
#!/bin/bash
FIRST=$(qsub job1.pbs)
echo $FIRST
SECOND=$(qsub -W depend=afterany:$FIRST job2.pbs)
echo $SECOND
THIRD=$(qsub -W depend=afterany:$SECOND job3.pbs)
echo $THIRD
Modify the script according to the number of chained jobs required. The job <$FIRST> will be placed in the queue, while the jobs <$SECOND> and <$THIRD> will be placed on batch hold with the "Not Queued" (NQ) flag. When <$FIRST> completes, the NQ flag on the next job is replaced with the "Queued" (Q) flag and that job is moved to the active queue.
A few words of caution: if you list the dependency as "afterok" (or "afternotok") and your job exits with errors (or without errors, respectively), the subsequent jobs will be killed due to "dependency not met".
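To confirm that a dependency was registered on a held job, you can inspect its full status with qstat -f (the exact formatting of the depend attribute may vary with the PBS/Torque version):
qstat -f $SECOND | grep -i depend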
Users may direct questions to sys-help@loni.org.
Interactive Parallel Sessions
An interactive session is a set of compute nodes on which you can interact manually (via a shell, etc.) with your programs while taking advantage of dedicated processors/nodes. This is useful for development, debugging, testing, and running long sequential jobs. The following is meant as a quick guide to obtaining such a session on the various LSU/LONI resources:
Note 1: these methods should work on all the Linux clusters at LONI/LSU, but the host name (tezpur.hpc.lsu.edu is used as the host name in the following) must be changed to reflect the machine being used. The same is true of the ppn= (processors per node) value (e.g., QB2 would be ppn=20).
Note 2: the commands below conform to bash shell syntax. Your mileage may differ if you use a different shell.
Note 3: this method requires opening 2 terminal windows.
1. Interactive Method
1. In the terminal 1 window, login to the head node
of the desired x86 Linux cluster:
ssh -XY username@tezpur.hpc.lsu.edu
2. Once logged onto the head node, the next step is to
reserve a set of nodes for interactive use. This is done by issuing
a qsub command similar to the following:
$ qsub -I -A allocation_account -V -l walltime=HH:MM:SS,nodes=NUM_NODEs:ppn=4
- HH:MM:SS - length of time you wish to use the nodes (resource
availability applies as usual).
- NUM_NODEs - the number of nodes you wish to have.
- ppn - must match the number of cores available per node (system
dependent).
You will likely have to wait a bit to get a node, and you will see a "waiting for job to start" message in the meantime. Once a prompt appears, the job has started.
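For example, a concrete request for one node with 4 processors for two hours might look like the following (hpc_myalloc is a placeholder allocation name):
$ qsub -I -A hpc_myalloc -V -l walltime=02:00:00,nodes=1:ppn=4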
3. After the job has started, the next step is to determine
which nodes have been reserved for you. To do this, examine the
contents of the node list file as set for you in
the PBS_NODEFILE environment variable by the PBS system. One
way to do this, and an example result, is:
$ printenv PBS_NODEFILE
/var/spool/torque/aux/xyz.tezpur2
Here xyz is a number representing the job number on tezpur.
4. Your terminal 1 session is now connected to the
rank 0, or primary, compute node. You should now determine its host
name:
$ hostname
tezpurIJK
Where IJK is a 3 digit number.
5. To actually begin using the node, repeat step 1 in a
second terminal, terminal 2. Once logged onto the head node,
connect from there to the node determined in step 4. The two steps
would look like:
On your client: $ ssh -XY username@tezpur.hpc.lsu.edu
On the headnode: $ ssh -XY tezpurIJK
You have two ways to approach the rest of this process, depending
on which terminal window you want to enter commands in.
2. Using Terminal 2
6. In terminal 2, set the environment variable PBS_NODEFILE to match what you found in step 3:
$ export PBS_NODEFILE=/var/spool/torque/aux/xyz.tezpur2
7. Now you are set to run any programs you wish,
using terminal 2 for your interactive session. All X11 windows
will be forwarded from the main compute node to your client PC for
viewing.
8. The "terminal 2" session can be terminated and
re-established, as needed, so long as the PBS job is still
running. Once the PBS job runs out of time, or the "terminal 1"
session exits, the reserved nodes will be released, and the process
must be repeated from step 1 to start another session.
3. Using Terminal 1
6. In terminal 2, determine the value of the environment variable DISPLAY, like so:
$ printenv DISPLAY
localhost:IJ.0
Here IJ is some set of digits.
7. Now, in terminal 1, set the environment variable DISPLAY to match:
$ export DISPLAY=localhost:IJ.0
8. At this point, use terminal 1 for your interactive
session commands; all X11 windows will be forwarded from the main
compute node to the client PC.
9. The "terminal 2" session can be terminated and
re-established, as needed, so long as the PBS job is still
running. Once the PBS job runs out of time, or the "terminal 1"
session exits, the reserved nodes will be released, and the process
must be repeated from step 1 to start another session.
4. The Batch Method
Sometimes an interactive session is not sufficient. In this case,
it is possible to latch on to a batch job submitted in the
traditional sense. This example shows how to reserve a set of nodes
via the batch scheduler. Interactive access to the machine, with a
properly set environment, is accomplished by taking the following
steps.
Note: this method only requires 1 terminal.
1. Login to the head node of the desired x86 Linux cluster:
$ ssh -XY username@tezpur.hpc.lsu.edu
2. Once on the head node, create a job script, calling it
something like interactive.pbs, containing the following. This
is a job that simply sleeps and wakes to spin time:
#!/bin/sh
#PBS -A allocation_account
echo "Changing to directory from which script was submitted."
cd $PBS_O_WORKDIR
# create bash/sh environment source file
H=`hostname`
# -- add host name as top line
echo "# main node: $H" > ${PBS_JOBID}.env.sh
# -- dump raw env
env | grep PBS >> ${PBS_JOBID}.env.sh
# -- cp raw to be used for csh/tcsh resource file
cp ${PBS_JOBID}.env.sh ${PBS_JOBID}.env.csh
# -- convert *.sh to sh/bash resource file
perl -pi -e 's/^PBS/export PBS/g' ${PBS_JOBID}.env.sh
# -- convert *.csh to csh/tcsh resource file
perl -pi -e 's/^PBS/setenv PBS/g' ${PBS_JOBID}.env.csh
perl -pi -e 's/=/ /g' ${PBS_JOBID}.env.csh
# -- entering into idle loop to keep job alive
while [ 1 ]; do
    sleep 10   # in seconds
    echo hi... > /dev/null
done
3. Submit the script saved in step #2:
$ qsub -V -l walltime=00:30:00,nodes=1:ppn=4 interactive.pbs
4. You can check when the job starts using qstat; once it does, the following happens:
- 2 files are created in the current directory that contain the
required environmental variables:
- <jobid>.env.sh
- <jobid>.env.csh
- the job is kept alive by the idle while loop
5. Determine the main compute node being used by the job by
inspecting the top line of either of the 2 environment files
$ head -n 1 <jobid>.env.sh
# main node: tezpurIJK
Where IJK is some set of digits.
6. Login to the host specified in step 5; and be sure to
note the directory from which the job was submitted:
$ ssh -XY tezpurIJK
7. Source the proper shell environment
$ . /path/to/<jobid>.env.sh
8. Ensure that all the PBS_* environment variables
are set. For example:
$ env | grep PBS
PBS_JOBNAME=dumpenv.pbs
PBS_ENVIRONMENT=PBS_BATCH
PBS_O_WORKDIR=/home/estrabd/xterm
PBS_TASKNUM=1
PBS_O_HOME=/home/estrabd
PBS_MOMPORT=15003
PBS_O_QUEUE=workq
PBS_O_LOGNAME=estrabd
PBS_O_LANG=en_US.UTF-8
PBS_JOBCOOKIE=B413DC38832A165BA0E8C5D2EC572F05
PBS_NODENUM=0
PBS_O_SHELL=/bin/bash
PBS_JOBID=9771.tezpur2
PBS_O_HOST=tezpur2
PBS_VNODENUM=0
PBS_QUEUE=workq
PBS_O_MAIL=/var/spool/mail/estrabd
PBS_NODEFILE=/var/spool/torque/aux//9771.tezpur2
PBS_O_PATH=... # not shown due to length
9. Now this terminal can be used for interactive commands;
all X11 windows will be forwarded from the main compute node to the
client PC
5. Notes and Links
The methods outlined above are particularly useful with the
debugging tutorial.
Users may direct questions to sys-help@loni.org.
Useful PBS Commands
1. qsub for submitting a job
The command qsub is used to send a batch job to PBS. The basic usage is
qsub pbs.script
where pbs.script is the script users write to specify their needs. qsub also accepts command-line arguments, which override those specified in the script. For example, the following command
qsub -A my_LONI_allocation2 myscript
will direct the system to charge SUs (service units) to the allocation my_LONI_allocation2 instead of the allocation specified in myscript.
2. qstat for checking job status
The command qstat is used to check the status of PBS jobs. The simplest usage is
qstat
which gives information similar to the following:
Job id Name User Time Use S Queue
------------------- ---------------- --------------- -------- - -----
2572.eric2 s13pic cott 00:00:00 R checkpt
2573.eric2 s13pib cott 00:00:00 R checkpt
2574.eric2 BHNS02_singleB palenzuela 0 Q checkpt
2575.eric2 BHNS02_singleC palenzuela 00:00:00 R checkpt
2576.eric2 BHNS02_singleE palenzuela 00:00:00 R checkpt
2577.eric2 BHNS02_singleF palenzuela 00:00:00 R checkpt
2578.eric2 BHNS02_singleD palenzuela 00:00:00 R checkpt
2580.eric2 s13pia cott 0 Q workq
The six columns show, from left to right, the ID of each job, its name, its owner, the CPU time consumed, its status (R means running, Q means queued), and the queue in which it resides.
qstat also accepts command-line arguments; for instance, the following usage gives more detailed information about jobs:
[ou@eric2 ~]$ qstat -a
eric2:
Req'd Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
-------------------- -------- -------- ---------- ------ ----- --- ------ ----- - -----
2572.eric2 cott checkpt s13pic 28632 6 1 -- 48:00 R 24:51
2573.eric2 cott checkpt s13pib 13753 6 1 -- 48:00 R 15:29
2574.eric2 palenzue checkpt BHNS02_sin -- 8 1 -- 48:00 Q --
2575.eric2 palenzue checkpt BHNS02_sin 10735 8 1 -- 48:00 R 08:04
2576.eric2 palenzue checkpt BHNS02_sin 30726 8 1 -- 48:00 R 07:52
2577.eric2 palenzue checkpt BHNS02_sin 24719 8 1 -- 48:00 R 07:51
2578.eric2 palenzue checkpt BHNS02_sin 23981 8 1 -- 48:00 R 07:31
2580.eric2 cott workq s13pia -- 6 1 -- 48:00 Q --
3. qdel for cancelling a job
To cancel a PBS job, enter the following command.
qdel job_id [job_id] ...
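For example, to cancel the two cott jobs shown in the qstat listing above:
qdel 2572.eric2 2573.eric2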
4. qfree to query free nodes in PBS
One useful command for users to schedule their jobs in an optimal way is "qfree", which shows free nodes in each queue. For example,
[ou@eric2 ~]$ qfree
PBS total nodes: 128, free: 14, busy: 111 *3, down: 3, use: 86%
PBS checkpt nodes: 128, free: 14, busy: 98
PBS workq nodes: 64, free: 14, busy: 10
PBS single nodes: 16, free: 14, busy: 1
(Highest priority job on queue workq will start in 6:47:09)
This shows that there are 14 free nodes in total, and that they are available to all three queues: checkpt, workq and single.
5. showstart for estimating the starting time for a job
The command showstart can be used to get a rough estimate of the start time of your job. The basic usage is
showstart job_id
The following shows a simple example:
[ou@eric2 ~]$ showstart 2928.eric2
job 2928 requires 16 procs for 1:00:00:00
Estimated Rsv based start in 7:28:18 on Wed Jun 27 16:46:21
Estimated Rsv based completion in 1:07:28:18 on Thu Jun 28 16:46:21
Best Partition: base
Users may direct questions to sys-help@loni.org.
The queuing system schedules jobs based on job priority, which takes into account several factors. Jobs with a higher priority are scheduled ahead of jobs with a lower priority. The scheduler also has a backfill capability for jobs that are short in duration or require only a small number of nodes: it starts such small jobs while waiting for the start time of a large job requiring many nodes. In determining which jobs to run first, Moab uses the following formula to calculate job priority:
Job priority = credential priority + fairshare priority + resource priority + service priority
(1) Credential Priority Subcomponent:
credential priority = credweight * (userweight * job.user.priority)
credential priority = 100 * (10 * 100) = 100000 (a constant)
(2) Fairshare Priority Subcomponent:
fairshare priority = fsweight * min(fscap, fsuserweight * DeltaUserFSUsage)
fairshare priority = 100 * (10 * DeltaUserFSUsage)
A user's fair share usage is the sum over the past seven days of the daily processor-seconds used, each weighted by a daily decay factor, divided by the corresponding weighted sum of the daily total processor-seconds used. The decay factor is 0.9. DeltaUserFSUsage is the fair share target percentage for each user (20 percent) minus the calculated fair share usage percentage; in other words, the target percentage minus the actual used percentage. For a user who has not used the cluster for a week:
fairshare priority = 100 * (10 * 20) = 20000
(3) Resource Priority Subcomponent:
resource priority = resweight * min(rescap, procweight * TotalProcessorsRequested)
resource priority = 30 * min(3840, 10 * TotalProcessorsRequested)
For instance, for a 32 processor job:
resource priority = 30 * 10 * 32 = 9600
(4) Service Priority Subcomponent:
service priority = serviceweight * (queuetimeweight * QUEUETIME + xfactorweight * XFACTOR)
service priority = 2 * (2 * QUEUETIME + 20 * XFACTOR)
QUEUETIME is the time the job has been queued, in minutes, and XFACTOR = 1 + QUEUETIME / WALLTIMELIMIT.
For a one hour job in the queue for one day:
service priority = 2 * (2 * 1440 + 20 * (1 + 1440 / 60))
service priority = 2 * (2880 + 500) = 6760
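Putting the four subcomponents together for the examples above (a 32-processor job with a one-hour walltime, queued for one day, and submitted by a user with no usage in the past week), the total priority would be roughly:
job priority = 100000 + 20000 + 9600 + 6760 = 136360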
These factors are adjusted as needed to make jobs of all sizes start fairly.