HPC@LSU | Documentation | Slurm Job Submission

Submitting Jobs using SLURM on Linux Clusters

SLURM (Simple Linux Utility for Resource Management) is an open-source, highly scalable cluster management and job scheduling system. It manages job scheduling on the new HPC and LONI clusters. It was originally created at the Livermore Computing Center and has grown into full-fledged open-source software backed by a large community, commercially supported by the original developers, and installed on many of the Top-500 supercomputers.

Information about the following topics can be found here:

Submitting batch script (single node)
Submitting batch script (multiple nodes)
Submitting interactive jobs
Jobs Using GPUs
Commonly used SLURM Commands
Job Templates for Serial and Parallel (Multi-Threaded and MPI) jobs

Serial Job
Shared Memory Parallelism (SMP) Jobs
MPI (Message Passing Interface) Job
Hybrid (MPI + SMP) Job

Submitting Multiple Dependent Jobs
PBS to SLURM Translation

Submitting batch script (single node)

To create a SLURM batch script, use your favorite editor (e.g. vi or emacs) to create a text file containing both SLURM directives and the commands that run your job. All SLURM directives (special instructions) are prefixed with #SBATCH. Below is an example of a SLURM batch job script:

 #!/bin/bash
 #SBATCH -N 1               # request one node
 #SBATCH -t 2:00:00	        # request two hours
 #SBATCH -p single          # in single partition (queue)
 #SBATCH -A your_allocation_name
 #SBATCH -o slurm-%j.out-%N # optional, name of the stdout, using the job number (%j) and the hostname of the node (%N)
 #SBATCH -e slurm-%j.err-%N # optional, name of the stderr, using job and hostname values
 # below are job commands
 date

 # Set some handy environment variables.

 export HOME_DIR=/home/$USER/myjob
 export WORK_DIR=/work/$USER/myjob
 
 # Make sure the WORK_DIR exists:
 mkdir -p $WORK_DIR
 # Copy files, jump to WORK_DIR, and execute a program called "mydemo"
 cp $HOME_DIR/mydemo $WORK_DIR
 cd $WORK_DIR
 ./mydemo
 # Mark the time it finishes.
 date
 # exit the job
 exit 0

To submit the job to the scheduler, save the script above as a text file (e.g., singlenode.sh), then submit it with the following command:

$ sbatch singlenode.sh

List of useful SLURM directives and their meaning:

#SBATCH -A allocationname: short for --account, charge jobs to your allocation named allocationname.
#SBATCH -J jobname: short for --job-name, name of the job.
#SBATCH -n : short for --ntasks, number of tasks (CPU cores) to run job on. The memory limit for jobs is 4 GB of MEM per CPU core requested.
#SBATCH -N : short for --nodes, number of nodes on which to run.
#SBATCH -c : short for --cpus-per-task, number of threads per process.
#SBATCH -p partition: short for --partition, submit job to the partition queue.
- Allowed values for partition: single, checkpt, workq, gpu, bigmem.
- Depending on the cluster, additional partitions can be found via the sinfo command.
#SBATCH -t hh:mm:ss: short for --time, request resources to run job for hh hours, mm minutes and ss seconds.
#SBATCH -o filename.out: short for --output, write standard output to file filename.out.
#SBATCH -e filename.err: short for --error, write standard error to file filename.err.
- Note that by default, SLURM will merge standard error and standard output into one file if no "-o" or "-e" flag is set.
#SBATCH --mail-user your@email.address: Address to send email to when the --mail-type directive below is triggered.
#SBATCH --mail-type type: Send an email when job status type occurs. Common values for type include BEGIN, END, FAIL or ALL. The arguments can be combined, e.g., BEGIN,END will send email when the job begins and ends.
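
Because the memory limit scales with the number of cores requested, the value passed to -n also determines how much memory a job may use. The following is a minimal sketch of that arithmetic, assuming the 4 GB per core figure given in the -n entry above applies to the partition in use. The helper name job_mem_limit_gb is hypothetical and not part of SLURM.

```shell
#!/bin/bash
# job_mem_limit_gb NTASKS
# Hypothetical helper: memory limit in GB for a job requesting NTASKS cores,
# assuming the 4 GB per core rule from the -n description above.
job_mem_limit_gb() {
    local ntasks="$1"
    local gb_per_core=4
    echo $(( ntasks * gb_per_core ))
}

echo "A job with -n 8 may use up to $(job_mem_limit_gb 8) GB of memory"
```

If a job needs more memory than this, one option under the same assumption is to request more cores with -n even if not all of them are used for computation.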

List of commonly used SLURM environment variables and their meaning:

SLURM_JOBID: Job ID number given to this job
SLURM_JOB_NODELIST: List of nodes allocated to the job
SLURM_SUBMIT_DIR: Directory where the sbatch command was executed
SLURM_NNODES: Total number of nodes in the job's resource allocation.
SLURM_NTASKS: Total number of CPU cores requested in a job.
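
Inside a job script these variables can be read like any other shell variable. The sketch below prints a short summary of the job. The function name job_summary and the "not set" fallback text are illustrative choices, not part of SLURM; the fallback only appears when the script is run outside of a job.

```shell
#!/bin/bash
# job_summary: print the SLURM environment variables listed above.
# Hypothetical helper; "not set" is printed for any variable that is
# missing, e.g. when the script runs outside of a SLURM job.
job_summary() {
    echo "Job ID:     ${SLURM_JOBID:-not set}"
    echo "Node list:  ${SLURM_JOB_NODELIST:-not set}"
    echo "Submit dir: ${SLURM_SUBMIT_DIR:-not set}"
    echo "Node count: ${SLURM_NNODES:-not set}"
    echo "Task count: ${SLURM_NTASKS:-not set}"
}

job_summary
```

Calling job_summary near the top of a job script records the allocation details in the job's output file, which can help when comparing runs later.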

Submitting batch script (multiple nodes)

A multiple-node job script is very similar to a single-node job script; the main difference is that it requests more than one node. Below is an example of a multiple-node batch job script:

 #!/bin/bash
 #SBATCH -N 2                	# request two nodes
 #SBATCH -n 16 		       	# specify 16 MPI processes (8 per node)
 #SBATCH -c 6			# specify 6 threads per process
 #SBATCH -t 2:00:00
 #SBATCH -p checkpt
 #SBATCH -A your_allocation_name
 #SBATCH -o slurm-%j.out-%N # optional, name of the stdout, using the job number (%j) and the first node (%N)
 #SBATCH -e slurm-%j.err-%N # optional, name of the stderr, using job and first node values
 # below are job commands
 date

 # Set some handy environment variables.

 export HOME_DIR=/home/$USER/myjob
 export WORK_DIR=/work/$USER/myjob
 
 # load appropriate modules, in this case Intel compilers, MPICH
 module load mpich/3.1.4/INTEL-15.0.3
 # Make sure the WORK_DIR exists:
 mkdir -p $WORK_DIR
 # Copy files, jump to WORK_DIR, and execute a program called "my_mpi_demo"
 cp $HOME_DIR/my_mpi_demo $WORK_DIR
 cd $WORK_DIR
 srun -N2 -n16 -c6 ./my_mpi_demo # Launch the MPI application on two nodes with 16 MPI processes in total (8 per node) and 6 threads per MPI process.
 # Mark the time it finishes.
 date
 # exit the job
 exit 0

Note: in the example above, the srun command is used to launch the MPI application. This is the default way to launch MPI applications.

The syntax for the srun command is:

srun <flags> <name of the MPI executable>

Some useful flags are:

-N: number of nodes
-n: total number of MPI processes
-c: number of threads per MPI process
-u: turn on unbuffered output (the output from MPI processes will be flushed to stdout as soon as it's generated); without this flag, Slurm will buffer and rearrange the output according to the MPI ranks.
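
As a sanity check on these flags, the number of cores a launch uses is the total number of MPI processes multiplied by the threads per process. The sketch below does that arithmetic for the #SBATCH header of the multiple-node example (-N 2, -n 16, -c 6). The helper name cores_per_node is hypothetical, and it assumes the processes are spread evenly over the nodes.

```shell
#!/bin/bash
# cores_per_node NODES NTASKS CPUS_PER_TASK
# Hypothetical helper: cores used on each node, assuming the tasks are
# distributed evenly across the nodes.
cores_per_node() {
    local nodes="$1" ntasks="$2" cpus_per_task="$3"
    echo $(( ntasks * cpus_per_task / nodes ))
}

# Values taken from the multiple-node example: -N 2, -n 16, -c 6
echo "cores per node: $(cores_per_node 2 16 6)"
```

Comparing the result with the number of cores physically available on a node is a quick way to spot a request that oversubscribes or underuses the nodes.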

Submitting interactive jobs

To start an interactive job, use the salloc command similar to the example below:

 salloc -t 1:00:00 -n8 -N1 -A your_allocation_name -p single

As in the batch job script, -n8 requests 8 tasks (cores) and -N1 requests 1 compute node. The same command with long-form options is:

 salloc --time=1:00:00 --ntasks=8 --nodes=1 --account=your_allocation_name --partition=single

Note:

If an interactive job session is submitted to a partition other than single, the -n or --ntasks flag will be ignored and one or more entire nodes will be allocated to the job.
We recommend passing your allocation name (-A your_allocation_name) to the salloc command so that the scheduler uses the proper allocation.

Jobs Using GPUs

For jobs using GPUs, the number of GPU devices must be explicitly specified using the “--gres=gpu:” flag.

Requesting One GPU

If a job cannot use multiple GPU devices efficiently, or if you are running a test job, request one GPU. In this case, the job will share a node with other jobs.

For an interactive session requesting one GPU:

salloc -t hh:mm:ss -N1 -n16 --gres=gpu:1 -p gpu_partition_name -A your_allocation_name

For a batch job requesting one GPU:

#!/bin/bash
#SBATCH -N 1
#SBATCH -n 16
#SBATCH -t hh:mm:ss
#SBATCH -p gpu_partition_name
#SBATCH --gres=gpu:1
#SBATCH -A your_allocation_name

commands to run

Please note that the valid value for the number of tasks ("-n") is between 1 and (total number of CPU cores on the node)/(total number of GPUs on the node). For instance, if a job requests one GPU on a node with 64 cores and 4 GPUs, the valid value for "-n" is from 1 to 64/4=16.

Requesting More Than One GPU (But Less Than A Node)

Users can request more than one GPU on a node (e.g. 2 or 3 GPUs on a node with 4 GPUs). In this case, the job will also share a node with other jobs.

For an interactive session requesting multiple GPUs:

salloc -t hh:mm:ss -N1 -n32 --gres=gpu:2 -p gpu_partition_name -A your_allocation_name

For a batch job requesting multiple GPUs:

#!/bin/bash
#SBATCH -N 1
#SBATCH -n 32
#SBATCH -t hh:mm:ss
#SBATCH -p gpu_partition_name
#SBATCH --gres=gpu:2
#SBATCH -A your_allocation_name

commands to run

Please note that the valid value for the number of tasks ("-n") is between 1 and (number of GPUs requested)*(total number of CPU cores on the node)/(total number of GPUs on the node). For instance, if a job requests 2 GPUs on a node with 64 cores and 4 GPUs, the valid value for "-n" is from 1 to 2*64/4=32.
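
The upper bound on "-n" for one or several GPUs follows the same formula, so it can be computed with a few lines of shell arithmetic. This is a sketch only: the helper name max_tasks is hypothetical, and the 64-core, 4-GPU node is the example configuration used in the text, not necessarily the configuration of any particular cluster.

```shell
#!/bin/bash
# max_tasks CORES_ON_NODE GPUS_ON_NODE GPUS_REQUESTED
# Hypothetical helper: largest valid value for -n, computed as
# (GPUs requested) * (cores on the node) / (GPUs on the node).
max_tasks() {
    local cores="$1" gpus_on_node="$2" gpus_requested="$3"
    echo $(( gpus_requested * cores / gpus_on_node ))
}

echo "1 GPU:  -n may range from 1 to $(max_tasks 64 4 1)"
echo "2 GPUs: -n may range from 1 to $(max_tasks 64 4 2)"
```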

Requesting One GPU Node With All Its GPUs

For an interactive session requesting one GPU node with all its GPUs (either 2 or 4, depending on the node configuration):

salloc -t hh:mm:ss -N1 -n64 --gres=gpu:number_of_gpus -p gpu_partition_name -A your_allocation_name

For a batch job requesting one GPU node and all its GPUs:

#!/bin/bash
#SBATCH -N 1
#SBATCH -n 64
#SBATCH -t hh:mm:ss
#SBATCH -p gpu_partition_name
#SBATCH --gres=gpu:number_of_gpus
#SBATCH -A your_allocation_name

commands to run

Requesting Multiple GPU Nodes With All Their GPUs

For an interactive session requesting multiple GPU nodes with all their GPUs (either 2 or 4, depending on the node configuration):

salloc -t hh:mm:ss -N number_of_gpu_nodes --gres=gpu:number_of_gpus -p gpu_partition_name -A your_allocation_name

For a batch job requesting multiple GPU nodes with all their GPUs:

#!/bin/bash
#SBATCH -N number_of_gpu_nodes
#SBATCH -t hh:mm:ss
#SBATCH -p gpu_partition_name
#SBATCH --gres=gpu:number_of_gpus
#SBATCH -A your_allocation_name

commands to run

Please note that the value of "number_of_gpus" is the number of GPUs PER NODE, not the total number of GPUs that will be allocated to the job. For instance, when requesting 2 nodes with 4 GPUs on each node, the flag should be "--gres=gpu:4".
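
To make the per-node meaning concrete, the total number of GPUs a job receives is the node count multiplied by the value given to --gres. A small sketch of that calculation follows; total_gpus is a hypothetical helper name.

```shell
#!/bin/bash
# total_gpus NODES GPUS_PER_NODE
# Hypothetical helper: total GPUs allocated when --gres=gpu:GPUS_PER_NODE
# is combined with -N NODES.
total_gpus() {
    local nodes="$1" gpus_per_node="$2"
    echo $(( nodes * gpus_per_node ))
}

nodes=2
gpus_per_node=4
echo "-N ${nodes} --gres=gpu:${gpus_per_node} allocates $(total_gpus "$nodes" "$gpus_per_node") GPUs in total"
```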

Commonly used SLURM Commands

squeue is used to show the partition (queue) status. Useful options:
- -l ("l" for "long"): gives more verbose information
- -u someusername: limit output to jobs by username
- --state=pending: limit output to pending (i.e. queued) jobs
- --state=running: limit output to running jobs
Below is an example to query all jobs submitted by the current user (fchen14):
```
[fchen14@philip2 ~]$ squeue -u $USER
     JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
       340   checkpt     bash  fchen14  R    1:06:59      1 philip002
       339   checkpt     bash  fchen14  R    1:07:09      1 philip001
```
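
The squeue output is plain text, so it can be post-processed with standard tools. The sketch below counts jobs per state (the ST column) with awk. The helper name count_by_state is hypothetical, and the approach assumes the default column layout shown above with no spaces inside the job name.

```shell
#!/bin/bash
# count_by_state: read squeue-style output on stdin and print the number
# of jobs in each state. The state is the fifth column (ST); the header
# line is skipped. Hypothetical helper, assumes the default squeue layout.
count_by_state() {
    awk 'NR > 1 { count[$5]++ } END { for (s in count) print s, count[s] }' | sort
}

# Sample input copied from the squeue example above.
count_by_state <<'EOF'
     JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
       340   checkpt     bash  fchen14  R    1:06:59      1 philip002
       339   checkpt     bash  fchen14  R    1:07:09      1 philip001
EOF
```

In a real session the input would come from squeue itself, for example squeue -u $USER | count_by_state.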

sinfo is used to view information about SLURM nodes and partitions. Typical usage:

[fchen14@philip001 test]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug        up   infinite      3   idle philip[026-027,032]
checkpt*     up 3-00:00:00      2  alloc philip[001-002]
checkpt*     up 3-00:00:00     23   idle philip[003-025]
single       up 7-00:00:00      2  alloc philip[001-002]
single       up 7-00:00:00     23   idle philip[003-025]
bigmem       up 7-00:00:00      2   idle philip[033-034]

scancel is used to signal or cancel jobs. Typical usage with squeue:

[fchen14@philip1 ~]$ squeue -u fchen14
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               341   checkpt     bash  fchen14  R       0:13      1 philip001
               340   checkpt     bash  fchen14  R    1:50:57      1 philip002
# cancel (delete) job with JOBID 340			   
[fchen14@philip1 ~]$ scancel 340
# job status might display a temporary "CG" ("CompletinG") status immediately after scancel
[fchen14@philip1 ~]$ squeue -u fchen14 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               340   checkpt     bash  fchen14 CG    1:51:08      1 philip002
               341   checkpt     bash  fchen14  R       0:41      1 philip001
[fchen14@philip1 ~]$ squeue -u fchen14 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               341   checkpt     bash  fchen14  R       1:08      1 philip001

scontrol is used to view or modify SLURM configuration and state. Typical usage for the user is to check job status:

[fchen14@philip1 ~]$ squeue -u fchen14 # show all jobs
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               341   checkpt     bash  fchen14  R    1:29:20      1 philip001
[fchen14@philip1 ~]$ scontrol show job 341
JobId=341 JobName=bash
   UserId=fchen14(32584) GroupId=Admins(10000) MCS_label=N/A
   Priority=1 Nice=0 Account=hpc_hpcadmin6 QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=01:29:31 TimeLimit=12:00:00 TimeMin=N/A
   SubmitTime=2020-05-07T10:47:52 EligibleTime=2020-05-07T10:47:52
   AccrueTime=Unknown
   StartTime=2020-05-07T10:47:52 EndTime=2020-05-07T22:47:57 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-05-07T10:47:52
   Partition=checkpt AllocNode:Sid=philip1:28374
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=philip001
   BatchHost=philip001
   NumNodes=1 NumCPUs=8 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=8,mem=22332M,node=1,billing=8
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=22332M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=/bin/bash
   WorkDir=/home/fchen14/test
   Power=

More detailed information on the SLURM commands to schedule and monitor jobs can be found at Slurm on-line documentation.

Job Templates for Serial and Parallel (Multi-Threaded and MPI) jobs

Serial Job

#!/bin/bash
#SBATCH --job-name=serial_job_test    # Job name
#SBATCH --ntasks=1                    # Using a single core
#SBATCH --time=00:10:00               # Time limit hh:mm:ss
#SBATCH --output=serial_test_%j.log   # Standard output and error log

module load python

echo "Running job on a single CPU core"

python /home/user/single_core_job.py

date

Shared Memory Parallelism (SMP) Jobs

Shared-Memory Parallelism (SMP) is when the workload is shared among different CPU cores using multiple threads or processes running within a single compute node, and these cores have access to common (shared) memory. Applications that use OpenMP (Open Multi-Processing), pthreads, Python’s multiprocessing module, or R's mclapply all fall into this category. While they can use multiple cores, they cannot make use of multiple nodes, and all the cores must be physically located on the same node. When running SMP jobs, you must make the SMP application aware of how many cores to use. How that is done depends on the specific application:

OpenMP applications check the OMP_NUM_THREADS environment variable to determine how many threads to create (how many cores to use). You must set --ntasks=1 and then set OMP_NUM_THREADS to a value less than or equal to --cpus-per-task. Typically, set --cpus-per-task to the number of OpenMP threads you wish to use.
For other types of applications, there may be different ways to specify the number of cores to use (e.g., through particular command line arguments). Please refer to the software documentation for detailed information.

Below is an example for running SMP jobs:

#!/bin/bash
#SBATCH --job-name=parallel_job      # Job name
#SBATCH --nodes=1                    # Run all processes on a single node	
#SBATCH --ntasks=1                   # Run a single task		
#SBATCH --cpus-per-task=4            # Number of CPU cores per task
#SBATCH --time=00:10:00              # Time limit hh:mm:ss
#SBATCH --output=parallel_%j.log     # Standard output and error log

date
# use this line if your job uses OpenMP 
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK 

/home/user/smp_job.out
date
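
The export line in the script above relies on SLURM_CPUS_PER_TASK, which SLURM sets only when --cpus-per-task (or -c) is given. A slightly more defensive variant, sketched below, falls back to a single thread when the variable is missing. The helper name set_omp_threads and the fallback value of 1 are assumptions made for this illustration.

```shell
#!/bin/bash
# set_omp_threads: export OMP_NUM_THREADS from SLURM_CPUS_PER_TASK.
# Hypothetical helper; falls back to 1 thread when SLURM_CPUS_PER_TASK is
# unset (for example when --cpus-per-task was not specified).
set_omp_threads() {
    export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
}

set_omp_threads
echo "Using ${OMP_NUM_THREADS} OpenMP thread(s)"
```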

MPI (Message Passing Interface) Job

According to the Slurm documentation, "there are three fundamentally different modes of operation used by various MPI implementations with Slurm:

Slurm directly launches the tasks and performs initialization of communications through the PMI2 or PMIx APIs. (Supported by most modern MPI implementations.)

mpirun launches tasks using Slurm's infrastructure (not using PMIx).

Slurm creates a resource allocation for the job and then mpirun launches tasks using some mechanism other than Slurm." (We do not recommend HPC/LONI users use this method to launch their MPI jobs.)

PMIx Versions

If you compiled your MPI application using our default mvapich2 libraries (which are compiled with PMIx enabled), you should start the application directly using the srun command. Below is an example job script with the executable a.out compiled using mvapich2 and launched using the srun command:

#!/bin/bash
#SBATCH --job-name=mpi_job_test      # Job name
#SBATCH --partition=workq            # For jobs using more than 1 node, submit to workq
#SBATCH --nodes=2                    # Number of nodes to be allocated
#SBATCH --ntasks=96                  # Number of MPI tasks (i.e. processes/cores)
#SBATCH --time=00:05:00              # Wall time limit (hh:mm:ss)
#SBATCH --output=mpi_test_%j.log     # Standard output and error

echo "Date              = $(date)"
echo "Hostname          = $(hostname -s)"
echo "Working Directory = $(pwd)"
echo ""
echo "Slurm Nodes Allocated          = $SLURM_JOB_NODELIST"
echo "Number of Nodes Allocated      = $SLURM_JOB_NUM_NODES"
echo "Number of Tasks Allocated      = $SLURM_NTASKS"

module load mvapich2/2.3.3/intel-19.0.5
srun -n $SLURM_NTASKS ./a.out

Non-PMIx Versions

If your MPI application was not compiled with our default module key mvapich2/2.3.3/intel-19.0.5, you should start the application using the mpirun command. Below is an example job script with the executable a.out compiled using mvapich2/2.3.3/intel-19.0.5-hydra and launched using the mpirun command:

#!/bin/bash
#SBATCH --job-name=mpi_job_test      # Job name
#SBATCH --partition=workq            # For jobs using more than 1 node, submit to workq
#SBATCH --nodes=2                    # Number of nodes to be allocated
#SBATCH --ntasks=96                  # Number of MPI tasks (i.e. processes/cores)
#SBATCH --time=00:05:00              # Wall time limit (hh:mm:ss)
#SBATCH --output=mpi_test_%j.log     # Standard output and error

echo "Date              = $(date)"
echo "Hostname          = $(hostname -s)"
echo "Working Directory = $(pwd)"
echo ""
echo "Slurm Nodes Allocated          = $SLURM_JOB_NODELIST"
echo "Number of Nodes Allocated      = $SLURM_JOB_NUM_NODES"
echo "Number of Tasks Allocated      = $SLURM_NTASKS"

module load mvapich2/2.3.3/intel-19.0.5-hydra
mpirun -n $SLURM_NTASKS ./a.out

Hybrid (MPI + SMP) Job

Hybrid jobs are MPI applications where each MPI process is multi-threaded (usually via either OpenMP or POSIX Threads) and can use multiple cores across multiple nodes. If the MPI implementation is compiled with PMIx enabled, use the srun command to start the hybrid job; otherwise, use the mpirun command.

PMIx Versions

On QB3, there are 48 CPU cores on each compute node. The example below requests 4 MPI processes (tasks), and each process spawns 24 threads on 24 cores. A total of 96 cores on 2 nodes in workq is therefore used, with one thread running on each core. It uses the module key mvapich2/2.3.3/intel-19.0.5, which is compiled with PMIx enabled.

#!/bin/bash
#SBATCH --job-name=hybrid_job_test   # Job name
#SBATCH --partition=workq            # Submit to workq for multiple-node jobs
#SBATCH --nodes=2                    # Maximum number of nodes to be allocated
#SBATCH --ntasks=4                   # Number of MPI tasks (i.e. processes)
#SBATCH --cpus-per-task=24           # Number of cores per MPI task
#SBATCH --time=00:05:00              # Wall time limit (hh:mm:ss)
#SBATCH --output=hybrid_test_%j.log  # Standard output and error file

echo "Date              = $(date)"
echo "Hostname          = $(hostname -s)"
echo "Working Directory = $(pwd)"
echo ""
echo "Number of Nodes Allocated      = $SLURM_JOB_NUM_NODES"
echo "Number of Tasks Allocated      = $SLURM_NTASKS"
echo "Number of Cores/Task Allocated = $SLURM_CPUS_PER_TASK"

module load mvapich2/2.3.3/intel-19.0.5
srun -n $SLURM_NTASKS -c $SLURM_CPUS_PER_TASK ./a.out

Non-PMIx Versions

Similar to the above, the example below requests 4 tasks with 24 cores each, so a total of 96 cores is used from 2 nodes in workq. It uses the module key mvapich2/2.3.3/intel-19.0.5-hydra, which does not have PMIx enabled, so the mpirun command is used to launch ./a.out, and OMP_NUM_THREADS is set in the job script to determine the number of threads used by each process.

#!/bin/bash
#SBATCH --job-name=hybrid_job_test   # Job name
#SBATCH --partition=workq            # Submit to workq for multiple-node jobs
#SBATCH --nodes=2                    # Maximum number of nodes to be allocated
#SBATCH --ntasks=4                   # Number of MPI tasks (i.e. processes)
#SBATCH --cpus-per-task=24           # Number of cores per MPI task
#SBATCH --time=00:05:00              # Wall time limit (hh:mm:ss)
#SBATCH --output=hybrid_test_%j.log  # Standard output and error file

echo "Date              = $(date)"
echo "Hostname          = $(hostname -s)"
echo "Working Directory = $(pwd)"
echo ""
echo "Number of Nodes Allocated      = $SLURM_JOB_NUM_NODES"
echo "Number of Tasks Allocated      = $SLURM_NTASKS"
echo "Number of Cores/Task Allocated = $SLURM_CPUS_PER_TASK"

module load mvapich2/2.3.3/intel-19.0.5-hydra
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
mpirun -n $SLURM_NTASKS ./a.out
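
Before submitting a hybrid job it can be useful to confirm that tasks times cores per task does not exceed the cores in the allocation. The sketch below checks the hybrid examples above (4 tasks of 24 cores on 2 nodes with 48 cores each). The helper name fits_on_nodes is hypothetical, and it only compares totals; it does not verify how tasks pack onto individual nodes.

```shell
#!/bin/bash
# fits_on_nodes NODES CORES_PER_NODE NTASKS CPUS_PER_TASK
# Hypothetical helper: succeeds when the total cores requested
# (NTASKS * CPUS_PER_TASK) do not exceed the cores available
# (NODES * CORES_PER_NODE). Only totals are compared.
fits_on_nodes() {
    local nodes="$1" cores_per_node="$2" ntasks="$3" cpus_per_task="$4"
    [ $(( ntasks * cpus_per_task )) -le $(( nodes * cores_per_node )) ]
}

# Values from the hybrid examples: 2 nodes, 48 cores each, 4 tasks, 24 cores per task
if fits_on_nodes 2 48 4 24; then
    echo "request fits in the allocation"
else
    echo "request exceeds the allocation"
fi
```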

Submitting Multiple Dependent Jobs

Job dependencies are used to defer the start of a job until the specified dependent jobs have completed. They are specified with the --dependency option to the sbatch command using the below format:

sbatch --dependency=<type:job_id[:job_id][,type:job_id[:job_id]]> ...

Before using dependent jobs, please note that the overhead of starting and stopping a job in SLURM is very high: the scheduler needs to allocate node resources, check the nodes, and start your job, and after the job commands are done, the nodes need to be reclaimed by the scheduler for the next job. Therefore, if your jobs use the same configuration (i.e., the same number of nodes and cores), use a single job and run the dependent commands/tasks sequentially instead of using dependent jobs. It is much better to have fewer but longer-running jobs.
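
Running dependent steps inside a single job can be sketched as follows. The three functions are hypothetical stand-ins for the real commands of each stage. Chaining them with && means a later step runs only if the earlier ones succeeded, which is similar in spirit to a dependency on successful completion.

```shell
#!/bin/bash
# Hypothetical stand-ins for the real commands of each stage.
step1() { echo "step1 done"; }
step2() { echo "step2 done"; }
step3() { echo "step3 done"; }

# run_pipeline: run the stages in order inside one job.
# The && operator stops the chain at the first stage that fails.
run_pipeline() {
    step1 && step2 && step3
}

run_pipeline
```

In a job script, the body of each function would be replaced by the actual command line of that stage, and the #SBATCH header would request the resources once for the whole sequence.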

Below is an example that submits three jobs: job1.sh, job2.sh and job3.sh. job3.sh depends on the completion of job1.sh and job2.sh. In this very simple example, job1.sh and job2.sh each sleep for a few seconds and then write their job ID ($SLURM_JOBID) to a file named "depfile"; job3.sh then displays the content of "depfile" to confirm that job1.sh and job2.sh have both completed:

job1.sh:

#!/bin/bash
#SBATCH --time 1:00:00
#SBATCH --nodes 1

sleep 10 # sleep 10 seconds
echo "JOB1ID=$SLURM_JOBID" >> depfile # output job-id to depfile

exit

job2.sh:

#!/bin/bash
#SBATCH --time 1:00:00
#SBATCH --nodes 1

sleep 5 # sleep 5 seconds, on an idle cluster with at least 2 nodes, job2.sh will finish before job1.sh
echo "JOB2ID=$SLURM_JOBID" >> depfile # output job-id to depfile

exit

job3.sh:

#!/bin/bash
#SBATCH --time 1:00:00
#SBATCH --nodes 1

# show content of depfile, it should have the job-id of both job1.sh and job2.sh
cat depfile 

exit

We use the script below to submit the three jobs from the login node. With the --dependency option of sbatch, job3.sh will start only after job1.sh and job2.sh have both completed.

submit.sh:

#!/bin/bash
# Do NOT submit this script using sbatch!
# use the command below to get the job-id of the first job
# the sbatch will output a line containing the job-id just submitted
# we use the cut command to get the job-id (last field)

JOBID1=$( sbatch job1.sh | cut -d' ' -f4 )
echo "Submitted batch job $JOBID1"

JOBID2=$( sbatch job2.sh | cut -d' ' -f4 )
echo "Submitted batch job $JOBID2"

# job3.sh depends on the completion of job1.sh and job2.sh
sbatch --dependency=afterok:$JOBID1:$JOBID2 job3.sh

We then run the submit.sh bash script to submit the three dependent jobs. Note that this script is *NOT* a job script, so do *NOT* submit it using sbatch.

[fchen14@philip1 slurmdoc]$ ./submit.sh
Submitted batch job 27
Submitted batch job 28
Submitted batch job 29
# check the job status using squeue, note the (Dependency) flag for job-id 29.
[fchen14@philip1 slurmdoc]$ squeue -u fchen14
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                27   checkpt  job1.sh  fchen14 CF       0:04      1 philip011
                28   checkpt  job2.sh  fchen14 CF       0:04      1 philip012
                29   checkpt  job3.sh  fchen14 PD       0:00      1 (Dependency)
# job2.sh (job-id=28) finishes first
[fchen14@philip1 slurmdoc]$ squeue -u fchen14
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                29   checkpt  job3.sh  fchen14 PD       0:00      1 (Dependency)
                27   checkpt  job1.sh  fchen14  R       0:14      1 philip011
# job3.sh starts after job1.sh (job-id=27) is finished
[fchen14@philip1 slurmdoc]$ squeue -u fchen14
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                29   checkpt  job3.sh  fchen14 CF       0:04      1 philip011
# all three jobs are finished
[fchen14@philip1 slurmdoc]$ squeue -u fchen14
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
# the output file of job3.sh (job-id 29) shows job-ids of both job1.sh and job2.sh
[fchen14@philip1 slurmdoc]$ cat slurm-29.out
JOB2ID=28
JOB1ID=27
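
The submit.sh script takes the job ID from the fourth space-separated field of the sbatch message. An equivalent sketch that takes the last field instead, and builds the dependency string from any number of job IDs, is shown below. The helper names job_id_from_sbatch and afterok_dependency are hypothetical, and the sample messages are typed in by hand so that the snippet runs without a scheduler.

```shell
#!/bin/bash
# job_id_from_sbatch: read a line such as "Submitted batch job 27" on
# stdin and print the job ID, which is the last field. Hypothetical helper.
job_id_from_sbatch() {
    awk '{ print $NF }'
}

# afterok_dependency ID...: join job IDs into "afterok:ID1:ID2:...".
# Hypothetical helper; "$*" joins the arguments with the first character
# of IFS, which is set to ":" locally.
afterok_dependency() {
    local IFS=:
    echo "afterok:$*"
}

# Hand-written sample messages stand in for real sbatch output here.
id1=$(echo "Submitted batch job 27" | job_id_from_sbatch)
id2=$(echo "Submitted batch job 28" | job_id_from_sbatch)
echo "--dependency=$(afterok_dependency "$id1" "$id2")"
```

sbatch also provides a --parsable option that prints the job ID without the surrounding message, which may remove the need for text parsing; check 'man sbatch' on your cluster for the exact output format.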

PBS to SLURM Translation

The tables below show common PBS (Moab/Torque) to Slurm translations:

**PBS (Moab/Torque) to Slurm commands**
Action	PBS (Moab/Torque)	SLURM
Job Submission	qsub jobscript	sbatch jobscript
List user jobs	qstat -u $USER	squeue -u $USER
Job deletion	qdel <job-id>	scancel <job-id>
Check available queue	qstat -q	sinfo
Check job status	checkjob <job-id>	scontrol show job <job-id>

**PBS (Moab/Torque) to Slurm directives (Special comments)**
Directive (Special comments)	PBS (Moab/Torque)	SLURM
Walltime (time limit)	#PBS -l walltime=2:00:00	#SBATCH -t 2:00:00 (or --time=2:00:00)
Node/Process count	#PBS -l nodes=2:ppn=8	#SBATCH -N 2 (or --nodes 2) #SBATCH --ntasks-per-node 8
Partition (Queue)	#PBS -q checkpt	#SBATCH -p checkpt
Allocation	#PBS -A your_allocation_name	#SBATCH -A your_allocation_name
Email address	#PBS -M your@email.address	#SBATCH --mail-user your@email.address
Email options	#PBS -m abe	#SBATCH --mail-type FAIL,BEGIN,END (or ALL; see 'man sbatch' for more options)
JobName	#PBS -N jobname	#SBATCH -J jobname
Job output	#PBS -o filename.out #PBS -e filename.err #PBS -j oe	#SBATCH -o filename.out #SBATCH -e filename.err SLURM merges stdout and stderr by default

**PBS (Moab/Torque) to Slurm environment variables**
Description	PBS (Moab/Torque)	SLURM
Job ID	$PBS_JOBID	$SLURM_JOBID
Node list	$PBS_NODEFILE	$SLURM_JOB_NODELIST
Job submit directory	$PBS_O_WORKDIR	$SLURM_SUBMIT_DIR
Number of nodes	$PBS_NUM_NODES	$SLURM_NNODES
Number of CPU-cores (tasks)	$PBS_NP	$SLURM_NTASKS
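
While migrating scripts, it can be convenient to let one script run under either scheduler. The sketch below reads the SLURM variable first and falls back to the PBS one from the table above. The helper names and the final fallbacks ("unknown" and the current directory) are assumptions made for this illustration.

```shell
#!/bin/bash
# portable_job_id: print the job ID under SLURM or PBS.
# Hypothetical helper; prints "unknown" when neither variable is set.
portable_job_id() {
    echo "${SLURM_JOBID:-${PBS_JOBID:-unknown}}"
}

# portable_submit_dir: print the directory the job was submitted from.
# Hypothetical helper; falls back to the current directory.
portable_submit_dir() {
    echo "${SLURM_SUBMIT_DIR:-${PBS_O_WORKDIR:-$PWD}}"
}

echo "Job ID:           $(portable_job_id)"
echo "Submit directory: $(portable_submit_dir)"
```

The node list does not translate as directly: $PBS_NODEFILE names a file containing the hosts, while $SLURM_JOB_NODELIST holds a compact host list string, so scripts that read the node file need more than a variable rename.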

High Performance Computing

Louisiana State University
