Moving from Grid Engine to Slurm¶
This page is for users moving from Grid Engine to Slurm
The information below is only useful for users who have already used Grid Engine on Apocrita and want to know how to migrate their existing knowledge to Slurm. If you have never used Grid Engine before, stop reading and return to the main documentation.
The use of Slurm on Apocrita represents a significant change for users accustomed to the Univa Grid Engine (UGE) job scheduler.
Whilst UGE served us well, Slurm has been widely adopted at many other HPC sites, is under active development, and offers the features and flexibility we need as we introduce new platforms for the research community at the University.
This page shows Slurm commands and job script options next to their UGE counterparts to help you move from UGE to Slurm.
Job script header¶
UGE (#$) vs Slurm (#SBATCH)¶
The move from UGE to Slurm means your previous Apocrita job scripts will no
longer work, because job script header lines beginning with #$ are ignored by
Slurm. Instead, you should use lines beginning with #SBATCH, and you will need
to convert the options you use on those lines from UGE to Slurm.
Watch out for $!
Note that it is #SBATCH (with the letter "S" in capitals, short for
"Slurm Batch") and not #$BATCH (with a dollar ("$") symbol). This is an easy
mistake to make when you begin converting your UGE job scripts. Do not use a
$ (dollar) symbol in your Slurm job script header.
Examples of UGE job scripts and their Slurm equivalents are given below.
The commands used to submit jobs and check on the queue have also changed; see below for the equivalent commands.
Job submission and management¶
UGE (qsub, …) vs Slurm (sbatch, …)¶
| UGE Commands | Slurm Commands |
|---|---|
| `# Batch job submission` | `# Batch job submission` |
| `qsub job_script` | `sbatch job_script` |
| `qsub job_script arg1 arg2 ...` | `sbatch job_script arg1 arg2 ...` |
| `# Job queue status` | `# Job queue status` |
| `qstat # Show your jobs (if any)` | `squeue --me` |
| `qstat -u "*" # Show all jobs` | `squeue` |
| `qstat -u username` | `squeue -u username` |
| `# Cancel (delete) a job` | `# Cancel (delete) a job` |
| `qdel jobid` | `scancel jobid` |
| `qdel jobname` | `scancel -n jobname` |
| `qdel jobid -t taskid` | `scancel jobid_taskid` |
| `qdel "*" # Delete all my jobs` | `scancel --me # Delete all my jobs` |
| `# Interactive job` | `# Interactive job` |
| `qlogin` | `salloc` |
| `# Completed job accounting` | `# Completed job accounting` |
| `qacct -j jobid` | `sacct -j jobid` |
Job scripts¶
Use tasks (-n) and not cores per task (-c)
Slurm refers to CPUs as "tasks", and in most job scripts you should request
the number of CPUs you require using -n or --ntasks, as in the examples
below. The -c option sets the number of CPUs required per task, and should
normally be used only for advanced jobs, such as those combining Open MPI
ranks with OpenMP threads.
You will need to rewrite your UGE job scripts. We advise taking a copy of any
existing UGE script and naming it something like job.slurm or
job.sbatch, to make it obvious it is a Slurm job script.
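As a sketch of this workflow (the filename `job.uge` and its contents are hypothetical, for illustration only), the following commands make a clearly-named Slurm copy of an old UGE script and then list the old `#$` header lines that still need rewriting as `#SBATCH` by hand:

```shell
# Create an example UGE script (hypothetical contents, for illustration)
cat > job.uge <<'EOF'
#!/bin/bash
#$ -pe smp 1
#$ -l h_rt=1:0:0
echo hello
EOF

# Copy it to a clearly-named Slurm script...
cp job.uge job.slurm

# ...then list the UGE header lines that still need converting to #SBATCH
grep -n '^#\$' job.slurm
```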
Put #SBATCH lines in one block¶
Please note: all Slurm job script header lines beginning with #SBATCH must
come before ordinary lines that run job commands or your application. Any
#SBATCH lines appearing after the first non-#SBATCH line will be ignored.
For example:
```bash
#!/bin/bash
#SBATCH -p compute        # (or --partition=compute)
#SBATCH -n 1              # (or --ntasks=1) Request 1 core
#SBATCH -t 1:0:0          # Request 1 hour runtime
#SBATCH --mem-per-cpu=1G  # Request 1GB RAM per core

# Now the first "ordinary" line. No more #SBATCH lines would be processed if
# they were added after this
export MY_DATA=/gpfs/scratch/${USER}/data

module load app

app <arguments>
```
Please note that under Slurm, --mem-per-cpu must be set to an integer;
for example, for 7.5GB of RAM per core you must use --mem-per-cpu=7500M and
not --mem-per-cpu=7.5G.
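Since this conversion is easy to get wrong, here is a small sketch of turning a fractional GB value into the whole-MB figure Slurm accepts, using the 1 GB = 1000 MB convention that matches the 7500M example above:

```shell
gb_per_core="7.5"   # desired RAM per core in GB

# Multiply by 1000 and truncate to a whole number of MB
mb_per_core=$(awk -v g="$gb_per_core" 'BEGIN { printf "%d", g * 1000 }')

echo "--mem-per-cpu=${mb_per_core}M"   # prints --mem-per-cpu=7500M
```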
You can also use Slurm's srun
command to run your application or job commands as a separate Slurm job step:

```bash
srun app <arguments>
```

Please refer to the
official Slurm documentation for
more information about the srun command. We recommend that most users
initially stick to simple job scripts that do not use srun.
Serial job script (single-core)¶
Note that in Slurm you should explicitly request one core to be safe: some job
scripts need the $SLURM_NTASKS environment variable (the equivalent of UGE's
$NSLOTS variable), and Slurm only sets it if you explicitly request a number
of cores.
UGE:

```bash
#!/bin/bash
#$ -cwd           # Set the working directory for the job to the current directory
#$ -pe smp 1      # Request 1 core
#$ -l h_rt=1:0:0  # Request 1 hour runtime
#$ -l h_vmem=1G   # Request 1GB RAM per core

# Module load
module load app

# Run application
app \
    --input in.dat \
    --output out.dat
```

Slurm:

```bash
#!/bin/bash
# The working directory for the job is the current directory by default in Slurm
#SBATCH -n 1              # (or --ntasks=1) Request 1 core
# OPTIONAL LINE: default partition is compute
#SBATCH -p compute        # (or --partition=compute)
#SBATCH -t 1:0:0          # Request 1 hour runtime
#SBATCH --mem-per-cpu=1G  # Request 1GB RAM per core

# Module load
module load app

# Run application
app \
    --input in.dat \
    --output out.dat
```
For more detailed examples of single-core serial job scripts, please see the main documentation.
Serial job script (multi-core)¶
UGE:

```bash
#!/bin/bash
#$ -cwd           # Set the working directory for the job to the current directory
#$ -pe smp 4      # Request 4 CPU cores
#$ -l h_rt=1:0:0  # Request 1 hour runtime
#$ -l h_vmem=1G   # Request 1GB RAM / core, i.e. 4GB total

# Module load
module load app

# Using $NSLOTS for threading
app \
    --threads ${NSLOTS} \
    --input in.dat \
    --output out.dat
```

Slurm:

```bash
#!/bin/bash
# The working directory for the job is the current directory by default in Slurm
#SBATCH -n 4              # (or --ntasks=4) Request 4 CPU cores
# OPTIONAL LINE: default partition is compute
#SBATCH -p compute        # (or --partition=compute)
#SBATCH -t 1:0:0          # Request 1 hour runtime
#SBATCH --mem-per-cpu=1G  # Request 1GB RAM / core, i.e. 4GB total

# Module load
module load app

# Using $SLURM_NTASKS for threading
app \
    --threads ${SLURM_NTASKS} \
    --input in.dat \
    --output out.dat
```
For more detailed examples of multi-core serial job scripts, please see the main documentation.
Parallel job script¶
Request the right resources and partition
Parallel jobs must request the parallel partition and at least two nodes.
Jobs that fail to fulfil these requirements will be rejected by Slurm.
Slurm exclusive requests must separately request exclusive RAM
On UGE, jobs submitted to the Apocrita parallel nodes would automatically
make an exclusive request for all CPUs and RAM on the nodes requested. On
Slurm, users need to request both --exclusive and --mem=0 as in the
example below. For more information, please see the
official Slurm documentation.
UGE:

```bash
#!/bin/bash
#$ -cwd                 # Set the working directory for the job to the current directory
#$ -pe parallel 96      # Request 96 cores / 2 ddy nodes
#$ -l infiniband=ddy-i  # Choose infiniband island (ddy-i)
#$ -l h_rt=240:0:0      # Request 240 hours runtime
# This is automatically added to all UGE parallel jobs on Apocrita
#$ -l exclusive         # Request all resources on node

# Module load
module load openmpi

# UGE needs to be explicitly given the number of ranks to use
# (usually via $NSLOTS)
mpirun \
    -np ${NSLOTS} \
    ./code \
    -i input.file
```

Slurm:

```bash
#!/bin/bash
# The working directory for the job is the current directory by default in Slurm
#SBATCH -N 2         # (or --nodes=2) Request 2 ddy nodes
#SBATCH -n 96        # (or --ntasks=96) Request 96 cores
#SBATCH -p parallel  # (or --partition=parallel)
#SBATCH -t 240:0:0   # Request 240 hours runtime
# Both arguments required for exclusive use of CPUs and memory on all nodes
#SBATCH --exclusive
#SBATCH --mem=0

# Module load
module load openmpi

# Slurm knows how many cores to use for mpirun, detected automatically from
# ${SLURM_NTASKS}. Use -- to ensure arguments are passed to the application
# and not mpirun
mpirun \
    -- \
    ./code \
    -i input.file
```
For more detailed examples of parallel job scripts, please see the main documentation.
Use mpirun instead of srun --mpi
The
official Open MPI documentation
recommends using mpirun for all MPI processes under Slurm and not
srun.
Array job script¶
UGE:

```bash
#!/bin/bash
#$ -cwd
#$ -pe smp 1
#$ -l h_vmem=1G
#$ -j y
#$ -l h_rt=1:0:0
#$ -t 1-3

echo ${SGE_TASK_ID}
```

Slurm:

```bash
#!/bin/bash
#SBATCH -n 1
#SBATCH --mem-per-cpu=1G
#SBATCH -t 1:0:0
#SBATCH -a 1-3

echo ${SLURM_ARRAY_TASK_ID}
```
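A common pattern is to use $SLURM_ARRAY_TASK_ID to select a numbered input file for each task. The sketch below uses hypothetical file names (in_1.dat, in_2.dat, in_3.dat); the :-1 fallback is only so the lines can be tried outside a real array job, where the variable is unset:

```shell
#!/bin/bash
#SBATCH -n 1
#SBATCH -t 1:0:0
#SBATCH -a 1-3

# Each array task selects its own input file, e.g. in_2.dat for task 2
# (the :-1 default is only so this sketch also runs outside Slurm)
INPUT="in_${SLURM_ARRAY_TASK_ID:-1}.dat"
echo "Task ${SLURM_ARRAY_TASK_ID:-1} would process ${INPUT}"
```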
For more detailed examples of array job scripts, please see the main documentation.
GPU job script¶
UGE:

```bash
#!/bin/bash
#$ -cwd
#$ -j y
#$ -pe smp 8      # 8 cores (8 cores per GPU)
#$ -l h_rt=1:0:0  # 1 hour runtime
#$ -l h_vmem=11G  # 11 * 8 = 88G total RAM
#$ -l gpu=1       # request 1 GPU

./run_code.sh
```

Slurm:

```bash
#!/bin/bash
#SBATCH -p gpushort        # (or --partition=gpushort)
#SBATCH -n 8               # 8 cores (8 cores per GPU)
#SBATCH -t 1:0:0           # 1 hour runtime
#SBATCH --mem-per-cpu=11G  # 11 * 8 = 88G total RAM
#SBATCH --gres=gpu:1       # request 1 GPU

./run_code.sh
```
For more detailed examples of different GPU job script types, please see the main GPU documentation.
More job script header options – UGE vs Slurm¶
Job Output Files¶
| UGE job output files | Slurm job output files |
|---|---|
| **Individual (non-array) jobs** | **Individual (non-array) jobs** |
| `jobscriptname.oJOBID` | `slurm-JOBID.out` |
| `jobscriptname.eJOBID` | |
| **Array jobs** | **Array jobs** |
| `jobscriptname.oJOBID.TASKID` | `slurm-ARRAYJOBID_TASKID.out` |
| `jobscriptname.eJOBID.TASKID` | |
By default, Slurm generates a single job output file containing both the
STDOUT and STDERR output that UGE splits into two files by default. If you
were already using "-j y" in your UGE jobs, this will be familiar.
The naming and merging of the files can be changed using job script options.
We highly recommend giving your Slurm jobs a name and then defining a matching
output file naming scheme, to avoid all your output files starting with
"slurm-".
Single combined job output file¶
For a single output file (which would be named "jobname.o<job number>"):
```bash
#SBATCH -J jobname
#SBATCH -o %x.o%j
```
Example for a job called "jobname" with the job number 1234567:
```
jobname.o1234567
```
Separate job error and output files¶
For separate output files for STDOUT and STDERR (which would be named
"jobname.o<job number>" and "jobname.e<job number>" respectively):
```bash
#SBATCH -J jobname
#SBATCH -o %x.o%j
#SBATCH -e %x.e%j
```
Example for a job called "jobname" with the job number 1234567:
```
jobname.e1234567
jobname.o1234567
```
UGE vs Slurm output naming options¶
UGE:

```bash
#!/bin/bash
...
# Naming the job is optional. Default is the name of the job script.
# DOES rename the .o and .e output files.
#$ -N jobname

# Naming the output files is optional.
# Default is separate .o and .e files:
# jobname.oJOBID and jobname.eJOBID
# Use of '-N jobname' DOES affect those defaults
#$ -o myjob.out
#$ -e myjob.err

# To join .o and .e into a single file,
# similar to Slurm's default behaviour:
#$ -j y
```

Slurm:

```bash
#!/bin/bash
...
# Naming the job is optional. Default is the name of the job script.
# Does NOT rename the .out file.
#SBATCH -J jobname

# Naming the output files is optional.
# Default is a single file for .o and .e:
# slurm-JOBID.out
# Use of '-J jobname' does NOT affect the default
#SBATCH -o myjob.out
#SBATCH -e myjob.err

# Use wildcards to recreate the UGE names
# %x = $SLURM_JOB_NAME
# %j = $SLURM_JOB_ID
#SBATCH -o %x.o%j
#SBATCH -e %x.e%j
```
For further naming options, please refer to the official Slurm documentation.
The $SLURM_JOB_NAME variable gives the name of your job script, unless the
-J jobname option is used to rename your job, in which case the variable is
set to the value of jobname.
If you want $SLURM_JOB_NAME to always give the name of the job script from
within your job, you must omit the -J flag. Alternatively, the following
command run inside your job script gives the name of the job script regardless
of whether you use the -J flag:

```bash
scontrol show jobid ${SLURM_JOB_ID} | grep Command= | awk -F/ '{print $NF}'
```
Renaming array job output files¶
An array job uses `slurm-ARRAYJOBID_TASKID.out` as the default output file for
each task in the array job. This can be renamed, but you need to use the %A and
%a wildcards (not %j).
UGE:

```bash
#!/bin/bash
...
# An array job (cannot start at 0)
#$ -t 1-1000

# Naming the job is optional.
# Default is the name of the job script
#$ -N jobname

# Naming the output files is optional.
# Default is separate .o and .e files:
# jobname.oJOBID and jobname.eJOBID
# Use of '-N jobname' DOES affect those defaults

# To join .o and .e into a single file,
# similar to Slurm's default behaviour:
#$ -j y
```

Slurm:

```bash
#!/bin/bash
...
# An array job (CAN start at 0)
#SBATCH -a 0-999 # (or --array=0-999)

# Naming the job is optional.
# Default is the name of the job script
#SBATCH -J jobname

# Naming the output files is optional.
# Default is a single file for .o and .e:
# slurm-ARRAYJOBID_TASKID.out
# Use of '-J jobname' does NOT affect the default

# Use wildcards to recreate the UGE names
# %x = $SLURM_JOB_NAME
# %A = $SLURM_ARRAY_JOB_ID
# %a = $SLURM_ARRAY_TASK_ID
#SBATCH -o %x.o%A.%a
#SBATCH -e %x.e%A.%a
```
Emailing from a job¶
Slurm can email you when your job begins, ends or fails.
UGE:

```bash
#!/bin/bash
...
# Mail events: begin, end, abort
#$ -m bea
#$ -M emailaddr@qmul.ac.uk
```

Slurm:

```bash
#!/bin/bash
...
# Mail events: NONE, BEGIN, END, FAIL, ALL
#SBATCH --mail-type=ALL
#SBATCH --mail-user=emailaddr@qmul.ac.uk
```
Note that in Slurm, array jobs send only one email, not an email per job-array
task as happens in UGE. If you want an email from every job-array task, add
ARRAY_TASKS to the --mail-type option:

```bash
#SBATCH --mail-type=ALL,ARRAY_TASKS
#
# DO NOT USE IF YOUR ARRAY JOB CONTAINS MORE THAN
# 20 TASKS!!
```

But please be aware that you will receive a large number of emails if you run a large job array with this flag enabled.
Job Environment Variables¶
A number of environment variables are available for use in your job scripts. These are useful, for example, when creating your own log files, when informing applications how many cores they are allowed to use (we have already seen $SLURM_NTASKS in the examples above), and when reading sequentially numbered data files in job arrays.
| UGE Environment Variables | Slurm Environment Variables |
|---|---|
| `$NSLOTS # Num cores reserved` | `$SLURM_NTASKS # Num cores from -n flag` |
| | `$SLURM_CPUS_PER_TASK # Num cores from -c flag` |
| `$JOB_ID # Unique job id number` | `$SLURM_JOB_ID # Unique job id number` |
| `$JOB_NAME # Name of job` | `$SLURM_JOB_NAME # Name of job` |
| `# For array jobs` | `# For array jobs` |
| `$JOB_ID # SAME for all tasks` | `$SLURM_JOB_ID # DIFFERENT for each task` |
| `# (e.g., 20173)` | `# (e.g., 20173, 20174, 20175, ...)` |
| | `$SLURM_ARRAY_JOB_ID # SAME for all tasks` |
| | `# (e.g., 20173)` |
| `$SGE_TASK_ID # Job array task number` | `$SLURM_ARRAY_TASK_ID # Job array task number` |
| `# (e.g., 1, 2, 3, ...)` | `# (e.g., 1, 2, 3, ...)` |
| `$SGE_TASK_FIRST # First task id` | `$SLURM_ARRAY_TASK_MIN # First task id` |
| `$SGE_TASK_LAST # Last task id` | `$SLURM_ARRAY_TASK_MAX # Last task id` |
| `$SGE_TASK_STEPSIZE # Task id increment: default 1` | `$SLURM_ARRAY_TASK_STEP # Increment: default 1` |
| | `$SLURM_ARRAY_TASK_COUNT # Number of tasks` |
| `# Others` | `# Others` |
| `$PE_HOSTFILE # Multi-node job host list` | `$SLURM_JOB_NODELIST # Multi-node job host list` |
| `$NHOSTS # Number of nodes in use` | `$SLURM_JOB_NUM_NODES # Number of nodes in use` |
| `$SGE_O_WORKDIR # Submit directory` | `$SLURM_SUBMIT_DIR # Submit directory` |
Many more environment variables are available for use in your job script. The
official Slurm documentation
(also available by running man sbatch) documents input and output
environment variables. The input variables can be set by you before
submitting a job to set job options (although we recommend not doing this: it
is better to put all options in your job script so that you have a permanent
record of how you ran the job). The output variables can be used inside your
job script to get information about the job (e.g., the number of cores, the
job name, and so on); we have documented several of these above.
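As an illustration of using the output variables, the lines below build a simple log line from them. This is a sketch: the :- fallback values are only so the snippet also runs outside a Slurm job, where these variables are unset.

```shell
# Build a log line from Slurm's output environment variables
# (fallbacks after :- apply only when run outside a Slurm job)
msg="Job ${SLURM_JOB_ID:-<none>} (${SLURM_JOB_NAME:-interactive}) on ${SLURM_JOB_NUM_NODES:-1} node(s) with ${SLURM_NTASKS:-1} core(s)"
echo "$msg"
```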