Monitoring jobs
Checking the status of submitted jobs
To monitor jobs you have submitted to the queue, run the squeue --me
command. The example below assumes your username is abc123.
$ squeue --me
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
123456 compute interact abc123 PD 0:00 1 (Resources)
123450 compute interact abc123 R 1:37:15 1 ddy145
123461 compute interact abc123 R 2:12:07 1 ehc4
Omitting --me shows queued and running jobs for all users.
Job states
When you run squeue, each job is typically shown in one of the following states:
R - The job is currently running.
PD - The job has been submitted and is waiting in the queue.
CG - The job is in the process of completing.
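As a quick way to summarise how many of your jobs are in each state, the squeue output can be piped through a small filter. This is a minimal sketch: the function name count_job_states is hypothetical, and it assumes the default squeue column layout shown earlier, where ST is the fifth column and job names contain no spaces.

```shell
# Count jobs per state code from squeue output read on standard input.
# Assumption: default squeue layout, with the state (ST) in column 5
# and no whitespace inside job names.
count_job_states() {
    awk 'NR > 1 { count[$5]++ }                      # skip the header line
         END    { for (s in count) print s, count[s] }' | sort
}

# Usage on the cluster:
#   squeue --me | count_job_states
```

For the three jobs in the example output above, this would print one line for PD with a count of 1 and one line for R with a count of 2.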
Once the job has finished, it will likely have one of the following states
(shown via sacct or jobstats):
COMPLETED - The job has finished successfully (exit code 0).
FAILED - The job did not finish successfully (non-zero exit code).
OUT_OF_MEMORY - The job exceeded the requested memory.
TIMEOUT - The job exceeded the requested runtime.
See the official Slurm documentation for more information about job exit states and exit codes.
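To check the final state of a finished job, sacct can be asked for just the relevant fields. The wrapper below is an illustrative sketch: the function name job_outcome is hypothetical, and the field list is one reasonable selection rather than a site recommendation.

```shell
# Print the final state and exit code of a job from the accounting database.
# Argument: the job ID, e.g. 123450.
job_outcome() {
    sacct --jobs="$1" --format=JobID,JobName,State,ExitCode,Elapsed
}

# Usage on the cluster:
#   job_outcome 123450
```

The State column should then show one of the values listed above, such as COMPLETED or TIMEOUT.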
A job can be pending (PD) for a variety of reasons, and a pending state does
not necessarily indicate an error with your job. The output of squeue shows the
reason why the job is waiting in the queue in the final column, within brackets.
See the official Slurm documentation for more information about job reason codes.
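To focus on pending jobs only, squeue can filter by state and print the reason in its own column. The following is a sketch under stated assumptions: the function names are hypothetical and the column widths in the format string are arbitrary choices.

```shell
# List your pending jobs with job ID (%i), job name (%j) and pending reason (%r).
pending_reasons() {
    squeue --me --states=PENDING --format="%.10i %.20j %r"
}

# Show the scheduler's expected start times for your pending jobs.
# These are estimates and can change as other jobs finish or are submitted.
pending_start_estimates() {
    squeue --me --start
}

# Usage on the cluster:
#   pending_reasons
#   pending_start_estimates
```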
Viewing job details
To see more information about a specific job (for example, the requested
resources and output directories), use scontrol show job JOB_ID:
$ scontrol show job 123450
JobId=123450 JobName=test_job
UserId=abc123(UID) GroupId=group(GID) MCS_label=N/A
Priority=1 Nice=0 Account=ACCOUNT QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
...
The scontrol show job command will only work whilst the job is queued,
running or shortly after execution. After a short period of time, the
job is written to the accounting database and wiped from the scheduler.
At this point, the scontrol show job command will display the following
error instead of the job details:
slurm_load_jobs error: Invalid job id specified
To display jobs that have been written to the accounting database, use
sacct or jobstats instead.
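The two lookups can be combined so that a single command works for both current and finished jobs. This is a minimal sketch, assuming scontrol returns a non-zero exit status when it reports the invalid job ID error; the function name show_job is hypothetical.

```shell
# Show details for a job, falling back to the accounting database
# once the scheduler no longer knows about it.
# Assumption: scontrol exits non-zero on "Invalid job id specified".
show_job() {
    if ! scontrol show job "$1" 2>/dev/null; then
        sacct --jobs="$1" --format=JobID,JobName,State,ExitCode,Elapsed
    fi
}

# Usage on the cluster:
#   show_job 123450
```

Note that the sacct fallback prints a summary table rather than the full scontrol record, so the two outputs are not identical in content.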
Checking where my job is in the queue
We provide some additional commands to display current activity of the cluster
and the queues in a more readable format than the standard squeue and sinfo
commands.
nodestatus
nodestatus will show the current core and memory usage of all of
the nodes. Note that some nodes are on partitions that may not be available to
all users. An example output is shown below:
$ nodestatus
HOSTNAMES PARTITION CPUS(A/I/O/T) MEMORY FREE_MEM AVAIL_FEATURES STATE REASON
...
ddy19 compute* 48/0/0/48 376GB 338106 ddy allocated none
ddy20 compute* 48/0/0/48 376GB 357888 ddy allocated none
ddy21 compute* 48/0/0/48 376GB 354636 ddy allocated none
ddy22 compute* 48/0/0/48 376GB 352217 ddy allocated none
ddy23 compute* 46/2/0/48 376GB 338739 ddy mixed none
ddy24 compute* 48/0/0/48 376GB 338195 ddy allocated none
ddy25 compute* 48/0/0/48 376GB 296589 ddy allocated none
ddy26 compute* 48/0/0/48 376GB 345574 ddy allocated none
ddy27 compute* 47/1/0/48 376GB 335979 ddy mixed- none
ddy28 compute* 48/0/0/48 376GB 305058 ddy allocated none
...
The CPUS column reports the Allocated/Idle/Other/Total cores for each
node.
The FREE_MEM column is the amount of unused memory on the node, as reported
by the free command. Memory that is currently unused may still be allocated to
other jobs, so this figure does not indicate the memory available for new jobs.
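To pick out nodes that still have idle cores, the CPUS column can be split on its slashes. This is an illustrative sketch that assumes the nodestatus layout shown above, with the Allocated/Idle/Other/Total figures in the third column; the function name nodes_with_idle_cores is hypothetical.

```shell
# Print "hostname idle_cores" for every node with at least one idle core.
# Assumption: column 3 holds Allocated/Idle/Other/Total, e.g. 46/2/0/48.
nodes_with_idle_cores() {
    awk '$3 ~ /^[0-9]+\/[0-9]+\/[0-9]+\/[0-9]+$/ {
        split($3, cpus, "/")              # cpus[2] is the idle core count
        if (cpus[2] + 0 > 0) print $1, cpus[2]
    }'
}

# Usage on the cluster:
#   nodestatus | nodes_with_idle_cores
```

For the example output above, this would list ddy23 with 2 idle cores and ddy27 with 1. Idle cores alone do not guarantee that a new job can start on a node, since memory and partition access also apply.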
Email notifications
To receive email notifications when your job status changes, add these options to your job script:
# Send an email on job events (e.g. start, end, fail)
#SBATCH --mail-type=ALL
# The email address to notify
#SBATCH --mail-user=my_name@qmul.ac.uk
To receive email notifications only for specific job states, replace ALL with
one (or more, comma-separated) of the available keywords. For example, use
FAIL to be notified only when a job fails. See the full list of keywords in the
official Slurm documentation.
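As an illustration of combining keywords, the fragment below requests an email when the job ends or fails, but not when it starts. The email address is the same placeholder used above.

```shell
# Send an email only when the job ends or fails
#SBATCH --mail-type=END,FAIL
# The email address to notify
#SBATCH --mail-user=my_name@qmul.ac.uk
```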
If you do not add the --mail-user option, you will be emailed at the address
that is registered with us (usually your QMUL email address).