
Monitoring jobs

Checking the status of submitted jobs

To monitor jobs you have submitted to the queue, run the squeue --me command. The example below assumes your username is abc123.

$ squeue --me
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            123456   compute interact   abc123 PD       0:00      1 (Resources)
            123450   compute interact   abc123  R    1:37:15      1 ddy145
            123461   compute interact   abc123  R    2:12:07      1 ehc4

Omitting --me will show queued and running jobs for all users.
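If you have many jobs, it can help to summarise them by state. A minimal sketch, parsing squeue output in the default format (the sample below reuses the illustrative jobs from the example above):

```shell
# Sample squeue output (illustrative job IDs, matching the example above)
squeue_output='123456   compute interact   abc123 PD       0:00      1 (Resources)
123450   compute interact   abc123  R    1:37:15      1 ddy145
123461   compute interact   abc123  R    2:12:07      1 ehc4'

# The ST (state) field is the fifth column; count jobs per state
echo "$squeue_output" | awk '{count[$5]++} END {for (s in count) print s, count[s]}'
```

On the cluster itself, squeue --me --states=PD --noheader | wc -l gives a pending-job count directly; --states and --noheader are standard squeue options.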

Job states

Your job will be in one of the following states when running squeue:

  • R - The job is currently running.
  • PD - The job has been submitted and is waiting in the queue.
  • CG - The job is in the process of completing.

Once the job has finished, it will likely have one of the following states (shown via sacct or jobstats):

  • COMPLETED - The job has finished successfully (exit code 0)
  • FAILED - The job did not finish successfully (non-zero exit code)
  • OUT_OF_MEMORY - The job exceeded the requested memory
  • TIMEOUT - The job exceeded the requested runtime

See the official Slurm documentation for more information about job exit states and exit codes.
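Slurm records exit status as an exitcode:signal pair (0:0 for success). As a rough sketch of interpreting these values, assuming the common Unix convention that codes above 128 mean the process was killed by signal (code - 128), e.g. 137 = 128 + 9 (SIGKILL), which often accompanies out-of-memory kills:

```shell
# ExitCode as reported by Slurm is "exitcode:signal" (this value is illustrative)
exit_code="137:0"

# Take the part before the colon
code=${exit_code%%:*}

# Codes above 128 conventionally mean "killed by signal (code - 128)"
if [ "$code" -gt 128 ]; then
  echo "killed by signal $((code - 128))"   # prints: killed by signal 9
else
  echo "exited with code $code"
fi
```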

A job can be pending (PD) for a variety of reasons; this does not necessarily indicate a problem with your job. The output of squeue shows the reason the job is held in the queue in the final column, in parentheses.

See the official Slurm documentation for more information about job reason codes.
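When scripting, the pending reason can be pulled out of that final column. A minimal sketch using the illustrative pending job from the squeue example above:

```shell
# A pending job's line from squeue output (illustrative values)
line='123456   compute interact   abc123 PD       0:00      1 (Resources)'

# The reason is the last field, wrapped in parentheses; strip them off
echo "$line" | awk '{gsub(/[()]/, "", $NF); print $NF}'   # prints: Resources
```

For pending jobs, squeue --me --start (a standard squeue option) also reports the scheduler's estimated start times.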

Viewing job details

To see more information about a specific job (for example, the requested resources and output directories), use scontrol show job JOB_ID:

$ scontrol show job 123450
JobId=123450 JobName=test_job
   UserId=abc123(UID) GroupId=group(GID) MCS_label=N/A
   Priority=1 Nice=0 Account=ACCOUNT QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
...

The scontrol show job command only works while the job is queued or running, and for a short period after it finishes. After that, the job is written to the accounting database and removed from the scheduler. At that point, scontrol show job displays the following error instead of the job details:

slurm_load_jobs error: Invalid job id specified

To display jobs that have been written to the accounting database, use sacct or jobstats instead.
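For scripting against the accounting database, sacct's --parsable2 option (a standard sacct flag) produces pipe-delimited output that is easy to split. A minimal sketch, using an illustrative job ID and sample output:

```shell
# On the cluster (not runnable here), query the accounting database:
#   sacct -j 123450 --parsable2 --format=JobID,State,ExitCode
# which yields pipe-delimited lines like the illustrative sample below
sample='123450|COMPLETED|0:0'

# Split out the State field
state=$(echo "$sample" | cut -d'|' -f2)
echo "$state"   # prints: COMPLETED
```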

Checking where my job is in the queue

We provide some additional commands to display the current activity of the cluster and its queues in a more readable format than the standard squeue and sinfo commands.

nodestatus

nodestatus will show the current core and memory usage of all nodes. Note that some nodes are in partitions that may not be available to all users. Example output is shown below:

$ nodestatus

HOSTNAMES PARTITION      CPUS(A/I/O/T)  MEMORY    FREE_MEM  AVAIL_FEATURES      STATE          REASON
...
ddy19     compute*       48/0/0/48       376GB    338106    ddy                 allocated      none
ddy20     compute*       48/0/0/48       376GB    357888    ddy                 allocated      none
ddy21     compute*       48/0/0/48       376GB    354636    ddy                 allocated      none
ddy22     compute*       48/0/0/48       376GB    352217    ddy                 allocated      none
ddy23     compute*       46/2/0/48       376GB    338739    ddy                 mixed          none
ddy24     compute*       48/0/0/48       376GB    338195    ddy                 allocated      none
ddy25     compute*       48/0/0/48       376GB    296589    ddy                 allocated      none
ddy26     compute*       48/0/0/48       376GB    345574    ddy                 allocated      none
ddy27     compute*       47/1/0/48       376GB    335979    ddy                 mixed          none
ddy28     compute*       48/0/0/48       376GB    305058    ddy                 allocated      none
...

The CPUS column reports the Allocated/Idle/Other/Total cores for each node.

The FREE_MEM column shows the amount of unused memory on the node, as reported by the free command. This memory may already be allocated to running jobs, so it does not indicate the memory available to new jobs.
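When scripting against this output, the CPUS(A/I/O/T) field can be split on the "/" separator. A minimal sketch, using the value from the ddy23 row in the example above:

```shell
# CPUS(A/I/O/T) value taken from the nodestatus example above
cpus="46/2/0/48"

# Split on "/" into allocated, idle, other, total
IFS=/ read -r alloc idle other total <<EOF
$cpus
EOF

echo "allocated=$alloc idle=$idle other=$other total=$total"
# prints: allocated=46 idle=2 other=0 total=48
```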

Email notifications

To receive email notifications when your job status changes, add these options to your job script:

# send an email on specific job events (e.g. start, end, fail)
#SBATCH --mail-type=ALL

# The email address to notify
#SBATCH --mail-user=my_name@qmul.ac.uk

To receive email notifications only for specific job states, replace ALL with one (or more, comma-separated) of the available keywords. For example, use FAIL to be notified only when a job fails. See the full list of keywords in the official Slurm documentation.

If you do not add the --mail-user option, you will be emailed at the address that is registered with us (usually your QMUL email address).
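Putting this together, a minimal job script with email notifications might look like the following sketch. The job name, resource requests and workload are illustrative; adjust them for your own jobs:

```shell
#!/bin/bash
#SBATCH --job-name=test_job          # illustrative job name
#SBATCH --cpus-per-task=1            # illustrative resource requests
#SBATCH --mem=1G
#SBATCH --time=1:0:0
#SBATCH --mail-type=FAIL,END         # email only on failure or completion
#SBATCH --mail-user=my_name@qmul.ac.uk

# Your workload goes here
echo "Running on $(hostname)"
```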