Debugging your jobs¶
This page provides some general information on debugging jobs that fail to submit, run or complete. If you still cannot resolve the issue, please contact us, supplying all the relevant information.
Failure to submit¶
A job may be rejected by the scheduler and fail to submit if incorrect scheduler parameters are provided.
Check the following:
- The memory request has been adjusted for the number of tasks requested
- Requested resources are reasonable values, e.g. `--time` is less than 10 days
- Specified scheduler parameters are valid for the partition(s) the job is being submitted to (e.g. a GPU must be requested when submitting to GPU partitions)
- You have permission to access the resources you're requesting
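As a sketch, a job script that satisfies the checks above might look like the following (the resource values are illustrative, not cluster-specific limits). A quick `bash -n` syntax check can catch shell errors before the script ever reaches the scheduler:

```shell
# Write a minimal job script whose resource requests are consistent
# (values here are illustrative; check the limits for your cluster).
cat > myjob.sh <<'EOF'
#!/bin/bash
#SBATCH --ntasks=4           # number of tasks
#SBATCH --mem-per-cpu=2G     # memory is per task: total = 4 x 2G
#SBATCH --time=1-0           # 1 day, comfortably under a 10-day cap
echo "Job running on $(hostname)"
EOF

# Check the script for shell syntax errors before submitting
bash -n myjob.sh && echo "syntax OK"
```

On the cluster itself, `sbatch --test-only myjob.sh` asks the scheduler to validate the request without actually submitting the job.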
In cases where the job fails to submit, the scheduler should provide a failure message. If you are unsure what this means, please contact us.
Failure to run¶
Jobs may wait in the queue for a long period depending on the availability of resources; they may also have incorrect resource requests that prevent them from running.
To check the current status of your job(s), run `squeue --me`. This will show
all queued and running jobs submitted by you. Optionally, add the
`--states=pending` option to only show queued jobs. See the official Slurm
documentation for additional `squeue` options.
Whilst the job is queueing, the output of `squeue` will present the reason
why the job is being held in the queue in the final column, within brackets.
A few common reasons and their explanations have been listed below:
| Reason | Explanation |
|---|---|
| `(Resources)` | The job is waiting for cluster resources to become available |
| `(Priority)` | Another queued job with a higher priority exists for the requested partition(s) |
| `(Dependency)` | This job has a dependency on another job that has not been satisfied |
| `(QOSMaxCpuPerUserLimit)` | The CPU/task request exceeds the maximum quota applied per user |
| `(QOSMaxGRESPerUser)` | The GRES (GPU) request exceeds the maximum quota applied per user |
See the official Slurm documentation for more information about job reason codes.
Please contact us if you are unsure why your job is queueing.
Failure to complete¶
A job may fail to complete for a number of reasons:
- lack of disk quota
- bad characters in the script
- insufficient resources requested (check the job's resource usage)
Check the job output for the following:
- syntax errors in your script
- code failing to run and exiting with an error
- code failing to run because an expected file or directory did not exist
- permissions problem (can't read or write certain files)
- mismatch between tasks requested for the job and tasks used by the application. To avoid this, use `$SLURM_NTASKS` to provide the correct number of tasks to the application.
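For example, rather than hard-coding a task count, a job script can read it from the environment Slurm sets (`my_app` below is a hypothetical application name):

```shell
# $SLURM_NTASKS is set by Slurm inside a job; fall back to 1 so the
# script also behaves sensibly when run outside the scheduler.
ntasks=${SLURM_NTASKS:-1}

echo "Launching with ${ntasks} tasks"
# e.g. mpirun -np "${ntasks}" ./my_app   # hypothetical application
```

This way the application's task count can never disagree with what the scheduler actually allocated.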
If you're using software which needs a licence, such as Ansys or MATLAB, check that obtaining a licence was successful.
Job exit state¶
All successful jobs will finish with the `COMPLETED` state (exit status 0). If
your job does not exit with this state, it has failed in some way. Please check the
job output file for more details.
You may also enable email notifications or check the job statistics to see more information about the job.
Incorrect exit code 0

Sometimes you may see an exit code of 0 even though your job failed. The
two main causes of this are: a subsequent command exiting successfully
(for example `exit` or `echo "finished"`), or the job containing
sub-processes where the main process is not alerted when a
sub-process fails.
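The masking effect can be reproduced with two tiny scripts; adding `set -e` (and `pipefail` for pipelines) makes bash abort at the first failing command, so the job's exit code reflects the failure:

```shell
# Without `set -e`, a failing command does not abort the script, and the
# final `echo` masks the failure with exit code 0.
cat > flaky.sh <<'EOF'
#!/bin/bash
false               # simulated failing step
echo "finished"     # runs anyway; the script exits 0
EOF
bash flaky.sh; status_without=$?

# With `set -e` and `pipefail`, the script stops at the first failure
# and reports a non-zero exit code to the scheduler.
cat > strict.sh <<'EOF'
#!/bin/bash
set -eo pipefail
false               # script aborts here with a non-zero status
echo "finished"     # never reached
EOF
bash strict.sh; status_with=$?

echo "without set -e: ${status_without}, with set -e: ${status_with}"
```

This is a minimal sketch of the failure mode, not a cluster-specific recommendation; whether `set -e` suits your job script depends on how it handles expected non-zero statuses.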
DOS / Windows newline Characters¶
DOS / Windows uses
different characters
from Unix to represent newlines in files. Windows/DOS uses carriage return and
line feed (`\r\n`) as a line ending, while Unix systems just use line feed
(`\n`). This can cause issues when a script has been written on a Windows
machine and transferred to the cluster.
Typical errors include the following:

```
line 10: $'\r': command not found
ERROR:105: Unable to locate a modulefile for 'busco/3.0
'
```

The carriage return before the close quote indicates the presence of DOS / Windows newline characters, which can be detected with:

```
cat -v <script> | grep "\^M"
```
The file can then be fixed with:
```
$ dos2unix <script>
dos2unix: converting file <script> to UNIX format ...
```
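If `dos2unix` is not available, stripping the carriage returns with standard tools works too. The snippet below fabricates a file with DOS line endings to demonstrate both the detection and the fix:

```shell
# Create a file with DOS (\r\n) line endings for demonstration
printf 'echo hello\r\necho world\r\n' > dosfile.sh

# Detect: `cat -v` renders each carriage return as ^M
before=$(cat -v dosfile.sh | grep -c '\^M')
echo "lines with ^M before: ${before}"

# Fix without dos2unix: delete the \r characters with tr
tr -d '\r' < dosfile.sh > unixfile.sh
after=$(cat -v unixfile.sh | grep -c '\^M' || true)
echo "lines with ^M after: ${after}"
```

Note that `sed -i 's/\r$//' <script>` achieves the same in-place edit.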
Deadlocks¶
Parallel applications may enter a state where each process is waiting for another process to send a message or release a lock on a file or resource. The application then stops making progress while it waits for resources that will never become available; this is known as a deadlock. The only solution to a deadlock is adjusting the code to prevent it from occurring in the first place.
Monitoring jobs on nodes¶
Jobs can be monitored directly using tools such as `top` and `strace` by
logging into the node(s) via SSH.
SSH access is only permitted when running a job on a node
We have enabled the pam_slurm_adopt module on all Apocrita compute nodes. You will only be granted SSH access to a node if you have a currently running job or session on that node, and all commands run will be limited to (and share) the same resources requested for that session.
Using top¶
You can see all your processes on a node running your job using `top`:

```
ssh <node> -t top -u $USER
```
This can also be filtered to show specific jobs or tasks:
- Press `f` to open the fields display
- Use the `up` and `down` arrows to navigate to `CGROUPS`
- Press `space` to select the `CGROUPS` field
- Press `q` to leave the fields display
- Press `o` to open the filter
- Type `CGROUPS=<JID>` where `<JID>` is your job id or job and task id, e.g. `254210` or `254210.1`
- Press `Enter`
Only processes that are part of that job will now be displayed.
Using strace¶
`strace` is a tool that lists the system calls a process makes, allowing you
to see what a process is doing. This can be useful for identifying deadlocked
processes.
`strace` can either invoke the command to trace or be attached to a running
process:

```
# Run the command `hostname` and trace it
strace hostname

# Trace the currently running process 1234
strace -p 1234
```
Common useful arguments to `strace` are:

- `-f` - Trace forked processes
- `-t` - Prefix each output line with a timestamp
- `-v` - Print unabbreviated versions of common calls
- `-s <size>` - Specify the maximum string size to print (the default is 32)
Things to look out for that suggest a deadlock are:
- A continuous stream of `poll` calls resulting in `Timeout`
- A continuous stream of `sched_yield` calls