Debugging your jobs¶
This page provides some general information on debugging jobs that fail to submit, run or complete. If you still cannot resolve the issue, please contact us, supplying all the relevant information.
Failure to submit¶
A job may be rejected by the scheduler and fail to submit if incorrect scheduler parameters are provided.
Check the following:
- The memory request has been adjusted for the number of tasks requested
- Requested resources are reasonable values, e.g. `--time` is less than 10 days
- Specified scheduler parameters are valid for the partition(s) the job is being submitted to (e.g. a GPU must be requested when submitting to GPU partitions)
- You have permission to access the resources you're requesting
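As a sketch, a job script that satisfies the checks above might look like the following (the resource values are illustrative, not cluster-specific limits). A quick `bash -n` syntax check can catch shell errors before the script ever reaches the scheduler:

```shell
# Write a minimal job script whose resource requests are consistent
# (values here are illustrative; check the limits for your cluster).
cat > myjob.sh <<'EOF'
#!/bin/bash
#SBATCH --ntasks=4           # number of tasks
#SBATCH --mem-per-cpu=2G     # memory is per task: total = 4 x 2G
#SBATCH --time=1-0           # 1 day, comfortably under a 10-day cap
echo "Job running on $(hostname)"
EOF

# Check the script for shell syntax errors before submitting
bash -n myjob.sh && echo "syntax OK"
```

On the cluster itself, `sbatch --test-only myjob.sh` asks the scheduler to validate the request without actually submitting the job.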
In cases where the job fails to submit, the scheduler should provide a failure message. If you are unsure what this means, please contact us.
Failure to run¶
Jobs may wait in the queue for a long period depending on the availability of resources; they may also have incorrect resource requests that prevent them from running.
To check the current status of your job(s), run `squeue --me`. This will show
all queued and running jobs submitted by you. Optionally, add the
`--states=pending` option to only show queued jobs. See the official Slurm
documentation for additional `squeue` options.
Whilst the job is queueing, the output of `squeue` will present the reason
why the job is being held in the queue in the final column, within brackets.
A few common reasons and their explanations have been listed below:
| Reason | Explanation |
|---|---|
| `(Resources)` | The job is waiting for cluster resources to become available |
| `(Priority)` | Another queued job with a higher priority exists for the requested partition(s) |
| `(Dependency)` | This job has a dependency on another job that has not been satisfied |
| `(QOSMaxCpuPerUserLimit)` | The CPU/task request exceeds the maximum quota applied per user |
| `(QOSMaxGRESPerUser)` | The GRES (GPU) request exceeds the maximum quota applied per user |
See the official Slurm documentation for more information about job reason codes.
Please contact us if you are unsure why your job is queueing.
Failure to complete¶
A job may fail to complete for a number of reasons:
- lack of disk quota
- bad characters in the script
- insufficient resources requested (check the job's resource usage)
Check the job output for the following:
- syntax errors in your script
- code failing to run and exiting with an error
- code failing to run because an expected file or directory did not exist
- permissions problem (can't read or write certain files)
- mismatch between tasks requested for the job and tasks used by the application. To avoid this, use `$SLURM_NTASKS` to provide the correct number of tasks to the application.
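For example, rather than hard-coding a task count, a job script can read it from the environment Slurm sets (`my_app` below is a hypothetical application name):

```shell
# $SLURM_NTASKS is set by Slurm inside a job; fall back to 1 so the
# script also behaves sensibly when run outside the scheduler.
ntasks=${SLURM_NTASKS:-1}

echo "Launching with ${ntasks} tasks"
# e.g. mpirun -np "${ntasks}" ./my_app   # hypothetical application
```

This way the application's task count can never disagree with what the scheduler actually allocated.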
If you're using software which needs a licence, such as Ansys or MATLAB, check that obtaining a licence was successful.
Job exit state¶
All successful jobs will finish with the `COMPLETED` state (exit status 0). If
your job does not exit with this state, it has failed in some way. Please check the
job output file for more details.
You may also enable email notifications or check the job statistics to see more information about the job.
Incorrect exit code 0

Sometimes you may see an exit code of 0 even though your job failed. The
two main causes of this are: a subsequent command exiting successfully
(for example `exit` or `echo "finished"`), or the job containing
sub-processes where the main process is not alerted when a
sub-process fails.
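The masking effect can be reproduced with two tiny scripts; adding `set -e` (and `pipefail` for pipelines) makes bash abort at the first failing command, so the job's exit code reflects the failure:

```shell
# Without `set -e`, a failing command does not abort the script, and the
# final `echo` masks the failure with exit code 0.
cat > flaky.sh <<'EOF'
#!/bin/bash
false               # simulated failing step
echo "finished"     # runs anyway; the script exits 0
EOF
bash flaky.sh; status_without=$?

# With `set -e` and `pipefail`, the script stops at the first failure
# and reports a non-zero exit code to the scheduler.
cat > strict.sh <<'EOF'
#!/bin/bash
set -eo pipefail
false               # script aborts here with a non-zero status
echo "finished"     # never reached
EOF
bash strict.sh; status_with=$?

echo "without set -e: ${status_without}, with set -e: ${status_with}"
```

This is a minimal sketch of the failure mode, not a cluster-specific recommendation; whether `set -e` suits your job script depends on how it handles expected non-zero statuses.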
DOS / Windows newline Characters¶
DOS / Windows uses
different characters
from Unix to represent newlines in files. Windows/DOS uses carriage return and
line feed (`\r\n`) as a line ending, while Unix systems just use line feed
(`\n`). This can cause issues when a script has been written on a Windows
machine and transferred to the cluster.
Typical errors include the following:

```
line 10: $'\r': command not found
ERROR:105: Unable to locate a modulefile for 'busco/3.0
'
```

The carriage return before the close quote indicates the presence of DOS / Windows newline characters, which can be detected with:

```
cat -v <script> | grep "\^M"
```
The file can then be fixed with:
```
$ dos2unix <script>
dos2unix: converting file <script> to UNIX format ...
```
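If `dos2unix` is not available, stripping the carriage returns with standard tools works too. The snippet below fabricates a file with DOS line endings to demonstrate both the detection and the fix:

```shell
# Create a file with DOS (\r\n) line endings for demonstration
printf 'echo hello\r\necho world\r\n' > dosfile.sh

# Detect: `cat -v` renders each carriage return as ^M
before=$(cat -v dosfile.sh | grep -c '\^M')
echo "lines with ^M before: ${before}"

# Fix without dos2unix: delete the \r characters with tr
tr -d '\r' < dosfile.sh > unixfile.sh
after=$(cat -v unixfile.sh | grep -c '\^M' || true)
echo "lines with ^M after: ${after}"
```

Note that `sed -i 's/\r$//' <script>` achieves the same in-place edit.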
Deadlocks¶
Parallel applications may enter a state where each process is waiting for another process to send a message or release a lock on a file or resource. The application then stops making progress while it waits for resources that will never become available; this is known as a deadlock. The only solution to a deadlock is adjusting the code to prevent it from occurring in the first place.
Monitoring jobs on nodes¶
Jobs can be monitored directly using tools such as `top` and `strace` by
logging into the node(s) via SSH.
SSH access is only permitted when running a job on a node
We have enabled the pam_slurm_adopt module on all Apocrita compute nodes. You will only be granted SSH access to a node if you have a currently running job or session on that node, and all commands run will be limited to (and share) the same resources requested for that session.
Using top¶
You can see all your processes on a node running your job using `top`:

```
ssh <node> -t top -u $USER
```
This can also be filtered to show specific jobs or tasks:
- Press `f` to open the fields display
- Use the `up` and `down` arrows to navigate to `CGROUPS`
- Press `space` to select the `CGROUPS` field
- Press `q` to leave the fields display
- Press `o` to open the filter
- Type `CGROUPS=<JID>` where `<JID>` is your job id or job and task id, e.g. `254210` or `254210.1`
- Press `Enter`
Only processes that are part of that job will now be displayed.
Using strace¶
`strace` is a tool that lists the system calls a process makes, allowing you
to see what a process is doing. This can be useful for identifying deadlocked
processes.
`strace` can either invoke the command to trace or be attached to a running
process:

```
# Run the command `hostname` and trace it
strace hostname

# Trace the currently running process 1234
strace -p 1234
```
Common useful arguments to `strace` are:

- `-f` - Trace forked processes
- `-t` - Prefix each output line with a timestamp
- `-v` - Print unabbreviated versions of common calls
- `-s <size>` - Specify the maximum string size to print (the default is 32)
Things to look out for that suggest a deadlock are:
- A continuous stream of `poll` calls resulting in `Timeout`
- A continuous stream of `sched_yield` calls