Job runtime
Runtime, signified by -t or --time in the job script, defines the maximum
length of time a job is allowed to run. Jobs that exceed the requested
runtime will be automatically killed by the scheduler with an exit state of
TIMEOUT.
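As a minimal sketch, assuming the scheduler is Slurm (which the -t/--time option and the TIMEOUT state suggest), a job script requesting a one-hour runtime might look like the following. The workload lines are a placeholder for illustration only.

```shell
#!/bin/bash
#SBATCH --time=01:00:00   # maximum runtime of 1 hour (HH:MM:SS); -t is the short form

# Placeholder workload (hypothetical); replace with your own commands.
message="job started"
echo "${message}"
```

If the commands in the script are still running when the requested hour has elapsed, the scheduler kills the job.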
Queueing time depends mostly on tasks (cores) and RAM requested, not runtime. Since jobs exceeding the requested runtime will be killed, most users should request either:
- 1 hour (to use the computeshort and gpushort partitions), or
- 240 hours (10 days, the maximum allowed on all other partitions)
Requesting the maximum runtime helps prevent jobs from ending prematurely when they could have completed within the 10-day limit.
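The two recommended values can be written in the days-hours:minutes:seconds form that Slurm accepts for --time. The sketch below assumes Slurm; the helper name hours_to_slurm_time is hypothetical and is not part of any scheduler tooling.

```shell
#!/bin/bash
# Hypothetical helper: convert a whole number of hours into the
# D-HH:MM:SS form accepted by --time.
hours_to_slurm_time() {
    local hours=$1
    printf '%d-%02d:00:00\n' $((hours / 24)) $((hours % 24))
}

# Value for the one-hour partitions, e.g. "#SBATCH --time=<value>".
short_limit=$(hours_to_slurm_time 1)

# Value for the 240-hour (10-day) maximum on other partitions.
long_limit=$(hours_to_slurm_time 240)

echo "short: ${short_limit}"
echo "long:  ${long_limit}"
```

Either value can be placed after --time= in an #SBATCH line of the job script, or passed to sbatch on the command line.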
There are some edge cases that could mean a job requesting less than 10 days will get queued ahead of a 10-day job, but these usually relate to situations where we have reserved resources at a future date (e.g. maintenance periods).
The 240-hour limit is a global setting and cannot be changed for individual jobs or users. If you are submitting long-running jobs, you should consider:
- Attempting to parallelise the job
- Considering whether the job can be broken into smaller parts
- Profiling the code to check for bottlenecks
- Implementing checkpointing (a method of regularly dumping the job's state so that it can be restarted - check if your application supports this)
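To illustrate the checkpointing idea from the list above, here is a minimal shell sketch, not a definitive implementation. It assumes the work can be divided into numbered steps; the checkpoint file name, the CHECKPOINT_FILE variable, and the step count are all hypothetical. Real applications usually provide their own checkpoint mechanism, which should be preferred where available.

```shell
#!/bin/bash
# Hypothetical checkpoint location; override with CHECKPOINT_FILE if needed.
checkpoint_file="${CHECKPOINT_FILE:-checkpoint.txt}"
total_steps=10   # hypothetical number of work units

# Resume from the last completed step if a checkpoint exists.
start=0
if [ -f "${checkpoint_file}" ]; then
    start=$(cat "${checkpoint_file}")
fi

step=${start}
while [ "${step}" -lt "${total_steps}" ]; do
    # Placeholder for one unit of real work.
    step=$((step + 1))

    # Write to a temporary file and then rename it, so an interrupted
    # write never leaves a half-written checkpoint behind.
    echo "${step}" > "${checkpoint_file}.tmp"
    mv "${checkpoint_file}.tmp" "${checkpoint_file}"
done

echo "Completed ${step} of ${total_steps} steps"
```

If the job is killed partway through, resubmitting the same script continues from the last recorded step rather than starting again from zero.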