Frequently asked questions¶
On this page we list some common problems experienced when using the cluster, with suggestions on how to resolve them. If you contact us asking for help, please point to any solutions listed here that you have tried.
Why do I see an error like "$'\r': command not found"?¶
It is likely that you created the job script on a Windows machine, which uses
different newline characters. The issue is easily fixed by converting your file
to Unix newlines as described
here.
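As a minimal sketch of one common way to do the conversion (not necessarily the method described on the linked page), the carriage returns can be stripped with tr. The file name job_script.sh is a hypothetical placeholder, and the demonstration file is created in a temporary directory so that none of your own files are touched:

```shell
# Work in a throwaway directory; job_script.sh is a hypothetical file name.
workdir=$(mktemp -d)
script="$workdir/job_script.sh"

# Create a demonstration script with Windows (CRLF) line endings.
printf '#!/bin/bash\r\necho hello\r\n' > "$script"

# Delete every carriage return, writing to a temporary copy first
# so the original is only replaced if the conversion succeeds.
tr -d '\r' < "$script" > "$script.unix" && mv "$script.unix" "$script"

# Count the lines that still contain a carriage return (0 means converted).
cr=$(printf '\r')
remaining=$(grep -c "$cr" "$script" || true)
echo "lines still containing a carriage return: $remaining"
```

If the dos2unix utility happens to be available on your machine, running it on the file should have the same effect.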
We also recommend using one of the native text editors such as vim or nano
(some people find nano more intuitive for basic text editing) to edit your job
scripts directly on Apocrita. Note that while vim is available natively, the
nano module must be loaded before you can use it.
Why do I get "ssh: connect to host login.hpc.qmul.ac.uk port 22: Connection refused" or a similar message when trying to connect?¶
We use an automated system to protect against brute-force attacks. If you have 5 failed login attempts within 10 minutes, you will automatically be locked out for 30 minutes. It is likely that you are attempting to authenticate with an incorrect password. If you instead receive a "connection timed out" message, this may be a network issue, or your ISP may be blocking access to SSH port 22. In this case, it is worth checking whether SSH connections work to other machines you have access to, and contacting your ISP or network provider.
What can I do when my program fails to run with an error message like "cannot open shared object file: No such file or directory"?¶
Usually this means that the software has been dynamically linked: at runtime,
the environment needs to know where the external library dependencies are
located. If the library is provided by GCC, for example, then loading the
relevant gcc module will add additional directories to the LD_LIBRARY_PATH
for the system to search when the program is run. Additionally, the ldd
command will show the shared object dependencies of a compiled file. If you
are struggling to identify the missing library, please get in touch with the
team and provide the steps necessary to reproduce the issue, and we will
investigate for you.
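For example, the unresolved dependencies can be picked out of the ldd output as in the sketch below. Here /bin/sh is only a stand-in so that the snippet runs on most Linux systems; replace it with the path to your own program.

```shell
# Program to inspect; /bin/sh is a stand-in, substitute your own binary.
program="/bin/sh"

# ldd prints one line per shared library; unresolved ones are
# reported as "<library> => not found", so keep only those lines.
missing=$(ldd "$program" 2>/dev/null | grep "not found" || true)

if [ -z "$missing" ]; then
    echo "all shared libraries were resolved"
else
    echo "unresolved libraries:"
    echo "$missing"
fi
```

Any library listed as unresolved is a candidate for a module that still needs to be loaded.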
How can I build my program when I see an error message like "/usr/bin/ld: cannot find -llibrary"?¶
The environment cannot find certain dependencies to build the program.
Identifying the correct module or environment that provides this library will
be a matter of experience, but often there will be a module with a
similar-sounding name, e.g. "/usr/bin/ld: cannot find -lgsl" may be resolved by
loading the relevant GSL module into the environment. This is not always the
best solution, but it is a good starting point.
How can I install a package or program when I get "permission denied"?¶
Contrary to how you might install an application on a personal device, the applications on Apocrita are not installed as part of the operating system on each compute node, but are installed to shared storage mounted on all of the nodes. To install an application which is suitable for Apocrita but isn't currently provided by us, there are a couple of options:
- Install it locally within your own home folder or shared project
- Request that we install it for all users
If you are seeking to install it in your own storage space, when following
instructions designed for personal devices, there may be a step that attempts
to install the files into /usr/local/bin or some other space limited to
administrators. You will need to specify an install location within your home
folder or research project folder which you have full permissions to write
into. There may also be instructions which tell you to use the sudo
command to elevate privileges to administrative access. Any commands attempting
to use sudo will fail due to lack of access rights.
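As a rough sketch of the first option, many source builds accept an install prefix. The directory name apps/mytool is a hypothetical example, and the configure and make steps are shown as comments because the exact build procedure depends on the software you are installing:

```shell
# Hypothetical install location inside your home directory;
# any path you have write access to works.
prefix="$HOME/apps/mytool"

# A typical autotools-style build would then target that prefix
# instead of /usr/local (assumed steps, shown for illustration only):
#   ./configure --prefix="$prefix"
#   make
#   make install

# Put the installed programs on the search path for this shell session.
PATH="$prefix/bin:$PATH"
export PATH
echo "programs will be looked up in $prefix/bin first"
```

The PATH change only lasts for the current session, so it would need to be repeated in any job script that uses the program.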
Why does my job have lots of threads running but each using little CPU?¶
Some applications attempt to auto-detect the number of cores available to your
job, but often the result is that the application attempts to run as many
threads as there are processor cores on the entire compute node, rather than
what you have requested for your job. Fortunately, a lot of applications also
allow you to manually specify the number of cores available - where this is the
case, you can provide the variable $SLURM_NTASKS which takes the value of the
number of cores you requested for your job. An example is
here.
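A minimal sketch of passing the core count on is given below. It assumes an OpenMP program, which reads the standard OMP_NUM_THREADS variable; my_program and its --threads option are hypothetical stand-ins for an application with its own thread setting.

```shell
# Use the core count of the job; fall back to 1 when the variable
# is unset, for example when testing outside a job.
ncores="${SLURM_NTASKS:-1}"

# OpenMP programs read their thread count from OMP_NUM_THREADS.
export OMP_NUM_THREADS="$ncores"
echo "running with $OMP_NUM_THREADS thread(s)"

# Applications with their own option take the value directly
# (hypothetical program and flag, shown for illustration only):
#   ./my_program --threads "$ncores"
```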
Why did my job fail with an OUT_OF_MEMORY status?¶
An OUT_OF_MEMORY exit status indicates insufficient RAM was requested for
the job. The jobstats tool is helpful in determining how much RAM your
jobs used. If the failure happens at a very early stage of the job execution,
try the computeshort partition (-p computeshort) with a 1-hour runtime
(-t 1:0:0), and test with larger RAM requests, as the queueing time should be
much shorter than on the standard compute partition. When the error no
longer occurs, you can run the job again on the compute partition,
requesting more time.
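A short test job along these lines might look like the sketch below. The partition and time options are the ones given above. The --mem option is a standard Slurm flag, but the 8G value and the program name my_program are assumptions chosen for illustration, and the way memory should be requested on Apocrita may differ.

```shell
#!/bin/bash
#SBATCH -p computeshort     # short partition, for quicker queueing
#SBATCH -t 1:0:0            # 1 hour runtime
#SBATCH --mem=8G            # assumed value: raise it between test runs

# SLURM_JOB_ID is set inside a job; the fallback label is only for
# running this script by hand.
job_label="${SLURM_JOB_ID:-interactive}"
echo "starting memory test as job: $job_label"

# Hypothetical program under test:
#   ./my_program
```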
Why did my program work fine after build, but fails when submitted as a job?¶
If you built your custom program after loading additional modules (for example GCC, Java, or others), you also need to load the same module versions in your job script, otherwise the job will fail due to missing libraries or headers.
Can I run a Docker container on Apocrita?¶
While testing Docker, we found that it is possible to escalate user privileges, which is a considerable security risk, so we (and other HPC sites) don't have the Docker software installed. However, Apptainer (previously known as Singularity) is a container solution designed for HPC services which is compatible with Docker, and you can download and run Docker containers with Apptainer.
How can I fix an "UNPROTECTED PRIVATE KEY FILE!" warning?¶
This warning is shown when the permissions on your hidden .ssh directory
(usually in your home directory) or on your private SSH keys are too open
on your local machine. OpenSSH generates the warning when you attempt to
use the private key. To fix this, you will need to reset the permissions
back to the default on your local machine:
chmod 755 ~/.ssh
chmod 600 ~/.ssh/*
How can I fix a "Permission denied (publickey)" error?¶
This error is shown when you are not using your private SSH key when connecting, or when you are using the wrong private SSH key. In some cases it is also shown when your account is suspended. Please confirm that you have uploaded your public key in the correct format (QMUL users only) and that we have accepted it. If you are sure that your account is active, please check both of the following:
- You are using your private SSH key when connecting.
- Your private SSH key is correct and has not been overwritten with a new one.