# Andrena cluster
The Andrena cluster is a set of GPU nodes which were purchased with a Research Capital Investment Fund to support the University's Digital Environment Research Institute.
## Hardware
The cluster comprises 16 GPU nodes, each with 4 Nvidia A100 GPUs, providing 64 GPUs in total. The Andrena nodes are joined to Apocrita and use the same job scheduler and high-performance networking and storage.
## Requesting access
To request access to the Andrena computational resources or storage, please contact us to discuss requirements.
## Logging in to Andrena
The connection procedure is the same as for Apocrita. Please refer to the documentation below for how to submit jobs specifically to the Andrena cluster nodes.
## Running jobs on Andrena
Workloads are submitted using the job scheduler and work in exactly the same way as on Apocrita, which is documented thoroughly on this site. If you have been approved to use Andrena, submit jobs by adding the following to the resource request section of your job script:
```bash
#SBATCH -p andrena    # or --partition=andrena
```
Submitted jobs should follow a similar template to Apocrita GPU jobs, with the exception of the aforementioned partition. By default, Andrena jobs request 12 cores per GPU and 7500M of RAM per core.
!!! warning "Request RAM in megabytes, not GB"

    Andrena node jobs should request 12 cores per GPU requested and 7500M of
    RAM per core. Larger per-core RAM requests will be rejected. Slurm does
    not accept fractional values such as `7.5G` for `--mem-per-cpu`, so make
    the request in megabytes: `--mem-per-cpu=7500M`.
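As a quick sanity check of the arithmetic (a standalone sketch, not part of any job script), the per-core request multiplies out to the whole-job figure like this:

```python
# Per-core memory request, in megabytes, multiplied by the cores per GPU.
cores_per_gpu = 12
mem_per_cpu_mb = 7500       # i.e. --mem-per-cpu=7500M

total_mb = cores_per_gpu * mem_per_cpu_mb
print(f"total = {total_mb}M")   # 90000M, the ~90G total per GPU requested
```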
An example GPU job script using a conda environment might look like:
```bash
#!/bin/bash
#SBATCH -J jobname
#SBATCH -o %x.o%j             # single STDOUT/STDERR output file: jobname.o<job number>
#SBATCH -p andrena            # request the andrena partition
#SBATCH --cpus-per-gpu=12     # 12 cores per GPU
#SBATCH -t 240:0:0            # 240 hours runtime
#SBATCH --mem-per-cpu=7500M   # 12 * 7500M = 90G total system RAM
#SBATCH --gres=gpu:1          # request 1 GPU of any type

module load miniforge
mamba activate tensorflow_env
python train.py
```
A typical GPU job script using virtualenv will look similar. Some applications such as PyTorch are packaged with the necessary GPU libraries built in, so no additional modules are required for GPU support. However, CUDA libraries are not always installed as part of a `pip install`, so it may be necessary to load the relevant `cudnn` module to make the cuDNN and CUDA libraries available in your virtual environment. Note that loading the `cudnn` module also loads a compatible `cuda` module.
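The virtual environment itself is created once beforehand, for example on a login node. A minimal sketch (the directory name `venv` and the package to install are placeholders, not fixed by these docs):

```shell
# One-time setup: create and populate the virtualenv used by the job script.
python3 -m venv venv
source venv/bin/activate
python -m pip install --upgrade pip
# then install your application, e.g.: pip install torch
```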
```bash
#!/bin/bash
#SBATCH -J jobname
#SBATCH -o %x.o%j             # single STDOUT/STDERR output file: jobname.o<job number>
#SBATCH -p andrena            # request the andrena partition
#SBATCH --cpus-per-gpu=12     # 12 cores per GPU
#SBATCH -t 240:0:0            # 240 hours runtime
#SBATCH --mem-per-cpu=7500M   # 12 * 7500M = 90G total system RAM
#SBATCH --gres=gpu:1          # request 1 GPU of any type

module load python cudnn
source venv/bin/activate
python train.py
```
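To confirm, from inside the activated environment, whether the loaded modules have put the CUDA and cuDNN libraries on the loader path, a small check like this can help (a sketch; the exact library names vary between CUDA versions):

```python
import ctypes.util

def find_libs(names):
    """Map each library name to the filename the dynamic loader resolves, or None."""
    return {name: ctypes.util.find_library(name) for name in names}

if __name__ == "__main__":
    for name, resolved in find_libs(["cudart", "cudnn"]).items():
        print(f"{name}: {resolved or 'NOT FOUND'}")
```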