Running on GPUs

Overview

Teaching: 10 min
Exercises: 10 min
Questions
  • How do I run software that makes use of a GPU?

Objectives
  • Be able to submit jobs that can run on a GPU

While originally designed to make video-game graphics more impressive, Graphics Processing Unit (GPU) accelerators have since been found to give very high performance on numerically intensive computation (at the expense of the flexibility offered by a CPU). In particular, they perform very well for machine learning workloads, often giving orders of magnitude more speed than the CPU that drives them.

What’s available

Supercomputing Wales provides access to NVIDIA V100 GPUs on both the SUNBIRD and HAWK clusters. Additionally, latest-generation NVIDIA A100 GPUs are available on SUNBIRD as part of the AccelerateAI facility. Since these nodes are separate from the nodes used so far for CPU computation, we need to specify a different partition for Slurm to allocate the correct nodes.

System  | Partition    | GPUs available per node                           | Nodes | Access
--------|--------------|---------------------------------------------------|-------|-------------------------------------------
SUNBIRD | gpu          | 2 x V100 16GB                                     | 4     | Swansea and Aberystwyth users
SUNBIRD | s_gpu_eng    | 4 x V100 32GB                                     | 1     | Swansea Engineering users**
SUNBIRD | accel_ai*    | 8 x A100 40GB                                     | 5     | AccelerateAI users**
SUNBIRD | accel_ai_mig | 24 x A100 10GB, 8 x A100 5GB (multi-instance GPU) | 1     | AccelerateAI users**, for interactive use
HAWK    | gpu          | 2 x P100 16GB                                     | 13    | Cardiff and Bangor users
HAWK    | gpu_v100     | 2 x V100 16GB                                     | 15    | Cardiff and Bangor users

* In addition to the accel_ai partition, there is also an accel_ai_dev partition that gives access to the same nodes, but accepts a smaller number of shorter jobs in exchange for higher priority access. This is designed for running short tests before starting full runs on the accel_ai partition.

** For the AccelerateAI and s_gpu_eng partitions, please include a note in your project request that you will need access to these resources, and how you will use them. The technical team will then ensure that you are given access.
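
Since the accel_ai_mig partition is intended for interactive use, one way to use it is to request an interactive session with srun. A minimal sketch, reusing the scwXXXX project-code placeholder from the job scripts below (your site may recommend slightly different options):

$ srun --partition=accel_ai_mig --account=scwXXXX --gres=gpu:1 --pty /bin/bash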

sbatch options for GPUs

To submit to a GPU partition, we need to add two lines to our job scripts:

#SBATCH --partition=partition_name_goes_here
#SBATCH --gres=gpu:number_of_gpus_goes_here

Here, replace partition_name_goes_here with a partition name from the table above, and number_of_gpus_goes_here with the number of GPUs you want to use (most frequently 1). Slurm will then find a free GPU and reserve it for your job. Most GPU-enabled software (including common machine learning libraries like TensorFlow and PyTorch) will detect which GPU Slurm has assigned and use it automatically.
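
For example, to request a single GPU on SUNBIRD's gpu partition:

#SBATCH --partition=gpu
#SBATCH --gres=gpu:1

Behind the scenes, Slurm typically communicates this allocation by setting the CUDA_VISIBLE_DEVICES environment variable, which CUDA-based libraries read on start-up.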

To test this, let's run an example using TensorFlow.

First, create a new file called tf_simple.py, for example using nano. The following program uses TensorFlow to perform some basic matrix arithmetic, after checking that a GPU is available:

import tensorflow as tf

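# Check that TensorFlow can see the GPU, and that it was built with CUDA support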
print(tf.config.list_physical_devices('GPU'))
print(tf.test.is_built_with_cuda())
print(tf.test.gpu_device_name())
print(tf.config.get_visible_devices())

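# Log the device on which each operation is placed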
tf.debugging.set_log_device_placement(True)

# Create some tensors
a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
c = tf.matmul(a, b)

print(a, "times", b, "equals", c)
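
With device placement logging enabled, TensorFlow should also report where each operation runs, printing a line such as Executing op MatMul in device /device:GPU:0 (depending on the version, this may land in the job's stderr file rather than stdout).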

Now, to submit this to run on the cluster, we can create a job script called submit_tf.sh:

#!/bin/bash --login
###
# job name
#SBATCH --job-name=tensorflow_test
# job stdout file
#SBATCH --output=tensorflow_test.out.%J
# job stderr file
#SBATCH --error=tensorflow_test.err.%J
# maximum job time in D-HH:MM
#SBATCH --time=0-00:05
# specify our current project
# change this for your own work
#SBATCH --account=scwXXXX
# Specify the GPU partition
# (If the GPU partition is busy, the instructor may
# recommend a different one)
#SBATCH --partition=gpu
# Specify how many GPUs we would like to use
#SBATCH --gres=gpu:1
###

# Load Anaconda and activate our environment with TensorFlow installed
module load anaconda/2021.05
source activate scw_test

python tf_simple.py

This can now be submitted to the queue using sbatch:

$ sbatch submit_tf.sh
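
As with the jobs we have submitted previously, we can check its progress in the queue with squeue:

$ squeue -u $USER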

Once this runs, the output will look something like the following:

[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
True
/device:GPU:0
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
tf.Tensor(
[[1. 2. 3.]
 [4. 5. 6.]], shape=(2, 3), dtype=float32) times tf.Tensor(
[[1. 2.]
 [3. 4.]
 [5. 6.]], shape=(3, 2), dtype=float32) equals tf.Tensor(
[[22. 28.]
 [49. 64.]], shape=(2, 2), dtype=float32)

This shows that the GPU /device:GPU:0 was available to the job, and that the computation completed successfully.

Train a neural network

Copy the file /home/scw1389/tensorflow/test_train.py to your home directory. This program is borrowed from the TensorFlow tutorial and will train a small neural network to recognise handwritten digits, a common example problem in machine learning.

Adjust the job script we wrote above to run this code on the GPU, and test whether it works.
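
One possible solution is sketched below: it assumes test_train.py sits in your home directory, and allows ten minutes rather than five, since training a network takes longer than the arithmetic example (adjust to taste):

#!/bin/bash --login
###
#SBATCH --job-name=train_test
#SBATCH --output=train_test.out.%J
#SBATCH --error=train_test.err.%J
#SBATCH --time=0-00:10
#SBATCH --account=scwXXXX
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
###

module load anaconda/2021.05
source activate scw_test

python test_train.py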

Key Points

  • Use --partition=gpu or --partition=accel_ai to submit to a partition with GPUs.

  • Use --gres=gpu:1 (or similar) to specify the number of GPUs you need.