Overview
I recently investigated using multiple GPUs with TensorFlow. There's a bunch of useful documentation for doing
distributed training with TF. The tf.distribute.Strategy API is already extremely powerful and
seems to get better with every release. It has a lot of support for Keras models, and it also works
with custom training loops and general TensorFlow-backed computation. As the docs say, however, this requires
a bit more effort. The following is a method of using multiple GPUs without
tf.distribute.Strategy that I found useful for simple cases and thought I'd share. I'm going
to illustrate it with an example that takes advantage of data parallelism, but the principle can be applied to
model parallelism too. I'm going to assume some familiarity with the terminology in what follows, but if needed the
docs above are very good at introducing the ways GPU parallelism can be achieved.
Example - data parallelism
Say we have a function made up of TF operations that takes in a batch of input data. As we are going to exploit data
parallelism, we assume our function can work on each sample in the batch independently.
import tensorflow as tf

def expensive_operation(input_data: tf.data.Dataset):
    """Beastly function that does a bunch of expensive operations on input."""
    ...
    return output_data
We're currently exploiting a GPU to accelerate these operations but it's still running too slowly.
If you have multiple GPU devices available on your host then, out of the box, TensorFlow will place operations
on the lowest-indexed one by default and all the others will be left alone.
GPU utilisation
The first thing we want to do is check whether our function is using that single GPU efficiently. Ideally we want
GPU utilisation to be over 90% (as reported by nvidia-smi, this is the percentage of time over the sample period
during which at least one kernel was running on the GPU). If GPU utilisation isn't high then your operations aren't
currently a good candidate for exploiting other GPUs, as you're better off trying
to use the single one more efficiently first. The simplest way
to check this is to use nvidia-smi. Running the below will refresh the information about your GPUs every
num_secs seconds and allow you to tell whether the utilisation is where it needs to be.
watch -n num_secs nvidia-smi
If so, great, crack on to the next stage! If not, there is some useful documentation
on how to increase that utilisation.
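If you'd rather log utilisation from a script than watch it in a terminal, here is a minimal sketch that asks nvidia-smi for CSV output and parses it. The helper names (parse_gpu_utilisation, query_gpu_utilisation) and the sample output are my own illustrations, and the parser assumes every GPU reports a plain integer percentage.

```python
import subprocess
from typing import Dict


def parse_gpu_utilisation(csv_text: str) -> Dict[int, int]:
    """Parse 'index, utilisation' CSV rows into {gpu_index: percent}."""
    utilisation = {}
    for line in csv_text.strip().splitlines():
        index, percent = (field.strip() for field in line.split(","))
        utilisation[int(index)] = int(percent)
    return utilisation


def query_gpu_utilisation() -> Dict[int, int]:
    """Ask nvidia-smi for the current utilisation of every GPU on the host."""
    output = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=index,utilization.gpu",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return parse_gpu_utilisation(output)


# Hypothetical output from a two-GPU host where only GPU 0 is busy.
sample_output = "0, 97\n1, 0\n"
print(parse_gpu_utilisation(sample_output))  # {0: 97, 1: 0}
```

Calling query_gpu_utilisation() in a loop with a sleep gives roughly the same picture as the watch command, just in a form you can log or plot.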
Using tf.device
Get a list of available devices
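TensorFlow provides the tf.device context manager for manually pinning operations to a certain device.
To get a list of the GPU devices you have available by name, use the following (the device type is case-sensitive,
and the logical device names are the ones tf.device accepts):
gpus = tf.config.list_logical_devices('GPU')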
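To make the naming concrete, here is a small sketch that prints both the physical and the logical devices TensorFlow can see. Physical devices have names like /physical_device:GPU:0, while logical devices have names like /device:GPU:0, which is the form tf.device expects. The CPU listing is only there so the snippet prints something on a machine without GPUs.

```python
import tensorflow as tf

# Physical devices are the hardware TensorFlow can see on this host.
physical_gpus = tf.config.list_physical_devices("GPU")

# Logical devices are what operations actually get placed on.
logical_gpus = tf.config.list_logical_devices("GPU")
logical_cpus = tf.config.list_logical_devices("CPU")

print("Physical GPUs:", [device.name for device in physical_gpus])
print("Logical GPUs:", [device.name for device in logical_gpus])
print("Logical CPUs:", [device.name for device in logical_cpus])
```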
Pin to single device
We can now explicitly pin our operation to any GPU we want. For example, let's imagine we have 2 GPUs available and pin the above operations to the second one (indexed by 1).
gpus = tf.config.list_logical_devices('GPU')
with tf.device(gpus[1].name):
    expensive_operation(input_data)
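If you want to check that the pinning actually took effect, one option is to look at the device attribute of the resulting tensor. The sketch below uses a hypothetical pick_device helper that falls back to the CPU when the requested GPU doesn't exist, purely so that it runs anywhere. The matmul is just a stand-in for the expensive operation.

```python
import tensorflow as tf


def pick_device(preferred_gpu_index: int = 1) -> str:
    """Return the name of the requested GPU, or of the CPU if it isn't there."""
    gpus = tf.config.list_logical_devices("GPU")
    if preferred_gpu_index < len(gpus):
        return gpus[preferred_gpu_index].name
    # Assumption: falling back to the CPU is acceptable for this illustration.
    return tf.config.list_logical_devices("CPU")[0].name


device_name = pick_device(1)
with tf.device(device_name):
    x = tf.random.uniform((4, 4))
    y = tf.matmul(x, x)  # Stand-in for the expensive operation.

# In eager mode each tensor records the device it lives on.
print("Requested:", device_name)
print("Result lives on:", y.device)
```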
Run on multiple devices (sequentially)
It's as easy as that. I really like how TensorFlow gives you device access so easily.
Now we can use multiple GPUs at the same time!
gpus = tf.config.list_logical_devices('GPU')
num_gpus = len(gpus)
batch_data = input_data.batch(num_gpus)
# Pair each batch with a GPU and run the operation on that device.
for batch, gpu in zip(batch_data, gpus):
    with tf.device(gpu.name):
        expensive_operation(batch)
Run on multiple devices in parallel!
Hang on a sec. With TensorFlow running in eager execution mode (as it does by default), this
will just run on each of the GPUs sequentially. There is a section
in the TensorFlow docs that describes the above too.
The trick here is to run in graph mode, where operations that don't depend on each other can run in parallel.
@tf.function
def parallel_expensive_operation(input_data: tf.data.Dataset):
    gpus = tf.config.list_logical_devices('GPU')
    num_gpus = len(gpus)
    batch_data = input_data.batch(num_gpus)
    # Pair each batch with a GPU; the placements become part of the graph.
    for batch, gpu in zip(batch_data, gpus):
        with tf.device(gpu.name):
            expensive_operation(batch)
On the first call of the tf.function we have to pay a cost to trace the function and build the graph, but usually
this shouldn't be too bad.
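To see that this cost really is paid once, here is a tiny sketch that counts how often the Python body of a tf.function runs. The function double and the counter trace_count are made up for illustration. Calls with the same input shape and dtype reuse the graph that was already traced.

```python
import tensorflow as tf

trace_count = 0


@tf.function
def double(x):
    global trace_count
    # Python side effect: this line only runs while the function is being traced.
    trace_count += 1
    return x * 2


for _ in range(3):
    double(tf.constant([1.0, 2.0]))

# Three calls with the same shape and dtype, but only one trace.
print(trace_count)  # 1
```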
Capturing output
The last piece that's worth mentioning explicitly is about capturing output. As we're now working in graph mode,
we can't rely on appending outputs to a Python list in the usual way, as Python side effects are only executed
when the function is traced, not on every call. One solution to this is to use a
tf.TensorArray instead of a Python list to store our outputs.
As a quick example:
@tf.function
def parallel_expensive_operation(input_data: tf.data.Dataset):
    gpus = tf.config.list_logical_devices('GPU')
    num_gpus = len(gpus)
    results_container = tf.TensorArray(
        dtype=...,
        size=num_gpus,
        infer_shape=False,  # Required for different shape Tensors given by data shard.
    )
    batch_data = input_data.batch(num_gpus)
    for i, (batch, gpu) in enumerate(zip(batch_data, gpus)):
        with tf.device(gpu.name):
            # write() returns a new TensorArray, so keep the result.
            results_container = results_container.write(i, expensive_operation(batch))
    # Assumes the outputs only differ in their first dimension.
    return results_container.concat()
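Putting the pieces together, here is a self-contained sketch of the whole pattern that should run as-is. It makes a few assumptions that aren't in the code above: the input is a single tensor split along its first axis rather than a tf.data.Dataset, expensive_operation is a toy stand-in, and the code falls back to the CPU when no GPU is visible.

```python
import tensorflow as tf

# Assumption: use every visible GPU, or fall back to the CPU so this runs anywhere.
_devices = tf.config.list_logical_devices("GPU") or tf.config.list_logical_devices("CPU")
DEVICE_NAMES = [device.name for device in _devices]


def expensive_operation(batch: tf.Tensor) -> tf.Tensor:
    """Toy stand-in for the real work: one value per row of the batch."""
    return tf.reduce_sum(tf.square(batch), axis=-1)


@tf.function
def parallel_expensive_operation(data: tf.Tensor) -> tf.Tensor:
    num_devices = len(DEVICE_NAMES)
    # Split the rows as evenly as possible (assumes the row count is known statically).
    base, extra = divmod(data.shape[0], num_devices)
    sizes = [base + (1 if i < extra else 0) for i in range(num_devices)]
    shards = tf.split(data, sizes, axis=0)

    results = tf.TensorArray(dtype=data.dtype, size=num_devices, infer_shape=False)
    for i, (shard, name) in enumerate(zip(shards, DEVICE_NAMES)):
        with tf.device(name):
            results = results.write(i, expensive_operation(shard))
    # Join the per-device outputs back together along the first axis.
    return results.concat()


data = tf.random.uniform((16, 4))
result = parallel_expensive_operation(data)
print(result.shape)
```

Splitting a tensor with tf.split keeps the sketch short. With a tf.data.Dataset you would need a different way of handing each device its share of the data.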
That's it!
There are a couple of things to be wary of when using multiple GPUs. For example,
any tf.Variable is created on a specific device. On a multi-GPU host this will be the GPU with the
lowest index unless specified otherwise. Depending on the data transfer speed between your GPU devices this can be
quite important, as it means that variables need to be read from one GPU device to another. If, for example, your data
transfer speed is slow and a variable is read across devices many times, that is something we want to avoid. I may write about
that in a separate blog post but hopefully this has been useful for now! If you're interested in doing
distributed computations, two more useful things to check out are JAX and Ray.
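Going back to the point about variable placement for a moment: the same tf.device context manager controls where a variable is created. Here is a minimal sketch, assuming you want the variable on the last available device (with a CPU fallback so it runs without GPUs). The variable name and shape are arbitrary.

```python
import tensorflow as tf

# Assumption: fall back to the CPU so the sketch also runs without GPUs.
devices = tf.config.list_logical_devices("GPU") or tf.config.list_logical_devices("CPU")
target = devices[-1].name  # For example, the highest-indexed GPU.

# Variables created inside the context are placed on the chosen device.
with tf.device(target):
    weights = tf.Variable(tf.ones((2, 2)), name="weights")

print("Requested:", target)
print("Variable lives on:", weights.device)
```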
Thanks for reading :)