Using multiple GPUs with TensorFlow

Overview

I recently investigated using multiple GPUs with TensorFlow. There's a bunch of useful documentation for doing distributed training with TF. The tf.distribute.Strategy API is already extremely powerful and seems to get better with every release. It has a lot of support for Keras models, and it also works with custom training loops and general TensorFlow-backed computation although, as the docs say, this requires a bit more effort.

What follows is a method of using multiple GPUs without tf.distribute.Strategy that I found useful for simple cases and thought I'd share. I'm going to illustrate it with an example that takes advantage of data parallelism, but the principle can be applied to model parallelism too. I'm going to assume some familiarity with the terminology; if needed, the docs mentioned above are very good at introducing the ways GPU parallelism can be achieved.

Example - data parallelism

Say we have a function made up of TF operations that takes in a batch of input data. As we are going to exploit data parallelism, we assume the function can work on each sample in the batch independently.

    
        import tensorflow as tf

        def expensive_operation(input_data: tf.Tensor) -> tf.Tensor:
            """Beastly function that does a bunch of expensive operations on input."""
            ...

            return output_data

We're currently exploiting a GPU to accelerate these operations but it's still running too slowly. If you have multiple GPU devices available on your host then, out of the box, TensorFlow will place operations on the lowest-indexed GPU and leave all the others idle.

GPU utilisation

The first thing we want to do is check whether our function is using that single GPU efficiently. Ideally we want GPU utilisation to be over 90% (as reported by nvidia-smi, this is the percentage of time over the sampling period during which at least one kernel was running on the GPU). If GPU utilisation isn't high then your operations aren't yet a good candidate for spreading across other GPUs, as you're better off trying to use the single one more efficiently first. The simplest way to check this is to use nvidia-smi. Running the below will refresh the information about your GPUs every num_secs seconds and allow you to tell whether the utilisation is where it needs to be.

    
        watch -n num_secs nvidia-smi

If so, great, crack on to the next stage! If not, there is some useful documentation on how to increase that utilisation.
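
If you'd rather check the number from a script than eyeball it, here is a minimal sketch. It assumes nvidia-smi is on your PATH and reports a plain number for every GPU; the helper names `parse_gpu_utilisation` and `query_gpu_utilisation` are my own, not part of any library.

```python
import subprocess
from typing import List


def parse_gpu_utilisation(csv_text: str) -> List[int]:
    """Turn the query output (one percentage per line) into a list of ints."""
    return [int(line.strip()) for line in csv_text.splitlines() if line.strip()]


def query_gpu_utilisation() -> List[int]:
    """Ask nvidia-smi for the current utilisation of every GPU on the host."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=utilization.gpu",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,  # raise if nvidia-smi exits with an error
    )
    return parse_gpu_utilisation(result.stdout)
```

On a host with the NVIDIA driver installed, calling `query_gpu_utilisation()` returns one integer per GPU, which you can compare against the 90% target while your function is running.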

Using `tf.device`

Get a list of available devices

TensorFlow provides the tf.device context manager for manually pinning operations to a certain device. tf.device expects the name of a logical device, and the device type string is case-sensitive, so to get a list of the GPU devices you have available by name use:

    
        gpus = tf.config.list_logical_devices('GPU')

Pin to single device

We can now explicitly pin our operation to any GPU we want. For example let's imagine we have 2 GPUs available and pin the above operations to the second one (indexed by 1).

    
        # Logical device names look like '/device:GPU:1', which is what tf.device expects.
        gpus = tf.config.list_logical_devices('GPU')
        with tf.device(gpus[1].name):
            expensive_operation(input_data)
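
To convince yourself the pinning worked, you can inspect the `device` attribute of the resulting tensor in eager mode. The sketch below is only an illustration: it uses a small matmul as a stand-in for the real workload and falls back to the CPU so it also runs on a machine without GPUs.

```python
import tensorflow as tf

# Logical devices are the ones tf.device accepts. Fall back to the CPU when
# no GPU is visible so the sketch still runs.
devices = tf.config.list_logical_devices("GPU") or tf.config.list_logical_devices("CPU")
target = devices[-1]  # the last device, i.e. index 1 on a 2-GPU host

with tf.device(target.name):
    x = tf.random.uniform((4, 4))
    y = tf.matmul(x, x)  # stand-in for the expensive operation

# Eager tensors record the device that produced them.
print(target.name, "->", y.device)
```

The printed device string is the fully qualified form of the logical device name, so it should end with the name you passed to tf.device.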

Run on multiple devices (sequentially)

It's as easy as that. I really like how TensorFlow gives you device access so easily. Now we can use multiple GPUs at the same time!

    
        gpus = tf.config.list_logical_devices('GPU')
        num_gpus = len(gpus)
        # Note: batch(n) puts n samples in each batch, so every GPU gets n samples here.
        batch_data = input_data.batch(num_gpus)
        for batch, gpu in zip(batch_data, gpus):
            with tf.device(gpu.name):
                expensive_operation(batch)

Run on multiple devices in parallel!

Hang on a sec. With TensorFlow running in eager execution mode (as it does by default), this will just run on each of the GPUs sequentially. There is a section in the TensorFlow docs that describes the above too. The trick here is to run in graph mode, where the runtime is free to execute the independent per-GPU operations in parallel.

    
        @tf.function
        def parallel_expensive_operation(input_data: tf.data.Dataset):
            gpus = tf.config.list_logical_devices('GPU')
            num_gpus = len(gpus)
            batch_data = input_data.batch(num_gpus)
            for batch, gpu in zip(batch_data, gpus):
                with tf.device(gpu.name):
                    expensive_operation(batch)

On the first call of the tf.function we have to pay a one-off cost to trace and build the graph, but usually this shouldn't be too bad.
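
A quick way to see that this cost is only paid once is to record something in Python from inside the function. This is a toy sketch, with `double` and `trace_log` made up for illustration: the Python-level append only happens while TensorFlow traces the function, so repeat calls with the same input signature reuse the graph.

```python
import tensorflow as tf

trace_log = []  # hypothetical Python-side record of how often tracing happens


@tf.function
def double(x):
    # Python side effect: runs while the function is traced, not on every call.
    trace_log.append("traced")
    return x * 2


double(tf.constant(1.0))
double(tf.constant(2.0))  # same dtype and shape, so the existing graph is reused

print("number of traces:", len(trace_log))
```

Both calls pass a scalar float32 tensor, so I'd expect this to report a single trace. Passing inputs with a different dtype or shape would trigger another one.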

Capturing output

The last piece that's worth mentioning explicitly is capturing output. As we're now working in graph mode we can't simply append outputs to a Python list and expect it to be filled on every call, as Python side effects only run while the function is being traced (on the first call), not each time the graph executes. One solution to this is to use a tf.TensorArray instead of a Python list to store our outputs. As a quick example:

    
        @tf.function
        def parallel_expensive_operation(input_data: tf.data.Dataset):
            gpus = tf.config.list_logical_devices('GPU')
            num_gpus = len(gpus)

            results_container = tf.TensorArray(
                dtype=...,
                size=num_gpus,
                infer_shape=False,  # Required for different shape Tensors given by data shard.
            )
            batch_data = input_data.batch(num_gpus)
            for i, (batch, gpu) in enumerate(zip(batch_data, gpus)):
                with tf.device(gpu.name):
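                    results_container = results_container.write(i, expensive_operation(batch))

            # Join the per-GPU outputs along the first axis (assumes the outputs
            # share every dimension except the first) and return them.
            return results_container.concat()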
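
For something you can run end to end, here is a self-contained sketch of the same idea. It makes a few assumptions that are mine rather than part of the original example: the input is a plain tensor split with tf.split instead of a tf.data.Dataset, `expensive_operation` is a trivial stand-in, and the device list falls back to the CPU when no GPU is visible.

```python
import tensorflow as tf

# Fall back to the CPU so the sketch also runs on a machine without GPUs.
devices = tf.config.list_logical_devices("GPU") or tf.config.list_logical_devices("CPU")


def expensive_operation(batch: tf.Tensor) -> tf.Tensor:
    """Hypothetical stand-in for the real workload: one value per sample."""
    return tf.reduce_sum(tf.square(batch), axis=-1)


@tf.function
def parallel_expensive_operation(samples: tf.Tensor) -> tf.Tensor:
    num_devices = len(devices)
    # One shard per device. tf.split with an integer needs an even division.
    shards = tf.split(samples, num_devices, axis=0)
    results = tf.TensorArray(dtype=tf.float32, size=num_devices, infer_shape=False)
    # The loop is over Python lists, so it is unrolled when the graph is traced.
    for i, (shard, device) in enumerate(zip(shards, devices)):
        with tf.device(device.name):
            results = results.write(i, expensive_operation(shard))
    # Join the per-device outputs along the first (sample) axis.
    return results.concat()


# Four samples per device keeps the split even whatever the device count.
samples = tf.random.uniform((4 * len(devices), 3))
output = parallel_expensive_operation(samples)
print(output.shape)
```

The output has one entry per input sample, in the original order, because the shards are written to the TensorArray in the same order they were split.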

That's it!

There are a couple of things to be wary of when using multiple GPUs. For example, every tf.Variable is created on a specific device. On a multi-GPU host this will be the GPU with the lowest index unless specified otherwise. Depending on the data transfer speed between your GPU devices this can be quite important, as it means variables may need to be read from one GPU device by another. If your data transfer speed is slow and a variable is read across devices many times, that is something we want to avoid. I may write about that in a separate blog post, but hopefully this has been useful for now! If you're interested in doing distributed computations, two more useful things to check out are JAX and Ray. Thanks for reading :)