Tuesday, June 26, 2018

[XLA JIT] How to turn on XLA JIT compilation at multiple GPUs training

Before discussing this question, let's recall how to turn on XLA JIT compilation in the TensorFlow Python API.

1. Session
Turning on JIT compilation at the session level will result in all possible operators being greedily compiled into XLA computations. Each XLA computation will be compiled into one or more kernels for the underlying device.
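For reference, here is a minimal sketch of session-level JIT (TF 1.x API; tf.contrib and tf.ConfigProto were removed in TF 2.x, so this only applies to 1.x):

```python
import tensorflow as tf  # TF 1.x only

# Turn on XLA JIT globally via the session's graph optimizer options.
config = tf.ConfigProto()
config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1

# Every compilable operator in graphs run by this session is now
# greedily clustered and JIT-compiled by XLA.
sess = tf.Session(config=config)
```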

2. Manual
JIT compilation can also be turned on manually for one or more operators. This is done by tagging the operators to compile with the attribute _XlaCompile=true. The simplest way to do this is via the tf.contrib.compiler.jit.experimental_jit_scope() scope defined in tensorflow/contrib/compiler/jit.py.
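A minimal sketch of the manual approach (TF 1.x; the tensors x and w here are placeholders of my own, not from the CIFAR-10 script):

```python
import tensorflow as tf  # TF 1.x only; tf.contrib is gone in TF 2.x

jit_scope = tf.contrib.compiler.jit.experimental_jit_scope

x = tf.placeholder(tf.float32, [None, 1024])
w = tf.get_variable('w', [1024, 10])

# Operators created inside the scope are tagged with _XlaCompile=true,
# so XLA compiles just this cluster instead of the whole graph.
with jit_scope():
    logits = tf.matmul(x, w)
```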

3. Placing operators on XLA devices (we won't consider this option because it requires too much tedious work)
Another way to run computations via XLA is to place an operator on a specific XLA device. This method is normally only used for testing.
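For completeness only, since the post skips this option, a sketch of explicit XLA device placement in TF 1.x might look like:

```python
import tensorflow as tf  # TF 1.x only

# Placing ops on the XLA_GPU device forces them to run through XLA.
with tf.device("/device:XLA_GPU:0"):
    a = tf.constant([1.0, 2.0])
    b = tf.constant([3.0, 4.0])
    c = a + b  # computed on the XLA device

with tf.Session() as sess:
    print(sess.run(c))
```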

Basically, I have tried the first two options in this script: cifar10_multi_gpu_train.py, because it already contains code for training on multiple GPUs with synchronous updates.

For the first option (Session), it doesn't work with multi-GPU training. This option forces TensorFlow to greedily compile all possible operators into XLA computations, and we can't be sure that the multi-GPU, synchronous-update design in cifar10_multi_gpu_train.py survives that transformation.

But the second option (Manual) does work with multi-GPU training, using a JIT scope.

Here is my experiment using the second option (Manual, with a JIT scope):
Source: https://github.com/tensorflow/models/blob/master/tutorials/image/cifar10/cifar10_multi_gpu_train.py
Batch Size: 6000
Total Iterations: 2000

How to turn on XLA JIT compilation in cifar10_multi_gpu_train.py?
# Add the jit_scope definition near the beginning of your code
jit_scope = tf.contrib.compiler.jit.experimental_jit_scope

# Wrap each GPU tower in a jit_scope() scope
    with tf.variable_scope(tf.get_variable_scope()):
      for i in xrange(FLAGS.num_gpus):
        with tf.device('/gpu:%d' % i):
          with jit_scope():  # <-- Add this line
            with tf.name_scope('%s_%d' % (cifar10.TOWER_NAME, i)) as scope:
              # Dequeues one batch for the GPU
              image_batch, label_batch = batch_queue.dequeue()

Case1: Turning on XLA JIT ( using jit scope )
Training time: 910 seconds
Avg. images/sec: 36647
Memory Usage and GPU-Util%:

Case2: Turning off XLA JIT ( no XLA )
Training time: 1120 seconds
Avg. images/sec: 25203
Memory Usage and GPU-Util%:

In Summary:
Turning on XLA JIT for multi-GPU training in this experiment cuts the training time by more than 18% (1120 s down to 910 s) and raises the average throughput from 25,203 to 36,647 images/sec.
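The improvement figures can be double-checked from the reported numbers alone:

```python
# Sanity-check the speedup, using only the figures reported above.
time_off, time_on = 1120, 910    # training time in seconds, XLA off vs. on
ips_off, ips_on = 25203, 36647   # avg. images/sec, XLA off vs. on

time_reduction = (time_off - time_on) / time_off  # fraction of time saved
throughput_gain = ips_on / ips_off - 1            # relative throughput increase

print("time reduction: %.1f%%" % (time_reduction * 100))    # -> 18.8%
print("throughput gain: %.1f%%" % (throughput_gain * 100))  # -> 45.4%
```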
