Tuesday, October 23, 2018

[TensorFlow] Does increasing the number of CUDA streams in TensorFlow improve processing time and transmission time?

Before increasing the number of CUDA streams in TensorFlow, I want to recap some ideas about the Executor module. When a TensorFlow session runs, it builds an Executor. If CUDA is enabled in the TensorFlow build configuration, the Executor also adds the visible GPU devices and creates a TF device object (a GPUDevice object) that maps to each physical GPU device. There are 4 kinds of streams inside GPUDevice:

  • CUDA stream 
  • Host_to_Device stream
  • Device_to_Host stream
  • Device_to_Device stream



By default, there is only 1 stream of each of these 4 kinds (note the max_streams argument passed to BaseGPUDevice). Please check out the following code:
class GPUDevice : public BaseGPUDevice {
 public:
  GPUDevice(const SessionOptions& options, const string& name,
            Bytes memory_limit, const DeviceLocality& locality,
            TfGpuId tf_gpu_id, const string& physical_device_desc,
            Allocator* gpu_allocator, Allocator* cpu_allocator)
      : BaseGPUDevice(options, name, memory_limit, locality, tf_gpu_id,
                      physical_device_desc, gpu_allocator, cpu_allocator,
                      false /* sync every op */, 1 /* max_streams */) {
    if (options.config.has_gpu_options()) {
      force_gpu_compatible_ =
          options.config.gpu_options().force_gpu_compatible();
    }
  }
  ...
  ...
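To make the role of max_streams concrete, here is a minimal sketch of how I picture it: the device holds max_streams groups, and each group holds one stream of each of the 4 kinds listed above. ToyStream, ToyStreamGroup and MakeStreamGroups are hypothetical names made up for this illustration; they are not TensorFlow's actual classes, and no real CUDA stream is created.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a real GPU stream; it only carries a label.
struct ToyStream {
  std::string label;
};

// One group holds one stream of each of the four kinds.
struct ToyStreamGroup {
  ToyStream compute;
  ToyStream host_to_device;
  ToyStream device_to_host;
  ToyStream device_to_device;
};

// Builds max_streams groups, so the toy device owns 4 * max_streams streams.
std::vector<ToyStreamGroup> MakeStreamGroups(int max_streams) {
  std::vector<ToyStreamGroup> groups;
  for (int i = 0; i < max_streams; ++i) {
    const std::string id = std::to_string(i);
    groups.push_back(ToyStreamGroup{ToyStream{"compute:" + id},
                                    ToyStream{"h2d:" + id},
                                    ToyStream{"d2h:" + id},
                                    ToyStream{"d2d:" + id}});
  }
  return groups;
}
```

Under this assumption, MakeStreamGroups(1) models the default configuration (4 streams in total) and MakeStreamGroups(2) models the modified build (8 streams in total).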

If I change max_streams to 2, does it improve the training or inference speed? In my experiment, the answer is "No". Please see the pictures below:

My case does a lot of memcpy between the GPU and CPU devices, and "stream=2" improves neither the processing time nor the transmission time. The result makes sense: the processing time is bound by the GPU SMs and the transmission time is bound by PCIe, so an extra stream adds no capacity to either bottleneck.


