- Compute (CUDA) stream
- Host_to_Device stream
- Device_to_Host stream
- Device_to_Device stream
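These four kinds of streams are bundled together in TensorFlow's internal StreamGroup struct (declared in tensorflow/core/common_runtime/gpu/gpu_device.h). The sketch below is simplified: Stream is just a stand-in for the StreamExecutor stream type, and the exact field types differ between TensorFlow versions:

#include <vector>

// Simplified sketch of TensorFlow's StreamGroup; not the verbatim source.
struct Stream;  // stand-in for the StreamExecutor stream type (se::Stream)

struct StreamGroup {
  Stream* compute = nullptr;              // CUDA stream that launches kernels
  Stream* host_to_device = nullptr;       // H2D memcpy stream
  Stream* device_to_host = nullptr;       // D2H memcpy stream
  std::vector<Stream*> device_to_device;  // D2D memcpy stream(s)
};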
By default, TensorFlow creates only one stream of each of these four kinds, i.e. a single StreamGroup per GPU device. Please check out the following code (the GPUDevice constructor in TensorFlow's gpu_device_factory.cc):
class GPUDevice : public BaseGPUDevice {
 public:
  GPUDevice(const SessionOptions& options, const string& name,
            Bytes memory_limit, const DeviceLocality& locality,
            TfGpuId tf_gpu_id, const string& physical_device_desc,
            Allocator* gpu_allocator, Allocator* cpu_allocator)
      : BaseGPUDevice(options, name, memory_limit, locality, tf_gpu_id,
                      physical_device_desc, gpu_allocator, cpu_allocator,
                      false /* sync every op */, 1 /* max_streams */) {
    if (options.config.has_gpu_options()) {
      force_gpu_compatible_ =
          options.config.gpu_options().force_gpu_compatible();
    }
  }
  ...
  ...
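The 1 /* max_streams */ argument matters because BaseGPUDevice::Init creates one StreamGroup per stream id. The loop below is a heavily paraphrased sketch, not the real source: streams_ and max_streams_ are BaseGPUDevice members, while CreateStreamGroup is a hypothetical stand-in for TensorFlow's internal StreamGroupFactory call, whose exact signature varies between versions:

// Paraphrased sketch of BaseGPUDevice::Init() in gpu_device.cc.
for (int i = 0; i < max_streams_; i++) {
  // Each StreamGroup bundles one compute stream plus the host_to_device,
  // device_to_host and device_to_device copy streams listed above.
  streams_.push_back(CreateStreamGroup(tf_gpu_id_, /*stream_id=*/i));  // hypothetical helper
}

So with the default max_streams = 1, every kernel launch and every memcpy on a GPU device shares that single group of streams.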
If I change max_streams to 2, does it help to improve training or inference speed? In my experiment, the answer is "No". Please see the pictures below:
My case does a lot of memcpy between the GPU and CPU devices, and "max_streams = 2" does not reduce either the processing time or the transmission time. The result also makes sense: the processing time is bounded by the GPU SMs and the transmission time is bounded by PCIe bandwidth, so adding a second stream relieves neither bottleneck.
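For reference, the only change tested is the hard-coded constructor argument shown earlier (which requires rebuilding TensorFlow from source):

      : BaseGPUDevice(options, name, memory_limit, locality, tf_gpu_id,
                      physical_device_desc, gpu_allocator, cpu_allocator,
                      false /* sync every op */, 2 /* max_streams */) {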
P.S.: Related information:
https://stackoverflow.com/questions/42869340/tensorflow-why-does-a-gpu-device-only-have-one-device-context-and-does-it-rea