Wednesday, January 16, 2019

[TensorFlow] How to use Distribution Strategy in TensorFlow?

As announced in "What’s coming in TensorFlow 2.0", TensorFlow 2.0 is coming soon, and several of its features are already ready to use, for instance, Distribution Strategy. Quoting from the article:
"For large ML training tasks, the Distribution Strategy API makes it easy to distribute and train models on different hardware configurations without changing the model definition. Since TensorFlow provides support for a range of hardware accelerators like CPUs, GPUs, and TPUs, you can enable training workloads to be distributed to single-node/multi-accelerator as well as multi-node/multi-accelerator configurations, including TPU Pods. Although this API supports a variety of cluster configurations, templates to deploy training on Kubernetes clusters in on-prem or cloud environments are provided."


To find out which distribution strategies TensorFlow provides, we can check here:
https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/distribute
Quoted from the site:
"Currently, we support several types of strategies:
  • MirroredStrategy: This does in-graph replication with synchronous training on many GPUs on one machine. Essentially, we create copies of all variables in the model's layers on each device. We then use all-reduce to combine gradients across the devices before applying them to the variables to keep them in sync.
  • CollectiveAllReduceStrategy: This is a version of MirroredStrategy for multi-worker training. It uses a collective op to do all-reduce. This supports between-graph communication and synchronization, and delegates the specifics of the all-reduce implementation to the runtime (as opposed to encoding it in the graph). This allows it to perform optimizations like batching and switch between plugins that support different hardware or algorithms. In the future, this strategy will implement fault-tolerance to allow training to continue when there is worker failure.
  • ParameterServerStrategy: This strategy supports using parameter servers either for multi-GPU local training or asynchronous multi-machine training. When used to train locally, variables are not mirrored, instead they are placed on the CPU and operations are replicated across all local GPUs. In a multi-machine setting, some are designated as workers and some as parameter servers. Each variable is placed on one parameter server. Computation operations are replicated across all GPUs of the workers."
As we can see, the distribution strategies in TensorFlow are very convenient to adopt in the training process. To show just how convenient they are, I found a good example that demonstrates the result:

https://github.com/shu-yusa/tensorflow-mirrored-strategy-sample/blob/master/cnn_mnist.py
This example uses MirroredStrategy as follows, so that the model runs on 2 GPUs:
distribution = tf.contrib.distribute.MirroredStrategy(num_gpus=num_gpus)
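In the linked example, the strategy is handed to an Estimator through its RunConfig; roughly, the wiring looks like the sketch below (cnn_model_fn and train_input_fn stand in for the model and input functions defined in that file, and the model_dir path is just a placeholder):

import tensorflow as tf

# Create the strategy, then pass it to the Estimator through its RunConfig.
distribution = tf.contrib.distribute.MirroredStrategy(num_gpus=2)
config = tf.estimator.RunConfig(train_distribute=distribution)

# cnn_model_fn / train_input_fn come from the linked cnn_mnist.py example.
mnist_classifier = tf.estimator.Estimator(
    model_fn=cnn_model_fn,
    model_dir="/tmp/mnist_convnet_model",
    config=config)

# Training is then distributed across the GPUs without changing the model code.
mnist_classifier.train(input_fn=train_input_fn, steps=1000)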


You can also list the available devices explicitly, including the CPU, as below. Training then runs on GPU:0, GPU:1, and CPU:0, but be aware that including the CPU makes it much slower.
distribution = tf.contrib.distribute.MirroredStrategy(["/gpu:0", "/gpu:1", "/cpu:0"])
If you want to try CollectiveAllReduceStrategy, you can create it like this:
distribution = tf.contrib.distribute.CollectiveAllReduceStrategy(num_gpus_per_worker=2)
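Since CollectiveAllReduceStrategy targets multi-worker training, each worker also needs a cluster description. A minimal sketch of the TF_CONFIG environment variable that is typically set on each worker (host names, ports, and indices here are only placeholders for your own cluster):

import json
import os

# Assumption: two workers at these example addresses; adjust to your cluster.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "worker": ["worker0.example.com:2222", "worker1.example.com:2222"]
    },
    # Each worker sets its own index (0 on the first machine, 1 on the second).
    "task": {"type": "worker", "index": 0}
})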

P.S. 1: In TensorFlow 2.0, you should use tf.distribute directly instead of tf.contrib.distribute (see the sketch below).
P.S. 2: I got an error with ParameterServerStrategy, so I skipped it.
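As noted in P.S. 1, the same idea carries over to TensorFlow 2.0 through tf.distribute. A minimal sketch, assuming a simple Keras model trained on MNIST, might look like this:

import tensorflow as tf

# MirroredStrategy now lives in tf.distribute instead of tf.contrib.distribute.
strategy = tf.distribute.MirroredStrategy()

# Variables must be created inside the strategy scope to be mirrored on each GPU.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

# Keras fit() handles the per-replica batching and the gradient all-reduce.
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
model.fit(x_train, y_train, batch_size=64, epochs=1)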

