Thursday, June 21, 2018

[TensorFlow Memory Optimization Experiment] Compare the memory options in Grappler's Memory Optimizer

As you may know, TensorFlow has an optimization module called "Grappler". It provides many kinds of optimization passes, such as layout optimization, memory optimization, model pruning, and so on. In this experiment, we can see the effect of enabling some of the memory options on a simple CNN model trained on the MNIST dataset.



Here is the simple CNN model:

import tensorflow as tf

height = 28
width = 28
channels = 1
n_inputs = height * width

conv1_fmaps = 32
conv1_ksize = 3
conv1_stride = 1
conv1_pad = "SAME"

conv2_fmaps = 64
conv2_ksize = 3
conv2_stride = 1
conv2_pad = "SAME"
conv2_dropout_rate = 0.25

pool3_fmaps = conv2_fmaps

n_fc1 = 128
fc1_dropout_rate = 0.5

n_outputs = 10

# Keep the input placeholders on the CPU.
with tf.device('/cpu:0'):
    with tf.name_scope("inputs"):
        X = tf.placeholder(tf.float32, shape=[None, n_inputs], name="X")
        X_reshaped = tf.reshape(X, shape=[-1, height, width, channels])
        y = tf.placeholder(tf.int32, shape=[None], name="y")
        training = tf.placeholder_with_default(False, shape=[], name='training')

# Build the convolutional network and the training op on the GPU.
with tf.device('/gpu:0'):
    conv1 = tf.layers.conv2d(X_reshaped, filters=conv1_fmaps, kernel_size=conv1_ksize,
                             strides=conv1_stride, padding=conv1_pad,
                             activation=tf.nn.relu, name="conv1")

    conv2 = tf.layers.conv2d(conv1, filters=conv2_fmaps, kernel_size=conv2_ksize,
                             strides=conv2_stride, padding=conv2_pad,
                             activation=tf.nn.relu, name="conv2")

    with tf.name_scope("pool3"):
        pool3 = tf.nn.max_pool(conv2, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding="VALID")
        pool3_flat = tf.reshape(pool3, shape=[-1, pool3_fmaps * 14 * 14])
        pool3_flat_drop = tf.layers.dropout(pool3_flat, conv2_dropout_rate, training=training)

    with tf.name_scope("fc1"):
        fc1 = tf.layers.dense(pool3_flat_drop, n_fc1, activation=tf.nn.relu, name="fc1")
        fc1_drop = tf.layers.dropout(fc1, fc1_dropout_rate, training=training)

    with tf.name_scope("output"):
        logits = tf.layers.dense(fc1_drop, n_outputs, name="output")
        Y_proba = tf.nn.softmax(logits, name="Y_proba")

    with tf.name_scope("train"):
        xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=y)
        loss = tf.reduce_mean(xentropy)
        optimizer = tf.train.AdamOptimizer()
        training_op = optimizer.minimize(loss)

with tf.device('/cpu:0'):
    with tf.name_scope("eval"):
        correct = tf.nn.in_top_k(logits, y, 1)
        accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

with tf.name_scope("init_and_save"):
    init = tf.global_variables_initializer()
    saver = tf.train.Saver()

There are several memory options available for you to use:

  1. NO_MEM_OPT
  2. DEFAULT_MEM_OPT
  3. SWAPPING_HEURISTICS
  4. RECOMPUTATION_HEURISTICS
  5. SCHEDULING_HEURISTICS
  6. HEURISTICS
  7. Third Party: Gradient-Checkpointing
P.S.: Gradient-Checkpointing is not related to Grappler's memory optimizer; it is just another approach.
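As a quick sanity check, the available modes can be listed directly from the protobuf enum. Here is a minimal sketch; the exact set of values depends on your TensorFlow version:

from tensorflow.core.protobuf import rewriter_config_pb2

# Print every memory optimization mode defined by the RewriterConfig proto,
# together with its numeric enum value.
for name, value in rewriter_config_pb2.RewriterConfig.MemOptType.items():
    print(name, value)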

You can pick any one of items 1 to 6 and substitute it for the placeholder in the snippet below:
from tensorflow.core.protobuf import rewriter_config_pb2
rewrite_options = rewriter_config_pb2.RewriterConfig(disable_model_pruning=True)
rewrite_options.memory_optimization = rewriter_config_pb2.RewriterConfig.<Put memory option here>
graph_options = tf.GraphOptions(rewrite_options=rewrite_options) #, infer_shapes=True)
config = tf.ConfigProto(graph_options=graph_options)

config.gpu_options.allow_growth = True  # grab GPU memory on demand instead of all at once
config.allow_soft_placement = True      # fall back to CPU when an op has no GPU kernel
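
To make the comparison concrete, here is a minimal sketch of how the config can be plugged into a training session, with the peak GPU memory read back via tf.contrib.memory_stats. It assumes the CNN graph above has already been built; the MNIST loading through tf.keras.datasets, the single epoch, and the batch_size value are illustrative assumptions, not the exact benchmark script:

import numpy as np

# Assumption: X, y, training, init, and training_op from the model above
# are available in the default graph.
(X_train, y_train), _ = tf.keras.datasets.mnist.load_data()
X_train = X_train.astype(np.float32).reshape(-1, n_inputs) / 255.0
y_train = y_train.astype(np.int32)

batch_size = 9000
with tf.device('/gpu:0'):
    peak_mem = tf.contrib.memory_stats.MaxBytesInUse()  # peak bytes allocated on this GPU

with tf.Session(config=config) as sess:
    init.run()
    for start in range(0, len(X_train), batch_size):
        sess.run(training_op, feed_dict={X: X_train[start:start + batch_size],
                                         y: y_train[start:start + batch_size],
                                         training: True})
    print("Peak GPU memory: %.2f MB" % (sess.run(peak_mem) / 1024.0**2))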
For the Gradient-Checkpointing approach, you should add the following code at the top of your program:
import memory_saving_gradients  # memory_saving_gradients.py from the gradient-checkpointing repository
from tensorflow.contrib.memory_stats.python.ops import memory_stats_ops

#monkey patch tf.gradients to point to our custom version, with automatic checkpoint selection
def grads(ys, xs, grad_ys=None, **kwargs):
    return memory_saving_gradients.gradients(ys, xs, grad_ys,
                                             checkpoints='memory', **kwargs)
old_grads = tf.gradients            # keep a reference to the original implementation
tf.__dict__["gradients"] = grads    # rebind tf.gradients to the memory-saving version

I picked several batch sizes and compared the GPU memory usage. The results with each memory option enabled are in the table below:

TensorFlow version = 1.8
Memory Option for Optimizer           Batch Size 9000   Batch Size 11000   Batch Size 11100   Batch Size 11105
NO_MEM_OPT                            OK                OOM                OOM                OOM
DEFAULT_MEM_OPT                       OK                OK                 OOM                OOM
SWAPPING_HEURISTICS                   OK                OK                 OOM                OOM
RECOMPUTATION_HEURISTICS              OK                OK                 OOM                OOM
SCHEDULING_HEURISTICS                 OK                OK                 OOM                OOM
* HEURISTICS                          OK                OK                 OK                 OK
Third Party: Gradient-Checkpointing   OK                OOM                OOM                OOM

In my case, HEURISTICS seems to be the best choice for optimizing memory usage when the batch size becomes extremely large.

Update:
The situation is a little different in TensorFlow 1.9; maybe I need to dig into the source code more.
The maximum batch size changes to 11052, and the winner is no longer "HEURISTICS".
Here you go:

Max Batch Size: 11052
NO_MEM_OPT: out of memory
SWAPPING_HEURISTICS: 6755.58 MB
RECOMPUTATION_HEURISTICS: 6723.38 MB
SCHEDULING_HEURISTICS: 6755.58 MB
HEURISTICS: out of memory

