Wednesday, November 14, 2018

[RNN] What are the differences in the input and output tensor shapes of dynamic_rnn and static_rnn in TensorFlow

When studying RNNs, the first issue I encountered in my program was the shape of the input and output tensors. Shape is very important information for connecting layers. Here I directly point out the differences in the input/output shapes of static RNN and dynamic RNN.
P.S: If you use Keras to write your RNN model, you won't need to deal with these details.

A short example of static RNN

Please pay attention to the output shapes in the following snippet.
import tensorflow as tf
from tensorflow.contrib import rnn

batch_size = 32
time_step = 5
input_size = 4
rnn_cell = 20
X = tf.placeholder(tf.float32, shape=[batch_size, time_step, input_size])
# unstack along the time axis: a list of `time_step` tensors,
# each of shape [batch_size, input_size]
x = tf.unstack(X, axis=1)
lstm_cell = rnn.BasicLSTMCell(rnn_cell)
# outputs: a list of `time_step` tensors, each of shape [batch_size, rnn_cell]
outputs, states = rnn.static_rnn(lstm_cell, x, dtype=tf.float32)
# the output of the last time step: shape [batch_size, rnn_cell]
output = outputs[-1]

A short example of dynamic RNN

Please pay attention to the output shapes in the following snippet.

batch_size = 32
time_step = 5
input_size = 4
rnn_cell = 20
X = tf.placeholder(tf.float32, shape=[batch_size, time_step, input_size])
lstm_cell = rnn.BasicLSTMCell(rnn_cell)
# dynamic_rnn takes the batched tensor directly (no tf.unstack needed);
# outputs has shape [batch_size, time_step, rnn_cell]
outputs, states = tf.nn.dynamic_rnn(lstm_cell, X, dtype=tf.float32)
# transpose to [time_step, batch_size, rnn_cell] so we can take the last step
outputs = tf.transpose(outputs, [1, 0, 2])
# the output of the last time step: shape [batch_size, rnn_cell]
output = outputs[-1]
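As a quick sanity check (a minimal sketch of mine reusing the names above; the scope arguments are my addition so the two cells don't collide), you can print the static shapes of both variants:

import tensorflow as tf
from tensorflow.contrib import rnn

batch_size, time_step, input_size, rnn_cell = 32, 5, 4, 20
X = tf.placeholder(tf.float32, shape=[batch_size, time_step, input_size])

# static_rnn: a list of `time_step` tensors, each [batch_size, rnn_cell]
s_out, _ = rnn.static_rnn(rnn.BasicLSTMCell(rnn_cell), tf.unstack(X, axis=1),
                          dtype=tf.float32, scope="static")
print("static: %d tensors of %s" % (len(s_out), s_out[-1].shape))  # static: 5 tensors of (32, 20)

# dynamic_rnn: a single tensor [batch_size, time_step, rnn_cell]
d_out, _ = tf.nn.dynamic_rnn(rnn.BasicLSTMCell(rnn_cell), X,
                             dtype=tf.float32, scope="dynamic")
print("dynamic: %s" % d_out.shape)  # dynamic: (32, 5, 20)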

RNN API:
https://www.tensorflow.org/api_docs/python/tf/nn/static_rnn
https://www.tensorflow.org/api_docs/python/tf/nn/dynamic_rnn

Monday, November 12, 2018

[TensorFlow] The explanation of average gradients by example in data parallelism

When studying examples of training a model on multiple GPUs ( in data parallelism ), an average-gradients function always shows up in some form. Here is a simple version:

def average_gradients(tower_grads):
    average_grads = []
    for grad_and_vars in zip(*tower_grads):
        # Note that each grad_and_vars looks like the following:
        #   ((grad0_gpu0, var0_gpu0), ... , (grad0_gpuN, var0_gpuN))
        grads = []
        for g, _ in grad_and_vars:
            # Add 0 dimension to the gradients to represent the tower.
            expanded_g = tf.expand_dims(g, 0)

            # Append on a 'tower' dimension which we will average over below.
            grads.append(expanded_g)

        # Average over the 'tower' dimension.
        grad = tf.concat(grads, 0)
        grad = tf.reduce_mean(grad, 0)

        # Keep in mind that the Variables are redundant because they are shared
        # across towers. So .. we will just return the first tower's pointer to
        # the Variable.
        v = grad_and_vars[0][1]
        grad_and_var = (grad, v)
        average_grads.append(grad_and_var)
    return average_grads

The purpose of this function is to grab each trainable variable's gradients across the GPUs and average them. Here I use fake data to show what the function average_gradients does and print out the result in detail. I hope this helps the readers understand it.
At least, it works for me!

average_grads = []

# This is the fake data for tower_grads
# we assume it has 3 variables in the model and uses 4 gpus
# so that the tower_grads will look like the following list:
tower_grads = [
[('grad0_gpu0', 'var0_gpu0'), ('grad1_gpu0', 'var1_gpu0') , ('grad2_gpu0', 'var2_gpu0')],
[('grad0_gpu1', 'var0_gpu1'), ('grad1_gpu1', 'var1_gpu1') , ('grad2_gpu1', 'var2_gpu1')],
[('grad0_gpu2', 'var0_gpu2'), ('grad1_gpu2', 'var1_gpu2') , ('grad2_gpu2', 'var2_gpu2')],
[('grad0_gpu3', 'var0_gpu3'), ('grad1_gpu3', 'var1_gpu3') , ('grad2_gpu3', 'var2_gpu3')]]


for grad_and_vars in zip(*tower_grads):
  grads = []
  for g, _ in grad_and_vars:
    # The real function calls tf.expand_dims(g, 0) here to add a leading
    # 'tower' dimension; with our fake string data we just collect the names.
    grads.append(g)

  # The real function concatenates the gradients and averages over the
  # 'tower' dimension (tf.concat + tf.reduce_mean); here we only show
  # which gradients would get averaged together.
  grad = "Avg: " + str(grads)
  print(grad)

  v = grad_and_vars[0][1]
  grad_and_var = (grad, v)
  average_grads.append(grad_and_var)

print(average_grads)

<<<grad>>>
Avg: ['grad0_gpu0', 'grad0_gpu1', 'grad0_gpu2', 'grad0_gpu3']
Avg: ['grad1_gpu0', 'grad1_gpu1', 'grad1_gpu2', 'grad1_gpu3']
Avg: ['grad2_gpu0', 'grad2_gpu1', 'grad2_gpu2', 'grad2_gpu3']


<<<average_grads>>>
[
("Avg: ['grad0_gpu0', 'grad0_gpu1', 'grad0_gpu2', 'grad0_gpu3']", 'var0_gpu0'),
("Avg: ['grad1_gpu0', 'grad1_gpu1', 'grad1_gpu2', 'grad1_gpu3']", 'var1_gpu0'),
("Avg: ['grad2_gpu0', 'grad2_gpu1', 'grad2_gpu2', 'grad2_gpu3']", 'var2_gpu0')
]
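
As a tiny numeric follow-up (my own example, not from the original post), here is the same tower-first averaging done with real arrays:

import numpy as np

# Two towers, one variable; each tower computed its own gradient for it.
tower_grads = [[(np.array([1.0, 2.0]), 'var0')],
               [(np.array([3.0, 6.0]), 'var0')]]

for grad_and_vars in zip(*tower_grads):
    # Stack along a new leading 'tower' dimension, then average it away.
    stacked = np.stack([g for g, _ in grad_and_vars])   # shape (2, 2)
    print("avg %s for %s" % (stacked.mean(axis=0), grad_and_vars[0][1]))   # avg [2. 4.] for var0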


P.S:
Here is a simple example to show the zip function in Python:

accum_slots = [ "accm_g1", "accm_g2", "accm_g3", "accm_g4", "accm_g5", "accm_g6", "accm_g7"]
grads_and_vars = [ ("g1", "v1"), ("g2", "v2"), ("g3", "v3"), ("g4", "v4"), ("g5", "v5"), ("g6", "v6"), ("g7", "v7")]

for s, (g, _) in zip(accum_slots, grads_and_vars):
  print(s, g)

('accm_g1', 'g1')
('accm_g2', 'g2')
('accm_g3', 'g3')
('accm_g4', 'g4')
('accm_g5', 'g5')
('accm_g6', 'g6')
('accm_g7', 'g7')

Wednesday, November 7, 2018

[Dynamic Control Flow] Whitepaper: Implementation of Control Flow in TensorFlow

In the following whitepaper, we can understand dynamic control flow in more detail.
Whitepaper: Implementation of Control Flow in TensorFlow
http://download.tensorflow.org/paper/white_paper_tf_control_flow_implementation_2017_11_1.pdf

Because there are so many points in it, here I only want to mention the point related to memory swapping, as follows:
"To reuse forward values in backprop loop, we automatically detect, during the construction of
the backprop while loop, the forward values that are needed in the backprop. For each such
forward value x, we automatically introduce a stack and add nodes in the forward loop to save
its value at each iteration to the stack. The backprop loop uses the values from the stack in the
reverse order. The stack lives outside the forward and backprop loops and is shared by the
two loops."


This stack push operation is inserted into the forward pass of the while_loop operation, and TensorFlow generates the corresponding stack pop operation in the backpropagation phase. If you check the source code, it can be found in the GradLoopState class.

tensorflow/python/ops/control_flow_ops.py
class GradLoopState(object):
  """The state used for constructing the gradient graph for a while loop.

  We create a GradLoopState for each while loop in forward and its
  corresponding while loop in backprop. This gives us access to both
  the forward and the backprop WhileContexts.

  During the construction of gradient graph, any time when we detect
  a forward value that is needed for backprop, we create a history
  accumulator and add it to `history_map`. Any time when we backprop
  a loop switch op (in _SwitchGrad), we add the grad merge op in
  `switch_map`.
  """
...
...
For more explanation and experiments in dynamic control flow, please refer to the paper:
Dynamic Control Flow in Large-Scale Machine Learning
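
As a minimal illustration (my own TF 1.x sketch, not code from the whitepaper), taking the gradient through a tf.while_loop is exactly what triggers these hidden stack push/pop nodes:

import tensorflow as tf

x = tf.placeholder(tf.float32, shape=[])
i0 = tf.constant(0)
# Forward while loop: square the value three times.
_, y = tf.while_loop(lambda i, v: i < 3,
                     lambda i, v: (i + 1, v * v),
                     [i0, x],
                     swap_memory=True)  # allow the forward-value stacks to be swapped to host memory

# Backprop needs the intermediate v from every iteration (d(v*v)/dv = 2v), so
# building this gradient inserts stack push ops in the forward loop and stack
# pop ops in the backprop loop, tracked by GradLoopState.
g = tf.gradients(y, x)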

Tuesday, October 30, 2018

[TensorFlow] Train in Tensorflow and do inference with the trained model

If you want to train your model in Tensorflow and do inference with the trained model, you can refer to this post.

1. Train your model

I will use the simple CNN model in my previous post:
[ONNX] Train in Tensorflow and export to ONNX (Part II)
https://danny270degree.blogspot.com/2018/08/onnx-train-in-tensorflow-and-export-to_20.html

So, after training, you will get these files:
my_mnist/
├── checkpoint
├── graph.pbtxt
├── my_mnist_model.data-00000-of-00001
├── my_mnist_model.index
└── my_mnist_model.meta

2. Freeze graph

We need to run tensorflow/python/tools/freeze_graph.py to convert the checkpoint values into embedded constants within the graph file itself. Here I use another script to freeze the model with TensorFlow's freeze_graph ( the format of the input graph is text format ):
bazel build tensorflow/python/tools:freeze_graph
bazel-bin/tensorflow/python/tools/freeze_graph \
    --input_graph=/danny/pyutillib/my_mnist/graph.pbtxt \
    --input_checkpoint=/danny/pyutillib/my_mnist/my_mnist_model \
    --output_graph=/danny/pyutillib/frozen_graph.pb \
    --output_node_names=output/output/BiasAdd \
    --input_binary=False
my_mnist
|-- checkpoint
|-- frozen_graph.pb  <== it will be generated.
|-- graph.pbtxt
|-- my_mnist_model.data-00000-of-00001
|-- my_mnist_model.index
`-- my_mnist_model.meta

P.S: The difficult part is finding the input and output node names, which will be needed later.
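If you are not sure about the names, one quick way (a small sketch of mine, assuming the file layout above) is to restore the meta graph and list every operation name:

import tensorflow as tf

# Restore the graph structure and print all op names, so you can pick the
# input/output node names required by freeze_graph.
tf.train.import_meta_graph('my_mnist/my_mnist_model.meta')
for op in tf.get_default_graph().get_operations():
    print(op.name)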

3. Transform graph

Here we use the graph transform tool from TensorFlow. For more in details, please check out this document:
https://github.com/tensorflow/tensorflow/tree/master/tensorflow/tools/graph_transforms
bazel build tensorflow/tools/graph_transforms:transform_graph
bazel-bin/tensorflow/tools/graph_transforms/transform_graph \
--in_graph='/danny/pyutillib/my_mnist/frozen_graph.pb' \
--out_graph='/danny/pyutillib/my_mnist/optimized_frozen_graph.pb' \
--inputs='inputs/X:0' \
--outputs='output/output/BiasAdd:0' \
--transforms='
  strip_unused_nodes
  fold_constants
  fold_batch_norms
  fold_old_batch_norms
  quantize_weights'
my_mnist/
|-- checkpoint
|-- frozen_graph.pb
|-- graph.pbtxt
|-- my_mnist_model.data-00000-of-00001
|-- my_mnist_model.index
|-- my_mnist_model.meta
`-- optimized_frozen_graph.pb <==  it will be generated.
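
To sanity-check the transforms (a sketch of mine; the paths assume the layout above), you can compare the node counts of the two graphs:

import tensorflow as tf

def count_nodes(path):
    # Parse a binary GraphDef and return its node count.
    graph_def = tf.GraphDef()
    with tf.gfile.GFile(path, "rb") as f:
        graph_def.ParseFromString(f.read())
    return len(graph_def.node)

print("frozen: %d nodes" % count_nodes('my_mnist/frozen_graph.pb'))
print("optimized: %d nodes" % count_nodes('my_mnist/optimized_frozen_graph.pb'))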

4. Do inference for your model

I will use this brief example to do inference with your frozen and optimized model.
import argparse
import tensorflow as tf
import numpy as np
import tensorflow.examples.tutorials.mnist.input_data as input_data

n_input = 784 # MNIST data input (img shape: 28*28)
n_classes = 10 # MNIST total classes (0-9 digits)

def load_graph(frozen_graph_filename):
    # We load the protobuf file from the disk and parse it to retrieve the
    # unserialized graph_def
    with tf.gfile.GFile(frozen_graph_filename, "rb") as f:
        graph_def = tf.GraphDef()
        graph_def.ParseFromString(f.read())

    # Then, we import the graph_def into a new Graph and returns it
    with tf.Graph().as_default() as graph:
        # The name var will prefix every op/nodes in your graph
        # Since we load everything in a new graph, this is not needed
        tf.import_graph_def(graph_def, name="prefix")
    return graph

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument("--frozen_model_filename", default="./optimized_frozen_graph.pb", type=str, help = "Quantized/Frozen model to import")
    args = parser.parse_args()

    graph = load_graph(args.frozen_model_filename)
    #for op in graph.get_operations():
    #    print(op.name)
    input_node  = graph.get_tensor_by_name('prefix/inputs/X:0')
    output_node = graph.get_tensor_by_name('prefix/output/output/BiasAdd:0')
    
    picture = np.ones([1, 784])
    print('picture:', picture)
    with tf.Session(graph=graph) as sess:
        predictions = sess.run(output_node, feed_dict={input_node: picture})
        for prediction in predictions:
            print("result:", np.argmax(prediction))

Run this example:
$ python mnist_inference.py --frozen_model_filename ./my_mnist/optimized_frozen_graph.pb

('picture:', array([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]]))
...
...
('result:', 5)

Or we can use a grayscale image (28x28) that I drew myself, shown as the array dump below:


Then, we need to change the code a little bit for our image (this also requires "import cv2"):
    picture = cv2.imread("my_mnist/2.png", cv2.IMREAD_GRAYSCALE)
    print('picture:', picture)
    picture = picture.reshape(1, 784) # from (28, 28) to (1, 784)
    with tf.Session(graph=graph) as sess:
        predictions = sess.run(output_node, feed_dict={input_node: picture})
        for prediction in predictions:
            print("result:", np.argmax(prediction))

Finally, we will get the result: 2
('picture:', array([[  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,  0,   0],
                    [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,  0,   0],
                    [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,  0,   0],
                    [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,  0,   0],
                    [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,  0,   0],
                    [  0,   0,   0,   0,   0,   0,   0,   0,   0, 255, 255, 255, 255, 255, 255,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,  0,   0],
                    [  0,   0,   0,   0,   0,   0,   0,   0, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,   0,   0,   0,   0,   0,   0,   0,   0,  0,   0],
                    [  0,   0,   0,   0,   0,   0,   0,   0, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,   0,   0,   0,   0,   0,   0,   0,  0,   0],
                    [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0, 255, 255, 255, 255, 255,   0,   0,   0,   0,   0,   0,   0,  0,   0],
                    [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0, 255, 255, 255, 255, 255,   0,   0,   0,   0,   0,   0,   0,  0,   0],
                    [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0, 255, 255, 255, 255,   0,   0,   0,   0,   0,   0,   0,   0,  0,   0],
                    [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,  0,  255, 255, 255,   0,   0,   0,   0,   0,   0,   0,   0,   0,  0,   0],
                    [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0, 255, 255, 255, 255,   0,   0,   0,   0,   0,   0,   0,   0,   0,  0,   0],
                    [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0, 255, 255, 255, 255, 255,   0,   0,   0,   0,   0,   0,   0,   0,   0,  0,   0],
                    [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0, 255, 255, 255, 255, 255,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,  0,   0],
                    [  0,   0,   0,   0,   0,   0,   0,   0, 255, 255, 255, 255, 255, 255, 255,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,  0,   0],
                    [  0,   0,   0,   0,   0,   0, 255, 255, 255, 255, 255, 255, 255, 255,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,  0,   0],
                    [  0,   0,   0,   0,   0, 255, 255, 255, 255, 255, 255,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,  0,   0],
                    [  0,   0,   0,   0, 255, 255, 255, 255, 255, 255,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,  0,   0],
                    [  0,   0,   0,   0, 255, 255, 255,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,  0,   0],
                    [  0,   0,   0,   0, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,   0,   0,   0,   0,   0,   0,   0, 255,   0,   0,   0,   0,  0,   0],
                    [  0,   0,   0,   0, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,   0,   0,   0,  0,   0],
                    [  0,   0,   0,   0,   0, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,   0,   0,   0,   0,  0,   0],
                    [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0, 255, 255, 255, 255, 255, 255, 255, 255, 255,   0,   0,   0,   0,   0,  0,   0],
                    [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,  0,   0],
                    [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,  0,   0],
                    [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,  0,   0],
                    [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,  0,   0]], 
     dtype=uint8))

...
...

('result:', 2)


Yes, I get the same result as expected!

P.S: If you trained your model in NCHW data format, it will not be able to run inference in a CPU-only environment, and it cannot simply be converted to NHWC.
Someone tried to convert the data format, but it didn't work. Please check out this link:
https://stackoverflow.com/questions/47014306/freeze-graph-with-different-data-format?rq=1

P.S: Here is another graph tool, optimize_for_inference, but it is the older one.
TransformGraph is the newer one.
https://stackoverflow.com/questions/45382917/how-to-optimize-for-inference-a-simple-saved-tensorflow-1-0-1-graph
http://bbs.bugcode.cn/t/63211



Wednesday, October 24, 2018

[LLVM] LLVM studying list for newbie

If you are an LLVM newbie and are interested in LLVM like me, you may want to take a look at my LLVM studying list. It took me time to search for the related resources and documents, so I think it will help somehow. By the way, most of the list items are written in Chinese, so they may not suit native English speakers.

1. LLVM installation and concept

Trying LLVM for the first time ( in Chinese )
https://medium.com/@zetavg/%E7%B7%A8%E8%AD%AF%E5%99%A8-llvm-%E6%B7%BA%E6%B7%BA%E7%8E%A9-42a58c7a7309

LLVM journey, stop 1 - building and simple usage (llvm, clang) ( in Chinese )

LLVM journey, stop 2 - environment configuration (llvm, clang) ( in Chinese )

LLVM journey, stop 3 - understanding LLVM IR (llvm, clang) ( in Chinese )

LLVM Tutorial  ( in English )
https://llvm.org/docs/tutorial/

LLVM Language Reference Manual
http://llvm.org/docs/LangRef.html

COSCUP2016 - The LLVM framework, from shallow to shallower ( Slideshare, in Chinese )
https://www.slideshare.net/hydai/coscup2016-llvm

LLVM introduction
https://people.cs.nctu.edu.tw/~chenwj/dokuwiki/doku.php?id=llvm

2. LLVM Pass

LLVM journey, stop 4 - writing a Pass (llvm, clang) ( in Chinese )
http://www.nagain.com/activity/article/14/

LLVM journey, stop 5 - debugging a Pass (llvm, clang) ( in Chinese )
http://www.nagain.com/activity/article/20/

LLVM: Call Graph and Control Flow Graph | PCB Blog ( in Chinese )
http://blog.binpang.me/2017/05/20/llvm-CGAndCFG/

3. LLVM more in-depth


Synthesis Flow @ LLVM 2.8
http://funningboy.blogspot.com/2011/02/

Create a working compiler with the LLVM framework, Part 1 ( in Chinese )
https://www.ibm.com/developerworks/cn/opensource/os-createcompilerllvm1/


Tuesday, October 23, 2018

[TensorFlow] Does increasing the CUDA stream number in TensorFlow help the processing time and transmission time?

Before starting to increase the CUDA stream number in TensorFlow, I want to recap some ideas about the Executor module. When a TensorFlow session runs, it builds an Executor. Meanwhile, if you enable CUDA in the TensorFlow build configuration, the Executor adds the visible GPU devices and creates a TF device object (a GPUDevice object) mapping to each physical GPU device. There are 4 kinds of streams inside GPUDevice:

  • CUDA stream 
  • Host_to_Device stream
  • Device_to_Host stream
  • Device_to_Device stream

By default, each of these 4 kinds of streams has only 1 stream. Please check out the following code:
class GPUDevice : public BaseGPUDevice {
 public:
  GPUDevice(const SessionOptions& options, const string& name,
            Bytes memory_limit, const DeviceLocality& locality,
            TfGpuId tf_gpu_id, const string& physical_device_desc,
            Allocator* gpu_allocator, Allocator* cpu_allocator)
      : BaseGPUDevice(options, name, memory_limit, locality, tf_gpu_id,
                      physical_device_desc, gpu_allocator, cpu_allocator,
                      false /* sync every op */, 1 /* max_streams */) {
    if (options.config.has_gpu_options()) {
      force_gpu_compatible_ =
          options.config.gpu_options().force_gpu_compatible();
    }
  }
  ...
  ...

If I change max_streams to 2, does it help to improve the training or inference speed? In my experiment, the answer is "No".

My case does a lot of memcpy between the GPU and CPU devices, and "stream=2" doesn't help to improve the processing time or the transmission time. The result also makes sense, because the bottlenecks are the GPU SMs for data processing time and PCIe for data transmission time.
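
One way to compare such settings (a sketch of mine using the TF 1.x tracing API; the matmul is just a stand-in workload) is to capture a Chrome-trace timeline, which shows the kernel and memcpy activity per stream:

import tensorflow as tf
from tensorflow.python.client import timeline

x = tf.random_normal([1000, 1000])
y = tf.matmul(x, x)

with tf.Session() as sess:
    # Ask the runtime to record per-op timing, including memcpy ops.
    run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
    run_metadata = tf.RunMetadata()
    sess.run(y, options=run_options, run_metadata=run_metadata)
    # Dump a trace that can be opened in chrome://tracing.
    tl = timeline.Timeline(run_metadata.step_stats)
    with open('timeline.json', 'w') as f:
        f.write(tl.generate_chrome_trace_format())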




Wednesday, October 17, 2018

[TensorFlow Grappler] How to do the topological sorting in TensorFlow Grappler?

If you try to implement an optimizer in TensorFlow Grappler, you have to know how to deal with the directed computation graph. One of the most important tools/concepts is topological sorting.
The definition from Wikipedia: Topological sorting
https://en.wikipedia.org/wiki/Topological_sorting
"In the field of computer science, a topological sort or topological ordering of a directed graph is a linear ordering of its vertices such that for every directed edge uv from vertex u to vertex v, u comes before v in the ordering."

In TensorFlow, there are already topological-sort-related functions in the following file, and we can take advantage of them:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/grappler/utils/topological_sort.cc
First of all, we have a directed computation graph that contains the node "fc1/fc1/MatMul".


Here is an example to dump the order of nodes in the graph using topological sorting:
static std::unordered_map<const NodeDef*, int> GetTopoOrdering(GrapplerItem* item) {
  std::unordered_map<const NodeDef*, int> topo_order;
  ComputeTopologicalOrder(item->graph, &topo_order, nullptr);
  for (const auto& n : topo_order) {
    const string& node_name = n.first->name();
    const int order = n.second;
    VLOG(1) << "...[DEBUG2] Node " << node_name << " at TopoOrdering order " << order;
  }
  return topo_order;
}

Result:
...[DEBUG2] Node train/Adam at TopoOrdering order 130
...[DEBUG2] Node train/Adam/Assign at TopoOrdering order 128
...[DEBUG2] Node train/Adam/mul at TopoOrdering order 126
...[DEBUG2] Node train/Adam/update_conv1/bias/ApplyAdam at TopoOrdering order 124
...[DEBUG2] Node train/gradients/conv1/BiasAdd_grad/BiasAddGrad at TopoOrdering order 122
...[DEBUG2] Node train/Adam/update_conv2/kernel/ApplyAdam at TopoOrdering order 121
...[DEBUG2] Node train/gradients/conv1/Relu_grad/ReluGrad at TopoOrdering order 120
...[DEBUG2] Node train/Adam/update_conv1/kernel/ApplyAdam at TopoOrdering order 125
...[DEBUG2] Node train/gradients/conv2/Conv2D_grad/Conv2DBackpropFilter at TopoOrdering order 119
...[DEBUG2] Node train/gradients/conv2/Relu_grad/ReluGrad at TopoOrdering order 115
...[DEBUG2] Node train/gradients/pool3/dropout/cond/dropout/div_grad/tuple/control_dependency at TopoOrdering order 109
...[DEBUG2] Node train/gradients/pool3/dropout/cond/dropout/div_grad/RealDiv at TopoOrdering order 108
...[DEBUG2] Node train/gradients/conv1/Conv2D_grad/Conv2DBackpropFilter at TopoOrdering order 123
...[DEBUG2] Node train/gradients/pool3/dropout/cond/Identity/Switch_grad/cond_grad at TopoOrdering order 106
...[DEBUG2] Node train/Adam/update_fc1/kernel/ApplyAdam at TopoOrdering order 105
...[DEBUG2] Node train/gradients/pool3/MaxPool_grad/MaxPoolGrad-2-TransposeNHWCToNCHW-LayoutOptimizer at TopoOrdering order 113
...[DEBUG2] Node train/gradients/pool3/dropout/cond/Merge_grad/cond_grad at TopoOrdering order 104
...[DEBUG2] Node train/Adam/update_output/kernel/ApplyAdam at TopoOrdering order 99
...[DEBUG2] Node train/gradients/fc1/fc1/Relu_grad/ReluGrad at TopoOrdering order 98
...[DEBUG2] Node train/gradients/output/output/MatMul_grad/MatMul at TopoOrdering order 95
...[DEBUG2] Node train/gradients/conv2/BiasAdd_grad/BiasAddGrad at TopoOrdering order 116
...[DEBUG2] Node train/gradients/output/output/BiasAdd_grad/BiasAddGrad at TopoOrdering order 94
...[DEBUG2] Node output/output/BiasAdd at TopoOrdering order 90
...[DEBUG2] Node output/output/MatMul at TopoOrdering order 89
...[DEBUG2] Node fc1/fc1/BiasAdd at TopoOrdering order 87
...[DEBUG2] Node fc1/fc1/Relu at TopoOrdering order 88
...[DEBUG2] Node pool3/dropout/cond/Merge at TopoOrdering order 85
...[DEBUG2] Node pool3/dropout/cond/dropout/mul at TopoOrdering order 82
...[DEBUG2] Node train/gradients/fc1/fc1/MatMul_grad/MatMul at TopoOrdering order 101
...[DEBUG2] Node train/gradients/zeros/Const at TopoOrdering order 80
...[DEBUG2] Node train/gradients/Shape_1 at TopoOrdering order 79
...[DEBUG2] Node pool3/dropout/cond/dropout/div at TopoOrdering order 78
...[DEBUG2] Node ConstantFoldingCtrl/pool3/dropout/cond/dropout/div/Switch_1 at TopoOrdering order 77
...[DEBUG2] Node train/gradients/Switch at TopoOrdering order 76
...[DEBUG2] Node pool3/dropout/cond/dropout/div/Switch at TopoOrdering order 75
...[DEBUG2] Node pool3/MaxPool-1-0-TransposeNCHWToNHWC-LayoutOptimizer at TopoOrdering order 73
...[DEBUG2] Node pool3/dropout/cond/dropout/Floor at TopoOrdering order 72
...[DEBUG2] Node pool3/MaxPool at TopoOrdering order 71
...[DEBUG2] Node ConstantFolding/train/gradients/conv2/Conv2D_grad/ShapeN-matshapes-1 at TopoOrdering order 67
...[DEBUG2] Node conv2/BiasAdd at TopoOrdering order 66
...[DEBUG2] Node train/gradients/pool3/dropout/cond/dropout/mul_grad/Mul at TopoOrdering order 107
...[DEBUG2] Node train/gradients/conv2/Conv2D_grad/ShapeN at TopoOrdering order 64
...[DEBUG2] Node pool3/dropout/cond/dropout/random_uniform/RandomUniform at TopoOrdering order 62
...[DEBUG2] Node pool3/dropout/cond/dropout/Shape at TopoOrdering order 59
...[DEBUG2] Node train/gradients/conv2/Conv2D_grad/Conv2DBackpropInput at TopoOrdering order 118
...[DEBUG2] Node pool3/dropout/cond/dropout/keep_prob at TopoOrdering order 58
...[DEBUG2] Node conv1/BiasAdd at TopoOrdering order 57
...[DEBUG2] Node ConstantFolding/train/gradients/conv1/Conv2D_grad/ShapeN-matshapes-1 at TopoOrdering order 54
...[DEBUG2] Node train/gradients/conv1/Conv2D_grad/Conv2DBackpropFilter-0-TransposeNHWCToNCHW-LayoutOptimizer at TopoOrdering order 52
...[DEBUG2] Node conv1/Conv2D-0-TransposeNHWCToNCHW-LayoutOptimizer at TopoOrdering order 51
...[DEBUG2] Node train/Adam/mul_1 at TopoOrdering order 127
...[DEBUG2] Node train/beta2_power/read at TopoOrdering order 50
...[DEBUG2] Node train/gradients/zeros_1 at TopoOrdering order 84
...[DEBUG2] Node train/beta1_power/read at TopoOrdering order 49
...[DEBUG2] Node train/Adam/update_fc1/bias/ApplyAdam at TopoOrdering order 103
...[DEBUG2] Node train/gradients/output/output/MatMul_grad/MatMul_1 at TopoOrdering order 96
...[DEBUG2] Node output/bias/read at TopoOrdering order 48
...[DEBUG2] Node output/kernel/read at TopoOrdering order 47
...[DEBUG2] Node fc1/bias/read at TopoOrdering order 46
...[DEBUG2] Node fc1/kernel/read at TopoOrdering order 45
...[DEBUG2] Node conv2/kernel/read at TopoOrdering order 43
...[DEBUG2] Node conv1/bias/read at TopoOrdering order 42
...[DEBUG2] Node conv1/kernel/read at TopoOrdering order 41
...[DEBUG2] Node train/gradients/fc1/fc1/MatMul_grad/MatMul_1 at TopoOrdering order 102
...[DEBUG2] Node inputs/training at TopoOrdering order 40
...[DEBUG2] Node inputs/Reshape at TopoOrdering order 39
...[DEBUG2] Node PermConstNCHWToNHWC-LayoutOptimizer at TopoOrdering order 38
...[DEBUG2] Node pool3/dropout/cond/dropout/random_uniform at TopoOrdering order 68
...[DEBUG2] Node PermConstNHWCToNCHW-LayoutOptimizer at TopoOrdering order 37
...[DEBUG2] Node train/SparseSoftmaxCrossEntropyWithLogits/SparseSoftmaxCrossEntropyWithLogits at TopoOrdering order 91
...[DEBUG2] Node train/Adam/epsilon at TopoOrdering order 36
...[DEBUG2] Node conv2/Conv2D at TopoOrdering order 63
...[DEBUG2] Node train/Adam/beta2 at TopoOrdering order 35
...[DEBUG2] Node train/Adam/update_output/bias/ApplyAdam at TopoOrdering order 97
...[DEBUG2] Node train/Adam/beta1 at TopoOrdering order 34
...[DEBUG2] Node train/gradients/train/SparseSoftmaxCrossEntropyWithLogits/SparseSoftmaxCrossEntropyWithLogits_grad/PreventGradient at TopoOrdering order 92
...[DEBUG2] Node pool3/Reshape at TopoOrdering order 74
...[DEBUG2] Node conv2/Relu at TopoOrdering order 69
...[DEBUG2] Node output/bias/Adam_1 at TopoOrdering order 32
...[DEBUG2] Node train/gradients/zeros at TopoOrdering order 83
...[DEBUG2] Node output/bias/Adam at TopoOrdering order 31
...[DEBUG2] Node output/kernel/Adam_1 at TopoOrdering order 30
...[DEBUG2] Node output/kernel/Adam at TopoOrdering order 29
...[DEBUG2] Node fc1/fc1/MatMul at TopoOrdering order 86
...[DEBUG2] Node fc1/bias/Adam_1 at TopoOrdering order 28
...[DEBUG2] Node train/gradients/pool3/MaxPool_grad/MaxPoolGrad at TopoOrdering order 114
...[DEBUG2] Node conv1/Relu at TopoOrdering order 61
...[DEBUG2] Node fc1/bias/Adam at TopoOrdering order 27
...[DEBUG2] Node train/gradients/pool3/Reshape_grad/Reshape at TopoOrdering order 112
...[DEBUG2] Node fc1/kernel/Adam_1 at TopoOrdering order 26
...[DEBUG2] Node train/gradients/pool3/dropout/cond/dropout/div/Switch_grad/cond_grad at TopoOrdering order 110
...[DEBUG2] Node pool3/dropout/cond/dropout/add at TopoOrdering order 70
...[DEBUG2] Node fc1/kernel/Adam at TopoOrdering order 25
...[DEBUG2] Node conv2/bias/Adam_1 at TopoOrdering order 24
...[DEBUG2] Node train/gradients/train/SparseSoftmaxCrossEntropyWithLogits/SparseSoftmaxCrossEntropyWithLogits_grad/mul at TopoOrdering order 93
...[DEBUG2] Node train/gradients/Shape_2 at TopoOrdering order 81
...[DEBUG2] Node conv2/bias/Adam at TopoOrdering order 23
...[DEBUG2] Node train/Adam/update_conv2/bias/ApplyAdam at TopoOrdering order 117
...[DEBUG2] Node train/gradients/AddN at TopoOrdering order 111
...[DEBUG2] Node conv2/kernel/Adam_1 at TopoOrdering order 22
...[DEBUG2] Node conv2/kernel/Adam at TopoOrdering order 21
...[DEBUG2] Node pool3/dropout/cond/dropout/random_uniform/mul at TopoOrdering order 65
...[DEBUG2] Node conv1/bias/Adam_1 at TopoOrdering order 20
...[DEBUG2] Node conv1/kernel/Adam_1 at TopoOrdering order 18
...[DEBUG2] Node train/Adam/Assign_1 at TopoOrdering order 129
...[DEBUG2] Node conv1/kernel/Adam at TopoOrdering order 17
...[DEBUG2] Node train/beta2_power at TopoOrdering order 16
...[DEBUG2] Node train/beta1_power at TopoOrdering order 15
...[DEBUG2] Node train/gradients/pool3/Reshape_grad/Shape at TopoOrdering order 14
...[DEBUG2] Node train/gradients/train/SparseSoftmaxCrossEntropyWithLogits/SparseSoftmaxCrossEntropyWithLogits_grad/ExpandDims at TopoOrdering order 13
...[DEBUG2] Node pool3/dropout/cond/switch_t at TopoOrdering order 56
...[DEBUG2] Node conv1/bias/Adam at TopoOrdering order 19
...[DEBUG2] Node output/bias at TopoOrdering order 12
...[DEBUG2] Node fc1/bias at TopoOrdering order 10
...[DEBUG2] Node train/Adam/learning_rate at TopoOrdering order 33
...[DEBUG2] Node fc1/kernel at TopoOrdering order 9
...[DEBUG2] Node pool3/Reshape/shape at TopoOrdering order 8
...[DEBUG2] Node conv2/bias at TopoOrdering order 7
...[DEBUG2] Node pool3/dropout/cond/Switch at TopoOrdering order 53
...[DEBUG2] Node conv2/kernel at TopoOrdering order 6
...[DEBUG2] Node conv1/bias at TopoOrdering order 5
...[DEBUG2] Node conv1/kernel at TopoOrdering order 4
...[DEBUG2] Node conv1/Conv2D at TopoOrdering order 55
...[DEBUG2] Node inputs/training/input at TopoOrdering order 3
...[DEBUG2] Node output/kernel at TopoOrdering order 11
...[DEBUG2] Node inputs/y at TopoOrdering order 2
...[DEBUG2] Node train/gradients/fc1/fc1/BiasAdd_grad/BiasAddGrad at TopoOrdering order 100
...[DEBUG2] Node ConstantFolding/pool3/dropout/cond/dropout/div_recip at TopoOrdering order 60
...[DEBUG2] Node inputs/Reshape/shape at TopoOrdering order 1
...[DEBUG2] Node conv2/bias/read at TopoOrdering order 44
...[DEBUG2] Node inputs/X at TopoOrdering order 0


So, we can see the sequence of the following nodes in the directed computation graph:
fc1/fc1/MatMul ==> fc1/fc1/BiasAdd ==> fc1/fc1/Relu
And the topological sorting method gives us the same sequence:

...[DEBUG2] Node fc1/fc1/BiasAdd at TopoOrdering order 87
...[DEBUG2] Node fc1/fc1/Relu at TopoOrdering order 88
...[DEBUG2] Node fc1/fc1/MatMul at TopoOrdering order 86
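
To make the idea concrete, here is a plain-Python sketch of mine (Kahn's algorithm, not Grappler's implementation) that produces the same kind of node-to-order map for the small chain above:

from collections import deque

def topological_order(graph):
    # graph maps each node name to the list of its downstream nodes.
    indegree = {n: 0 for n in graph}
    for outs in graph.values():
        for out in outs:
            indegree[out] = indegree.get(out, 0) + 1
    ready = deque(n for n, d in indegree.items() if d == 0)
    order = {}
    while ready:
        node = ready.popleft()
        order[node] = len(order)
        for out in graph.get(node, ()):
            indegree[out] -= 1
            if indegree[out] == 0:
                ready.append(out)
    return order

g = {'fc1/fc1/MatMul': ['fc1/fc1/BiasAdd'],
     'fc1/fc1/BiasAdd': ['fc1/fc1/Relu'],
     'fc1/fc1/Relu': []}
print(topological_order(g))  # MatMul before BiasAdd before Relu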

[Tool] Drawing a sequence diagram using the online tool sequencediagram.org

This website provides a free online tool for drawing sequence diagrams:
https://sequencediagram.org/

Basically, you can follow the instructions under the button at the top left corner. Check it out.
Here is my example of a sequence diagram tracing some of the XLA AOT source code in TensorFlow.

title TFCompile
participant tfcompile_main.cc Main
participant compile.cc
participant tf2xla.cc
participant XlaCompiler
participant CompileOnlyClient
participant CompileOnlyService

tfcompile_main.cc Main->compile.cc:CompileGraph(graph_def, config, flags, &compile_result)
compile.cc->tf2xla.cc:ConvertGraphDefToXla(graph_def, config, client, &computation)
tf2xla.cc->tf2xla.cc:InitGraph(graph_def, config, &graph)
tf2xla.cc->tf2xla.cc:ConvertGraphToXla(std::move(graph), client, computation)
tf2xla.cc->XlaCompiler:compiler.CompileGraph(XlaCompiler::CompileOptions(),\n"tfcompile", std::move(graph), xla_args, &result)

compile.cc->compile.cc:CompileXla(client, computation, aot_opts, compile_result)
compile.cc->CompileOnlyClient:client->CompileAheadOfTime({instance}, aot_opts)\n\n=============\n  std::vector<CompileOnlyService::AotXlaComputationInstance> service_instances;\n  service_instances.reserve(computations.size());\n  for (const AotXlaComputationInstance& instance : computations) {\n    service_instances.emplace_back();\n    CompileOnlyService::AotXlaComputationInstance& service_instance =\n        service_instances.back();\n    TF_RET_CHECK(instance.computation != nullptr);\n    service_instance.computation = instance.computation->proto();\n    service_instance.argument_layouts = instance.argument_layouts;\n    service_instance.result_layout = instance.result_layout;\n  }\n  return compiler_service_->CompileAheadOfTime(service_instances, options, metadata);\n
CompileOnlyClient->CompileOnlyService:compiler_service_->CompileAheadOfTime(service_instances, options, metadata)