Tuesday, July 17, 2018

[Confusion Matrix] How to calculate confusion matrix, precision and recall list from scratch

Here is a direct example with 10 categories, as in CIFAR-10 and MNIST. It shows how to calculate the confusion matrix and the precision and recall lists from scratch in Python. My data is generated at random, so you should replace it with your own. Here it goes:

import numpy
import json

CATEGORY = 10
SAMPLES = 1000
label_list = [i for i in range(CATEGORY)]

# Note: numpy.random.randint's upper bound is exclusive, so CATEGORY-1 here draws
# labels 0..8 only, which is why class 9 never appears in the sample result below.
# Use numpy.random.randint(0, CATEGORY, size=SAMPLES) to cover all classes.
pred_list = numpy.random.randint(0, CATEGORY-1, size=SAMPLES)
y_batch_list = numpy.random.randint(0, CATEGORY-1, size=SAMPLES)
print(pred_list, y_batch_list)

class confusion_matrix:
  def __init__(self, pred_list, y_batch_list, label_list):
    if len(pred_list) != len(y_batch_list):
      raise Exception('Prediction list and label list have different lengths!')
    self.pred_list = pred_list
    self.y_batch_list = y_batch_list
    self.matrix_size = len(label_list)

    # this matrix has 2 dimensions: (y_batch, pred)
    self.confusion_matrix = [[0 for x in range(self.matrix_size)] for y in range(self.matrix_size)]
    self.precision_list = [0 for x in range(self.matrix_size)]
    self.recall_list = [0 for x in range(self.matrix_size)]

  def calculate_confusion_matrix(self):
    for i in range(len(self.pred_list)):
      # dimension => [y_batch, pred]
      self.confusion_matrix[self.y_batch_list[i]][self.pred_list[i]] += 1

  def calculate_recall_precision_list(self):
    # calculate recall: diagonal value divided by the row sum (per true class)
    for i in range(self.matrix_size):
      tmp_value = 0
      for j in range(self.matrix_size):
        tmp_value += self.confusion_matrix[i][j]
      if tmp_value != 0:
        self.recall_list[i] = float(self.confusion_matrix[i][i]) / tmp_value

    # calculate precision: diagonal value divided by the column sum (per predicted class)
    for j in range(self.matrix_size):
      tmp_value = 0
      for i in range(self.matrix_size):
        tmp_value += self.confusion_matrix[i][j]
      if tmp_value != 0:
        self.precision_list[j] = float(self.confusion_matrix[j][j]) / tmp_value


  def gen_json_data(self):
    data = {'confusion_matrix': self.confusion_matrix,
            'precision_list': self.precision_list,
            'recall_list': self.recall_list
           }
    return data

ret = confusion_matrix(pred_list.tolist(), y_batch_list.tolist(), label_list)
ret.calculate_confusion_matrix()
ret.calculate_recall_precision_list()

Result:
print(ret.gen_json_data())
{'precision_list': [0.0625, 0.14912280701754385, 0.02654867256637168, 0.1452991452991453, 0.07377049180327869, 0.10526315789473684, 0.11320754716981132, 0.13, 0.13725490196078433, 0], 
 'confusion_matrix': [[7, 14, 10, 15, 17, 19, 17, 18, 14, 0], [10, 17, 14, 9, 5, 11, 9, 12, 12, 0], [11, 11, 3, 19, 16, 13, 4, 11, 7, 0], [13, 18, 16, 17, 13, 12, 11, 11, 12, 0], [15, 12, 15, 14, 9, 13, 17, 9, 11, 0], [19, 8, 11, 11, 17, 12, 13, 10, 8, 0], [9, 9, 10, 11, 14, 11, 12, 7, 15, 0], [20, 14, 13, 10, 18, 10, 11, 13, 9, 0], [8, 11, 21, 11, 13, 13, 12, 9, 14, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 
 'recall_list': [0.05343511450381679, 0.1717171717171717, 0.031578947368421054, 0.13821138211382114, 0.0782608695652174, 0.11009174311926606, 0.12244897959183673, 0.11016949152542373, 0.125, 0]
}
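
If you have scikit-learn installed, you can cross-check the hand-rolled results against its built-in metrics. This is only a verification aid and is not part of the original code; it reuses the pred_list, y_batch_list, and label_list defined above:

# Optional cross-check with scikit-learn (rows are true labels, columns are predictions,
# the same convention as the class above).
from sklearn.metrics import confusion_matrix as sk_confusion_matrix
from sklearn.metrics import precision_score, recall_score

cm = sk_confusion_matrix(y_batch_list, pred_list, labels=label_list)
precision = precision_score(y_batch_list, pred_list, labels=label_list, average=None)
recall = recall_score(y_batch_list, pred_list, labels=label_list, average=None)
print(cm)
print(precision)  # per-class precision, in label_list order
print(recall)     # per-class recall, in label_list order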


Saturday, July 14, 2018

[Qt5] How to develop Qt5 GUI with TensorFlow C++ library?

Here I give a simple and complete example of how to develop a Qt5 GUI with the TensorFlow C++ library on the Linux platform. Please check out my GitHub repository as follows:
https://github.com/teyenliu/tf_inference_gui

For building the TensorFlow C++ API library, you can refer to my previous post:
https://danny270degree.blogspot.com/2018/07/tensorflow-how-to-build-your-c-program.html

I think the key point is how to prepare the CMakeLists.txt, and you can refer to mine. If you open this project in Qt Creator and build it, the GUI looks like this when running:


Monday, July 9, 2018

[TensorFlow] How to implement LMDBDataset in tf.data API?

I have finished implementing LMDBDataset in the tf.data API. It may not be bug-free, but it is my first attempt at implementing both the C++ and Python sides of a TensorFlow component. The API architecture looks like this:

The whole implemented code is in my fork's TensorFlow repo with branch r1.8:
https://github.com/teyenliu/tensorflow/tree/r1.8

If you want to see what's implemented, please check it out:
https://github.com/teyenliu/tensorflow/commit/3941debe3001d52fe9a6d4048bd679a5a1f0f075

Basically, it can be used in the same way as TFRecordDataset or TextLineDataset.
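The original inline usage example did not survive, so here is a minimal sketch. The TFRecordDataset lines use the official tf.data API; the commented LMDBDataset line is an assumption based on it mirroring TFRecordDataset, and the exact class name and import path depend on the fork:

import tensorflow as tf

filenames = ["./lmdb_data/train.tfrecords"]  # hypothetical path; replace with your own files
dataset = tf.data.TFRecordDataset(filenames)               # official API
# dataset = tf.data.LMDBDataset(["./lmdb_data/data.mdb"])  # assumed constructor from the fork
dataset = dataset.shuffle(buffer_size=10000).batch(128).repeat(10)

iterator = dataset.make_one_shot_iterator()
next_batch = iterator.get_next()

with tf.Session() as sess:
    records = sess.run(next_batch)  # raw serialized records; parse them in a map() function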

By the way, I also provide some samples for those who want to benchmark the performance of TFRecordDataset, LMDBDataset, and others. Please also check the following:
https://github.com/teyenliu/tensorflow/tree/r1.8/tensorflow/examples/how_tos/reading_data

convert_to_records_lmdb.py: converts the MNIST data into LMDB format, which yields datapoints.

fully_connected_reader_lmdb.py: trains a fully connected neural net on the MNIST data in LMDB; it adds a new argument, perf, to measure only the performance of the input data pipeline.

Example 1: to train on the MNIST dataset, you may give the following command:
$ python fully_connected_reader_lmdb.py --train_dir ./lmdb_data --num_epochs 10 --batch_size 128 --perf training

Example 2: to check the performance of the data pipeline on the MNIST dataset, you may give the following command:
$ python fully_connected_reader_lmdb.py --train_dir ./lmdb_data --num_epochs 10 --batch_size 128 --perf datapipeline

The performance results show that the TFRecordDataset API is still the fastest of the formats tested.

Wednesday, July 4, 2018

[TensorFlow] How to build your C++ program or application with TensorFlow library using CMake

When you want to build your C++ program or application using the TensorFlow library or functions, you will probably encounter missing header files or linking problems. Here is the list of steps that I have verified to work well.

1. Prepare TensorFlow and its third party's library
$ git clone --recursive https://github.com/tensorflow/tensorflow
$ cd tensorflow/contrib/makefile
$ ./build_all_linux.sh

2. Build TensorFlow C++ APIs library
$ cd tensorflow
$ ./configure
<<< Configure the options in this step according to your requirements >>>
$ bazel build //tensorflow:libtensorflow_cc.so

3. Setup header file and library
$ sudo mkdir /usr/local/tensorflow
$ sudo mkdir /usr/local/tensorflow/include
$ sudo cp -r tensorflow/contrib/makefile/downloads/eigen/Eigen /usr/local/tensorflow/include/
$ sudo cp -r tensorflow/contrib/makefile/downloads/eigen/unsupported /usr/local/tensorflow/include/
$ sudo cp -r tensorflow/contrib/makefile/gen/protobuf/include/google /usr/local/tensorflow/include/
$ sudo cp tensorflow/contrib/makefile/downloads/nsync/public/* /usr/local/tensorflow/include/
$ sudo cp -r bazel-genfiles/tensorflow /usr/local/tensorflow/include/
$ sudo cp -r tensorflow/cc /usr/local/tensorflow/include/tensorflow
$ sudo cp -r tensorflow/core /usr/local/tensorflow/include/tensorflow
$ sudo mkdir /usr/local/tensorflow/include/third_party
$ sudo cp -r third_party/eigen3 /usr/local/tensorflow/include/third_party/
$ sudo mkdir /usr/local/tensorflow/lib
$ sudo cp bazel-bin/tensorflow/libtensorflow_*.so /usr/local/tensorflow/lib

Once you finish the steps above, you will be able to build your own C++ programs or applications against TensorFlow.
I also provide a simple project and CMakeLists.txt for your reference here:
https://github.com/teyenliu/dnn_tensorflow_cpp

If you git clone it, you will get this in the folder:
├── BUILD
├── CMakeLists.txt
├── data_set.cc
├── data_set.h
├── model.cc
├── normalized_car_features.csv
└── README.md

My CMakeLists.txt is here:
cmake_minimum_required(VERSION 3.0 FATAL_ERROR)
 
set(CMAKE_INCLUDE_CURRENT_DIR ON)
# Instruct CMake to run moc automatically when needed
set(CMAKE_AUTOMOC ON)
# Create code from a list of Qt designer ui files
set(CMAKE_AUTOUIC ON)

project(DNN_Tensorflow_CPP LANGUAGES CXX)

add_executable(${PROJECT_NAME} model.cc data_set.cc data_set.h)

#target_link_libraries(main PRIVATE tensorflow)
configure_file(normalized_car_features.csv ${CMAKE_CURRENT_BINARY_DIR}/normalized_car_features.csv COPYONLY)

if(MSVC)
    # The executable target is ${PROJECT_NAME}; there is no target named "main" in this file.
    target_compile_definitions(${PROJECT_NAME} PRIVATE COMPILER_MSVC)
endif(MSVC)

set(CMAKE_CXX_STANDARD 11)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -std=c++11 -g -fPIC  ")
set(CMAKE_EXE_LINKER_FLAGS  "${CMAKE_EXE_LINKER_FLAGS} " )

include_directories("/usr/local/tensorflow/include/external/nsync/public")
include_directories("/usr/local/tensorflow/include/")

TARGET_LINK_LIBRARIES(${PROJECT_NAME}  "/usr/local/tensorflow/lib/libtensorflow_cc.so")
TARGET_LINK_LIBRARIES(${PROJECT_NAME}  "/usr/local/tensorflow/lib/libtensorflow_framework.so")

Finally, I can build and run it successfully.
$ mkdir build
$ cd build
$ cmake ..
$ make
$ ./DNN_Tensorflow_CPP

P.S.: For more details about the C++ example, please check out this blog:
https://matrices.io/training-a-deep-neural-network-using-only-tensorflow-c/




Tuesday, June 26, 2018

[XLA JIT] How to turn on XLA JIT compilation at multiple GPUs training

Before I discuss this question, let's recall how to turn on XLA JIT compilation and use it in the TensorFlow Python API.

1. Session
Turning on JIT compilation at the session level will result in all possible operators being greedily compiled into XLA computations. Each XLA computation will be compiled into one or more kernels for the underlying device.
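
For reference, the session-level option is turned on through ConfigProto, roughly like this (a minimal sketch; the graph construction is omitted):

import tensorflow as tf

config = tf.ConfigProto()
# Enable XLA JIT for the whole session.
config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1

with tf.Session(config=config) as sess:
    # build and run the graph as usual
    pass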

2. Manual
JIT compilation can also be turned on manually for one or more operators. This is done by tagging the operators to compile with the attribute _XlaCompile=true. The simplest way to do this is via the tf.contrib.compiler.jit.experimental_jit_scope() scope defined in tensorflow/contrib/compiler/jit.py.

3. Placing operators on XLA devices (we won't consider this option because it requires too much tedious work)
Another way to run computations via XLA is to place an operator on a specific XLA device. This method is normally only used for testing.

Basically, I have tried the first two options in the script cifar10_multi_gpu_train.py, because it already contains code for multiple GPUs with synchronous updates.

For the first option (Session), it doesn't work with multi-GPU training. This option forces TensorFlow to greedily compile all possible operators into XLA computations, and it is not clear whether the multi-GPU synchronous-update design in cifar10_multi_gpu_train.py is preserved.

But the second option (Manual) does work with multi-GPU training using the JIT scope.

Here is my experiment using the second option (Manual, with the JIT scope):
Source: https://github.com/tensorflow/models/blob/master/tutorials/image/cifar10/cifar10_multi_gpu_train.py
Batch Size: 6000
Total Iterations: 2000

How to turn on XLA JIT compilation in cifar10_multi_gpu_train.py?
#Add jit_scope definition in the early beginning of your code
jit_scope = tf.contrib.compiler.jit.experimental_jit_scope
...
...

#Add jit_scope() scope
    with tf.variable_scope(tf.get_variable_scope()):
      for i in xrange(FLAGS.num_gpus):
        with tf.device('/gpu:%d' % i):
          with jit_scope():  # <-- Add this line
            with tf.name_scope('%s_%d' % (cifar10.TOWER_NAME, i)) as scope:
              # Dequeues one batch for the GPU
              image_batch, label_batch = batch_queue.dequeue()



Case 1: Turning on XLA JIT (using jit scope)
Training time: 910 seconds
Avg. images/sec: 36647
Memory Usage and GPU-Util%:


Case 2: Turning off XLA JIT (no XLA)
Training time: 1120 seconds
Avg. images/sec: 25203
Memory Usage and GPU-Util%:


In Summary:
Turning on XLA JIT for multi-GPU training in this experiment cut the training time by more than 18% (from 1120 s down to 910 s) and raised the average throughput from about 25,203 to 36,647 images/sec.

Sunday, June 24, 2018

[PCIe] How to read/write PCIe Switch Configuration Space?

Here is the question: how do we read/write a PCIe switch's configuration space? Let's look at this picture first.



The memory map shows the entire physical address space of the root complex. Only the green block at the bottom is system DRAM. The yellow areas above it are memory-mapped peripherals, including the PCIe switch. So the CPU can read the PCIe switch's configuration space via MMIO in host memory, which is why the Base Address Registers (BARs) are so important. My laptop doesn't have a PCIe switch device, so I just picked a SATA device instead; the following is a very simple example that reads 256 bytes of its configuration space:

P.S.: If you have a PCIe switch's vendor/device ID, just replace the IDs in the code.

aboutpci.c
/***
reference: http://telesky.pixnet.net/blog/post/7022197-a-simple-linux-driver-example-on-fpga%3A-adder
***/

#include <linux/init.h>
#include <linux/module.h>
#include <linux/pci.h>

MODULE_LICENSE("Dual BSD/GPL");

#define    OUR_SATA_VENDOR_ID    0x14e4
#define    OUR_SATA_PROD_ID    0x165f


void print_addr_func(u8 *src, int size) {
    int i;
    if (size < 0) {
        printk(KERN_ALERT "The size should be greater than 0!\n");
        return;
    }
    
    for(i = 0; i < size; i++) {
        if (! (i & 15))
            printk(KERN_ALERT " %02x:", i);
        printk(KERN_ALERT " %02x", src[i]);
        if ((i & 15) == 15)
            printk(KERN_ALERT "\n");
    }
}


static int aboutpci_init(void)
{
    u8 config_arr[256];
    //int iobase;
    //int iobase_end;
    int i;
    //u8 data_byte = 0;
    //u32 pio_start, pio_end, pio_flags, pio_len = 0;
    unsigned long mmio_start, mmio_end, mmio_flags, mmio_len;
    void __iomem *ioaddr;    /* ioremap() returns an __iomem pointer, not an integer */
    //u16 data_one_word;
    unsigned int *base_addr, *base_addr_0;

    struct pci_dev *pdev = NULL;

    //Finding the device by Vendor/Device ID Pair
    pdev = pci_get_device(OUR_SATA_VENDOR_ID, OUR_SATA_PROD_ID, pdev);
    if (pdev != NULL) {
        printk(KERN_ALERT "Our SATA HBA found!\n");
        if ( pdev->dma_mask == DMA_BIT_MASK(64) )
            printk(KERN_ALERT "64-bit addressing capable!\n");
        else if ( pdev->dma_mask == DMA_BIT_MASK(32) )
            printk(KERN_ALERT "32-bit addressing capable!\n");
        /* Bus-specific parameters. For a PCI NIC, it looks as follows */
        printk(KERN_ALERT "Use pci_read_config_byte() to print bytes in configuration space\n");
        for(i = 0; i < 256; i++) {
            pci_read_config_byte(pdev, i, &config_arr[i]);
            //printk(KERN_ALERT " %02X ", config_arr[i]);
        }
        print_addr_func(config_arr, 256);

        printk(KERN_ALERT "Use pci_resource_XXX() to access BAR 0\n");
        mmio_start = pci_resource_start (pdev, 0);
        mmio_end = pci_resource_end (pdev, 0);
        mmio_flags = pci_resource_flags (pdev, 0);
        mmio_len = pci_resource_len (pdev, 0);
        
        printk(KERN_ALERT "MMIO region size of BAR 1 is :%lu\n", mmio_len);
 printk(KERN_ALERT "MMIO region base addr is %x\n", mmio_start);

        /* make sure PCI base addr 1 is MMIO */
 if (!(mmio_flags & IORESOURCE_MEM)) {
     printk(KERN_ALERT, "region #1 not an MMIO resource, aborting\n");
 }
        
        // Get BAR0's address
        /* ioremap MMIO region */
 ioaddr = ioremap(mmio_start, mmio_len);
 if (ioaddr == NULL) {
     printk(KERN_ALERT "MMIO region is rrror!! \n");
 }   
 printk(KERN_ALERT "MMIO Remap addr is %x\n", ioaddr);
        // print out the MMIO region content from remap addr (virtual address)
        print_addr_func(ioaddr, 16 /* part of mmio_len */);
    }
    else
        printk(KERN_ALERT "Our SATA HBA Not found!\n");

    //Finding the device by its class code
    pdev = NULL;
    pdev = pci_get_class(PCI_CLASS_STORAGE_SATA_AHCI, pdev);
    if (pdev != NULL) {
        printk(KERN_ALERT "SATA HBA Class device found!\n");
        printk(KERN_ALERT "Device Vendor ID: 0x%X\n", pdev->vendor);
        printk(KERN_ALERT "Device Product ID: 0x%X\n", pdev->device);
     
       /* Bus-specific parameters. For a PCI NIC, it looks as follows */
       //iobase = pci_resource_start(dev, 1);
       //iobase_end = iobase + pci_resource_len(dev, 1);
       //printk(KERN_ALERT "Device class bar0 from: 0x%X to 0x%X\n", iobase, iobase_end);
    }
    else
        printk(KERN_ALERT "SATA HBA Class device Not found!\n");

    return 0;
}

static void aboutpci_exit(void)
{
    printk(KERN_ALERT "Goodbye, pci hackers\n");
}

module_init(aboutpci_init);
module_exit(aboutpci_exit);


Use this Makefile to build your module:
Makefile
ifneq ($(KERNELRELEASE),)
    obj-m := aboutpci.o
else
    KERNELDIR := /lib/modules/$(shell uname -r)/build
    PWD := $(shell pwd)
default:
        $(MAKE) -C $(KERNELDIR) M=$(PWD) modules
endif


Once the build is done, you will see the following files:
$ ls -al
-rw-rw-r-- 1 liudanny liudanny  3798 Aug 16  2017 aboutpci.c
-rw-rw-r-- 1 liudanny liudanny  6464 Jun 25 11:10 aboutpci.ko
-rw-rw-r-- 1 liudanny liudanny   363 Jun 25 11:10 .aboutpci.ko.cmd
-rw-rw-r-- 1 liudanny liudanny   542 Jun 25 11:10 aboutpci.mod.c
-rw-rw-r-- 1 liudanny liudanny  2536 Jun 25 11:10 aboutpci.mod.o
-rw-rw-r-- 1 liudanny liudanny 28760 Jun 25 11:10 .aboutpci.mod.o.cmd
-rw-rw-r-- 1 liudanny liudanny  5784 Jun 25 11:10 aboutpci.o
-rw-rw-r-- 1 liudanny liudanny 42695 Jun 25 11:10 .aboutpci.o.cmd
-rw-rw-r-- 1 liudanny liudanny   191 Jul 28  2017 Makefile
-rw-rw-r-- 1 liudanny liudanny    77 Jun 25 11:10 modules.order
-rw-rw-r-- 1 liudanny liudanny     0 Jul 28  2017 Module.symvers

Now you can insert the module and see the kernel messages:
$ sudo insmod aboutpci.ko
$ dmesg
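
As a side note, the same 256 bytes can also be dumped from user space through sysfs, which is handy for sanity-checking the module's output. This is a minimal sketch; the device address below is a placeholder that you must replace with your own (see lspci -D):

# Dump the first 256 bytes of a PCI device's configuration space via sysfs.
DEVICE_CONFIG = "/sys/bus/pci/devices/0000:01:00.0/config"  # placeholder BDF address

with open(DEVICE_CONFIG, "rb") as f:
    data = bytearray(f.read(256))

for offset in range(0, len(data), 16):
    hex_bytes = " ".join("%02x" % b for b in data[offset:offset + 16])
    print("%02x: %s" % (offset, hex_bytes))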

Thursday, June 21, 2018

[TensorFlow] How to get CPU configuration flags (such as SSE4.1, SSE4.2, and AVX...) in a bash script for building TensorFlow from source

Did you wonder what CPU configuration flags (such as SSE4.1, SSE4.2, and AVX...) you should use on your machine when building TensorFlow from source? If so, here is a quick solution for you.

1. Create a bash shell script file ( get_tf_build_cpu_opt.sh ) as below:
#!/usr/bin/env bash

# Detect platform
if [ "$(uname)" == "Darwin" ]; then
        # MacOS
        raw_cpu_flags=`sysctl -a | grep machdep.cpu.features | cut -d ":" -f 2 | tr '[:upper:]' '[:lower:]'`
elif [ "$(uname)" == "Linux" ]; then
        # GNU/Linux
        raw_cpu_flags=`grep flags -m1 /proc/cpuinfo | cut -d ":" -f 2 | tr '[:upper:]' '[:lower:]'`
else
        echo "Unknown plaform: $(uname)"
        exit -1
fi

COPT="--copt=-march=native"

for cpu_feature in $raw_cpu_flags
do
        case "$cpu_feature" in
                "sse4.1" | "sse4.2" | "ssse3" | "fma" | "cx16" | "popcnt" | "maes")
                    COPT+=" --copt=-m$cpu_feature"
                ;;
                "avx1.0")
                    COPT+=" --copt=-mavx"
                ;;
                *)
                        # noop
                ;;
        esac
done
echo $COPT

2. Execute it:
$ ./get_tf_build_cpu_opt.sh
==> On my machine, I got:
--copt=-march=native --copt=-mssse3 --copt=-mfma --copt=-mcx16 --copt=-mpopcnt

3. Now you can put these flags into your bazel build command to build TensorFlow from source, for example:
$ bazel build -c opt --copt=-mavx --copt=-mavx2 --copt=-mfma --copt=-mfpmath=both --copt=-msse4.2 --config=cuda -k //tensorflow/tools/pip_package:build_pip_package




Wednesday, June 20, 2018

[TensorFlow Memory Optimization Experiment] Comparing the memory options in the Grappler Memory Optimizer

As we know, TensorFlow has an optimization module called "Grappler". It provides many kinds of optimization functionality, such as Layout, Memory, ModelPruner, and so on. In this experiment, we can see the effect of enabling some of the memory options on a simple CNN model using the MNIST dataset.

Here is the simple CNN model:

height = 28
width = 28
channels = 1
n_inputs = height * width

conv1_fmaps = 32
conv1_ksize = 3
conv1_stride = 1
conv1_pad = "SAME"

conv2_fmaps = 64
conv2_ksize = 3
conv2_stride = 1
conv2_pad = "SAME"
conv2_dropout_rate = 0.25

pool3_fmaps = conv2_fmaps

n_fc1 = 128
fc1_dropout_rate = 0.5

n_outputs = 10

with tf.device('/cpu:0'):
    with tf.name_scope("inputs"):
        X = tf.placeholder(tf.float32, shape=[None, n_inputs], name="X")
        X_reshaped = tf.reshape(X, shape=[-1, height, width, channels])
        y = tf.placeholder(tf.int32, shape=[None], name="y")
        training = tf.placeholder_with_default(False, shape=[], name='training')

with tf.device('/gpu:0'):
    conv1 = tf.layers.conv2d(X_reshaped, filters=conv1_fmaps, kernel_size=conv1_ksize,
                             strides=conv1_stride, padding=conv1_pad,
                             activation=tf.nn.relu, name="conv1")

    conv2 = tf.layers.conv2d(conv1, filters=conv2_fmaps, kernel_size=conv2_ksize,
                             strides=conv2_stride, padding=conv2_pad,
                             activation=tf.nn.relu, name="conv2")

    with tf.name_scope("pool3"):
        pool3 = tf.nn.max_pool(conv2, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding="VALID")
        pool3_flat = tf.reshape(pool3, shape=[-1, pool3_fmaps * 14 * 14])
        pool3_flat_drop = tf.layers.dropout(pool3_flat, conv2_dropout_rate, training=training)

    with tf.name_scope("fc1"):
        fc1 = tf.layers.dense(pool3_flat_drop, n_fc1, activation=tf.nn.relu, name="fc1")
        fc1_drop = tf.layers.dropout(fc1, fc1_dropout_rate, training=training)

    with tf.name_scope("output"):
        logits = tf.layers.dense(fc1, n_outputs, name="output")
        Y_proba = tf.nn.softmax(logits, name="Y_proba")

    with tf.name_scope("train"):
        xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=y)
        loss = tf.reduce_mean(xentropy)
        optimizer = tf.train.AdamOptimizer()
        training_op = optimizer.minimize(loss)

with tf.device('/cpu:0'):
    with tf.name_scope("eval"):
        correct = tf.nn.in_top_k(logits, y, 1)
        accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

with tf.name_scope("init_and_save"):
    init = tf.global_variables_initializer()
    saver = tf.train.Saver()

There are several memory options for you to use as follows:

  1. NO_MEM_OPT
  2. DEFAULT_MEM_OPT
  3. SWAPPING_HEURISTICS
  4. RECOMPUTATION_HEURISTICS
  5. SCHEDULING_HEURISTICS
  6. HEURISTICS
  7. Third Party: Gradient-Checkpointing
P.S.: Gradient-Checkpointing is not related to Grappler's memory optimizer; it is just another approach.

You can use any of items 1 to 6 by putting its name in the placeholder in the following snippet.
from tensorflow.core.protobuf import rewriter_config_pb2
rewrite_options = rewriter_config_pb2.RewriterConfig(disable_model_pruning=True)
rewrite_options.memory_optimization = rewriter_config_pb2.RewriterConfig.<Put memory option here>
graph_options = tf.GraphOptions(rewrite_options=rewrite_options) #, infer_shapes=True)
config = tf.ConfigProto(graph_options=graph_options)

config.gpu_options.allow_growth=True
config.allow_soft_placement = True
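
As a minimal sketch, the resulting config is then passed to the session like this (the training loop itself is omitted):

with tf.Session(config=config) as sess:
    sess.run(init)
    # sess.run(training_op, feed_dict={X: X_batch, y: y_batch, training: True})
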
For the Gradient-Checkpointing approach, you should add the following code at the top of your program:
from tensorflow.contrib.memory_stats.python.ops import memory_stats_ops
# memory_saving_gradients.py comes from the third-party Gradient-Checkpointing project;
# place it next to your script or on the PYTHONPATH.
import memory_saving_gradients

#monkey patch tf.gradients to point to our custom version, with automatic checkpoint selection
def grads(ys, xs, grad_ys=None, **kwargs):
    return memory_saving_gradients.gradients(ys, xs, grad_ys,
                                             checkpoints='memory', **kwargs)
old_grads = tf.gradients
tf.__dict__["gradients"] = grads

I picked several batch sizes and compared the GPU memory usage. The results with each memory option enabled are in the table below:

TensorFlow version = 1.8
Memory Option for Optimizer    Batch Size: 9000   Batch Size: 11000   Batch Size: 11100   Batch Size: 11105
NO_MEM_OPT                     OK                 OOM                 OOM                 OOM
DEFAULT_MEM_OPT                OK                 OK                  OOM                 OOM
SWAPPING_HEURISTICS            OK                 OK                  OOM                 OOM
RECOMPUTATION_HEURISTICS       OK                 OK                  OOM                 OOM
SCHEDULING_HEURISTICS          OK                 OK                  OOM                 OOM
* HEURISTICS                   OK                 OK                  OK                  OK
Third Party: Check-pointing    OK                 OOM                 OOM                 OOM

In my case, it seems that HEURISTICS is the best choice for optimizing memory usage when the batch size becomes extremely large.

Update:
The situation is a little bit different in TensorFlow 1.9. Maybe I need to dig into the source code more. 
The max batch size changes to 11052, and the winner is not "HEURISTICS" anymore. 
Here you go:

Max Batch Size: 11052
NO_MEM_OPT: out of memory
SWAPPING_HEURISTICS: 6755.58 MB
RECOMPUTATION_HEURISTICS: 6723.38 MB
SCHEDULING_HEURISTICS: 6755.58 MB
HEURISTICS: out of memory
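
For reference, peak-memory numbers like those above can be read with the contrib memory-stats ops already imported in the snippet earlier (a minimal sketch; it assumes the config and init defined above):

from tensorflow.contrib.memory_stats.python.ops import memory_stats_ops

with tf.device('/gpu:0'):
    max_bytes_in_use = memory_stats_ops.MaxBytesInUse()

with tf.Session(config=config) as sess:
    sess.run(init)
    # ... run some training steps here ...
    print("Peak GPU memory: %.2f MB" % (sess.run(max_bytes_in_use) / 1024.0 / 1024.0))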


Thursday, June 14, 2018

[XLA Study] How to use XLA AOT compilation in TensorFlow

This document explains how to use AOT compilation in TensorFlow. We will use tfcompile, a standalone tool that ahead-of-time (AOT) compiles TensorFlow graphs into executable code. It can reduce the total binary size and also avoid some runtime overheads. A typical use case of tfcompile is to compile an inference graph into executable code for mobile devices. The steps are as follows:

1. Build tool: tfcompile
> bazel build --config=opt --config=cuda //tensorflow/compiler/aot:tfcompile

2. Run this file, make_simple_test_graph.py, to build the graph file as follows (the config file is written by hand; see the note after the script):
import argparse
import os
import sys

from tensorflow.core.protobuf import saver_pb2
from tensorflow.python.client import session
from tensorflow.python.framework import constant_op
from tensorflow.python.framework import dtypes
from tensorflow.python.framework import function
from tensorflow.python.framework import ops
from tensorflow.python.ops import array_ops
from tensorflow.python.ops import control_flow_ops
from tensorflow.python.ops import math_ops
from tensorflow.python.ops import variables
from tensorflow.python.platform import app
from tensorflow.python.training import saver as saver_lib

FLAGS = None

def tfmatmul(_):
  x = array_ops.placeholder(dtypes.float32, name='x_hold')
  y = array_ops.placeholder(dtypes.float32, name='y_hold')
  math_ops.matmul(x, y, name='x_y_prod')

def tfmatmulandadd(_):
  # This tests multiple outputs.
  x = array_ops.placeholder(dtypes.float32, name='x_hold')
  y = array_ops.placeholder(dtypes.float32, name='y_hold')
  math_ops.matmul(x, y, name='x_y_prod')
  math_ops.add(x, y, name='x_y_sum')

def write_graph(build_graph, out_dir):
  """Build a graph using build_graph and write it out."""
  g = ops.Graph()
  with g.as_default():
    build_graph(out_dir)
    filename = os.path.join(out_dir, 'test_graph_%s.pb' % build_graph.__name__)
    with open(filename, 'wb') as f:
      f.write(g.as_graph_def().SerializeToString())

def main(_):
  write_graph(tfmatmul, FLAGS.out_dir)

if __name__ == '__main__':
  parser = argparse.ArgumentParser()
  parser.register('type', 'bool', lambda v: v.lower() == 'true')
  parser.add_argument(
      '--out_dir',
      type=str,
      default='',
      help='Output directory for graphs, checkpoints and savers.')
  FLAGS, unparsed = parser.parse_known_args()
  app.run(main=main, argv=[sys.argv[0]] + unparsed)
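
Note that the BUILD rule in the next step also references test_graph_tfmatmul.config.pbtxt, which the script above does not generate; it has to be written by hand to declare the feeds and fetches. Based on the shapes used later in my_code.cc (a 2x3 matrix multiplied by a 3x2 matrix), it would look roughly like this:

# test_graph_tfmatmul.config.pbtxt (hand-written; shapes assumed from my_code.cc)
feed {
  id { node_name: "x_hold" }
  shape {
    dim { size: 2 }
    dim { size: 3 }
  }
}
feed {
  id { node_name: "y_hold" }
  shape {
    dim { size: 3 }
    dim { size: 2 }
  }
}
fetch {
  id { node_name: "x_y_prod" }
}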

3. Add scripts to BUILD and generate C++ header files:
> vi BUILD

load("//tensorflow/compiler/aot:tfcompile.bzl", "tf_library")
tf_library(
    name = "test_graph_tfmatmul",
    graph = "test_graph_tfmatmul.pb",
    config = "test_graph_tfmatmul.config.pbtxt",
    cpp_class = "foo::bar::MatMulComp"
)
> bazel build :test_graph_tfmatmul

4. Write test C++ code:
> vi my_code.cc

# include "tensorflow/compiler/aot/tests/myaot/test_graph_tfmatmul.h"

int main(int argc, char** argv) {
  foo::bar::MatMulComp matmul;

  // Set up args and run the computation.
  const float args[12] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};
  std::copy(args + 0, args + 6, matmul.arg0_data());
  std::copy(args + 6, args + 12, matmul.arg1_data());
  matmul.Run();

  // Check result
  if (matmul.result0(0, 0) == 58) {
    std::cout << "Success" << std::endl;
  } else {
    std::cout << "Failed. Expected value 58 at 0,0. Got:"
              << matmul.result0(0, 0) << std::endl;
  }

  return 0;
}

5. Add scripts to BUILD and build my_code.cc
> vi BUILD

cc_binary(
    name = "my_binary",
    srcs = [ "my_code.cc" ],
    deps = [ ":test_graph_tfmatmul", "//third_party/eigen3" ],
    linkopts = [ "-lpthread" ]
)
> bazel build :my_binary

6. To run the result:
> bazel run my_binary
Build XLA AOT compilation

So I got the result ==> "Success", which proves that the MatMul calculation under AOT compilation is correct!





Friday, June 8, 2018

[XLA Study] Taking a glance at the graph changes during XLA JIT compilation

As a preamble to this article: understanding XLA JIT is pretty hard, because you probably need to understand the TensorFlow Graph, the Executor, LLVM, and the math... I have been through this painful study, so I hope my experience can help those who are interested in XLA but have not yet come to grips with it.

First, I use the following code to build my TF graph.
W = tf.get_variable(shape=[], name='weights')
b = tf.get_variable(shape=[], name='bias')
x_observed = tf.placeholder(shape=[None],
                            dtype=tf.float32,
                            name='x_observed')
y_pred = W * x_observed + b
learning_rate = 0.025

y_observed = tf.placeholder(shape=[None], dtype=tf.float32, name='y_observed')

loss_op = tf.reduce_mean(tf.square(y_pred - y_observed))
optimizer_op = tf.train.GradientDescentOptimizer(learning_rate)
train_op = optimizer_op.minimize(loss_op)

I try to dump all the temporary graphs produced during XLA JIT compilation with the following flags and environment variables. After execution, you will get a lot of pbtxt files and a log.
export TF_CPP_MIN_VLOG_LEVEL=2
TF_XLA_FLAGS="--xla_hlo_profile --tf_xla_clustering_debug --tf_dump_graph_prefix=/tmp --vmodule=xla_compiler=2 --xla_hlo_graph_path=/tmp --xla_generate_hlo_graph=.*" python test.py 2>&1 | tee xla_log.txt
Unfortunately, the protobuf library cannot read some of the XLA-dumped pbtxt files, so I can only show what I got.
The script I use to convert a pbtxt file to a PNG image is here for reference:
import tensorflow as tf
import sys

from google.protobuf import text_format
from graphviz import Digraph

dot = Digraph()

if len(sys.argv) != 2:
  sys.exit("convert <graphdef.pb>")

with tf.gfile.FastGFile(sys.argv[1], 'rb') as f:
  graph_def = tf.GraphDef()
  text_format.Merge(f.read(), graph_def)
  tf.import_graph_def(graph_def)

  for n in graph_def.node:
    dot.node(n.name, label=n.name)

    for i in n.input:
      # Edges are determined by the names of the nodes
      dot.edge(i, n.name)

dot.format = 'png'
dot.render(sys.argv[1] + ".gv", view=True)

So, based on the graph generated at the beginning of the article, we only need to focus on these two TensorFlow subgraphs:
1. W = tf.get_variable(shape=[], name='weights')
2. b = tf.get_variable(shape=[], name='bias')

The reason we only look at these is that the following explanation of XLA JIT compilation revolves around them.
Actually, the XLA JIT compilation process is very complicated. I only want to list the steps that XLA JIT compilation goes through at execution time.

Before going into the HLO optimization phase, TensorFlow first runs the JIT compilation passes, which consist of 3 main steps:

1. Mark For Compilation Pass:

2. Encapsulate Subgraph Pass:

Before:
After:

3. Build Xla Launch Ops Pass:


Then, before emitting low-level code for the graph, TensorFlow runs the HLO pass pipeline optimization on the HLO IR, which contains several steps:

1. Optimization:

2. Simplification:

3. Conv Canonicalization:

4. Layout Assignment:

5. Fusion:

6. Reduce Precision:

7. GPU IR Emit Prepare:



The HLO IR looks like this:



The LLVM IR looks like this:


In this phase, XLA can use the GPU compiler backend to generate binary executable code from the LLVM IR, but that is beyond the scope of this discussion.

Thursday, June 7, 2018

[TX2 Study] My first try on the Jetson TX2

I got a Jetson TX2 several days ago from my friend, and it looks like the following pictures. I set it up using Nvidia's installation tool, JetPack-L4T-3.2 (JetPack-L4T-3.2-linux-x64_b196.run). During the installation, I did encounter some issues with not being able to set the IP address on the TX2, but I resolved them. If anyone still has this issue, let me know and I will post another article to explain the steps.




Basically, on the TX2 there is no "nvidia-smi"-style command-line tool to check the GPU's status; you need to use the tools below:

1. Use deviceQuery to get hardware information
nvidia@tegra-ubuntu:~$ /usr/local/cuda-9.0/bin/cuda-install-samples-9.0.sh .
nvidia@tegra-ubuntu:~$ cd NVIDIA_CUDA-9.0_Samples/1_Utilities/deviceQuery
nvidia@tegra-ubuntu:~/NVIDIA_CUDA-9.0_Samples/1_Utilities/deviceQuery$ make
nvidia@tegra-ubuntu:~/NVIDIA_CUDA-9.0_Samples/1_Utilities/deviceQuery$ ./deviceQuery

2. Use tegrastats (in the nvidia user's home directory) to get the hardware status.
nvidia@tegra-ubuntu:~$ ./tegrastats
RAM 5974/7854MB (lfb 31x4MB) CPU [0%@1113,0%@345,0%@345,0%@1113,0%@1113,0%@1113] BCPU@49.5C MCPU@49.5C GPU@51.5C PLL@49.5C Tboard@44C Tdiode@52C PMIC@100C thermal@50.5C VDD_IN 11324/11324 VDD_CPU 304/304 VDD_GPU 5642/5642 VDD_SOC 1295/1295 VDD_WIFI 0/0 VDD_DDR 2654/2654
RAM 5974/7854MB (lfb 31x4MB) CPU [14%@345,0%@345,0%@345,11%@345,10%@345,11%@345] BCPU@49.5C MCPU@49.5C GPU@51.5C PLL@49.5C Tboard@44C Tdiode@51.75C PMIC@100C thermal@50.3C VDD_IN 11214/11269 VDD_CPU 305/304 VDD_GPU 5566/5604 VDD_SOC 1295/1295 VDD_WIFI 0/0 VDD_DDR 2615/2634
RAM 5974/7854MB (lfb 31x4MB) CPU [16%@345,0%@345,0%@345,5%@345,7%@345,4%@345] BCPU@49.5C MCPU@49.5C GPU@52C PLL@49.5C Tboard@44C Tdiode@52.25C PMIC@100C thermal@51.3C VDD_IN 12239/11592 VDD_CPU 304/304 VDD_GPU 6250/5819 VDD_SOC 1371/1320 VDD_WIFI 0/0 VDD_DDR 2807/2692

If you want to install TensorFlow on the TX2, this task is now very easy because Nvidia has provided an automated installation script here:
nvidia@tegra-ubuntu:~$ git clone https://github.com/JasonAtNvidia/JetsonTFBuild
nvidia@tegra-ubuntu:~$ cd JetsonTFBuild && sudo ./BuildTensorflow.sh

I also did an experiment to compare the training speed against a normal server with one Nvidia GTX 1080 Ti card inside. The training script is on my GitHub:
https://github.com/teyenliu/pyutillib/blob/master/mnist_gpu_tx2.py

In my training experiment, with the same simple CNN model and batch size, the result is that the GTX 1080 Ti is 11 times faster than the TX2. (This makes sense, because the TX2 is built for inference jobs at the edge.)

The following picture shows the TX2 running the training job. The fan only turns on when its temperature rises above around 50°C.