[TensorFlow] How to get CPU configuration flags in a bash script when building TensorFlow from source

Have you ever wondered which CPU configuration flags you should use on your machine when building TensorFlow from source? If so, here is a quick solution for you.

1. Create a bash shell script file as below:
#!/usr/bin/env bash

# Detect platform
if [ "$(uname)" == "Darwin" ]; then
    # MacOS
    raw_cpu_flags=`sysctl -a | grep machdep.cpu.features | cut -d ":" -f 2 | tr '[:upper:]' '[:lower:]'`
elif [ "$(uname)" == "Linux" ]; then
    # GNU/Linux
    raw_cpu_flags=`grep flags -m1 /proc/cpuinfo | cut -d ":" -f 2 | tr '[:upper:]' '[:lower:]'`
else
    echo "Unknown platform: $(uname)"
    exit -1
fi

COPT="--copt=-march=native"

for cpu_feature in $raw_cpu_flags
do
    case "$cpu_feature" in
        "sse4.1" | "sse4.2" | "ssse3" | "fma" | "cx16" | "popcnt"…
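For readers who prefer Python, the same idea can be sketched as a small function that maps reported CPU features to bazel `--copt` flags. The feature list mirrors the visible part of the bash script above; anything beyond the excerpt is an assumption.

```python
# Rough Python sketch of the bash script's idea: map CPU feature names
# (as reported by /proc/cpuinfo or sysctl) to bazel --copt flags.
# Only the features visible in the script excerpt are handled here.

def copt_flags(raw_cpu_flags):
    # Features whose GCC flag is simply "-m<feature>"
    wanted = {"sse4.1", "sse4.2", "ssse3", "fma", "cx16", "popcnt"}
    flags = ["--copt=-march=native"]
    for feature in raw_cpu_flags.split():
        if feature in wanted:
            flags.append("--copt=-m%s" % feature)
    return flags

if __name__ == "__main__":
    # Example feature string in the format produced by the script
    sample = "fpu vme ssse3 fma cx16 sse4.1 sse4.2 popcnt"
    print(" ".join(copt_flags(sample)))
```

The resulting flags would then be passed to the bazel build command in place of the plain `--config=opt` defaults.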

[TensorFlow Memory Optimization Experiment] Compare the memory options in Grappler Memory Optimizer

In TensorFlow there is an optimization module called "Grappler". It provides many kinds of optimization functionality, such as Layout, Memory, ModelPruner, and so on. In this experiment, we can see the effect of enabling some of the memory options on a simple CNN model using the MNIST dataset.

Here is the simple CNN model:

height = 28
width = 28
channels = 1
n_inputs = height * width

conv1_fmaps = 32
conv1_ksize = 3
conv1_stride = 1
conv1_pad = "SAME"

conv2_fmaps = 64
conv2_ksize = 3
conv2_stride = 1
conv2_pad = "SAME"
conv2_dropout_rate = 0.25

pool3_fmaps = conv2_fmaps
n_fc1 = 128
fc1_dropout_rate = 0.5
n_outputs = 10

with tf.device('/cpu:0'):
    with tf.name_scope("inputs"):
        X = tf.placeholder(tf.float32, shape=[None, n_inputs], name="X")
        X_reshaped = tf.reshape(X, shape=[-1, height, width, channels])
        y = tf.placeholder(tf.int32, shape=[None], name="y")
        training = tf.p…
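For reference, the memory options being compared are selected through the session config. A minimal sketch of how that looks in the TF 1.x API (the exact enum names come from `rewriter_config.proto` and may vary across versions):

```python
import tensorflow as tf
from tensorflow.core.protobuf import rewriter_config_pb2

# Turn on Grappler's memory optimizer via the session config (TF 1.x).
# SWAPPING_HEURISTICS is one of several modes; others include
# RECOMPUTATION_HEURISTICS and HEURISTICS.
rewrite_options = rewriter_config_pb2.RewriterConfig(
    memory_optimization=rewriter_config_pb2.RewriterConfig.SWAPPING_HEURISTICS)
graph_options = tf.GraphOptions(rewrite_options=rewrite_options)
config = tf.ConfigProto(graph_options=graph_options)

# sess = tf.Session(config=config)
```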

[XLA Study] How to use XLA AOT compilation in TensorFlow

This document explains how to use AOT compilation in TensorFlow. We will use tfcompile, a standalone tool that ahead-of-time (AOT) compiles TensorFlow graphs into executable code. It can reduce the total binary size and also avoid some runtime overheads. A typical use case of tfcompile is to compile an inference graph into executable code for mobile devices. The steps are as follows:

1. Build the tool: tfcompile
> bazel build --config=opt --config=cuda //tensorflow/compiler/aot:tfcompile
2. Run the following script to build the graph & config files:
import argparse
import os
import sys

from tensorflow.core.protobuf import saver_pb2
from tensorflow.python.client import session
from tensorflow.python.framework import constant_op
from tensorflow.python.framework import dtypes
from tensorflow.python.framework import function
from tensorflow.python.framework import ops
from tensorflow.python.ops import array_ops
from tensor…
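tfcompile also needs a config file that names the graph's feeds and fetches (a `tf2xla.Config` text proto). A minimal sketch is below; the node names and shape are hypothetical placeholders, so substitute the ones from your own graph:

```
# config.pbtxt -- tf2xla.Config text proto (node names are examples only)
feed {
  id { node_name: "x_hold" }
  shape {
    dim { size: 1 }
    dim { size: 2 }
  }
}
fetch {
  id { node_name: "result" }
}
```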

[XLA Study] Take a glance at the graph changes in XLA JIT compilation

As a preamble to this article: understanding XLA JIT is pretty hard, because you probably need to understand the TensorFlow Graph, the Executor, LLVM, and some math... I have been through this painful study, so I hope my experience can help those who are interested in XLA but have not yet gotten their heads around it.

First, I use the following code to build my TF graph.
W = tf.get_variable(shape=[], name='weights')
b = tf.get_variable(shape=[], name='bias')
x_observed = tf.placeholder(shape=[None], dtype=tf.float32, name='x_observed')
y_pred = W * x_observed + b

learning_rate = 0.025
y_observed = tf.placeholder(shape=[None], dtype=tf.float32, name='y_observed')
loss_op = tf.reduce_mean(tf.square(y_pred - y_observed))
optimizer_op = tf.train.GradientDescentOptimizer(learning_rate)
train_op = optimizer_op.minimize(loss_op)

I try to dump all the temporary graphs during the XLA JIT compilation by the fol…
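For context, JIT compilation is switched on through the session's optimizer options, and the intermediate graphs can be dumped via environment variables. A sketch using the TF 1.x API (flag and variable names vary across versions, so treat the dump paths as examples):

```python
import tensorflow as tf

# Enable XLA JIT globally for this session (TF 1.x).
config = tf.ConfigProto()
config.graph_options.optimizer_options.global_jit_level = (
    tf.OptimizerOptions.ON_1)

# sess = tf.Session(config=config)
# Setting TF_DUMP_GRAPH_PREFIX=/tmp/graphs (and, in newer versions,
# XLA_FLAGS=--xla_dump_to=/tmp/xla) before running makes TensorFlow/XLA
# write the intermediate graphs to those directories.
```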

[TX2 Study] My first try on the Jetson TX2

I got a Jetson TX2 several days ago from my friend, and it looks like the following pictures. I set it up using Nvidia's installation tool, JetPack-L4T-3.2. During the installation I did encounter some issues with not being able to set up the IP address on the TX2, but I resolved them. If anyone still has this issue, let me know and I will post another article explaining the steps to resolve it.

Basically, on the TX2 there is no "nvidia-smi"-style command-line tool for checking the GPU's status; you need to use the tools below instead:

1. Use deviceQuery to get hardware information
nvidia@tegra-ubuntu:~$ /usr/local/cuda-9.0/bin/ .
nvidia@tegra-ubuntu:~$ cd NVIDIA_CUDA-9.0_Samples/1_Utilities/deviceQuery
nvidia@tegra-ubuntu:~/NVIDIA_CUDA-9.0_Samples/1_Utilities/deviceQuery$ make
nvidia@tegra-ubuntu:~/NVIDIA_CUDA-9.0_Samples/1_Utilities/deviceQuery$ ./deviceQuery
2. Use tegrastats ( in user: nvidia's home directory ) to g…

[Caffe] Try out Caffe with Python code

This document is just a testing record of trying out Caffe with Python code. I refer to this blog. With Python, we can easily access every data-flow blob in the layers, including the diff blobs, weight blobs, and bias blobs. This makes it convenient to understand how the weights change during the training phase and what is done in each step.

import os
import numpy as np
import caffe

caffe_root = '/home/liudanny/git/caffe'
os.chdir(caffe_root)
solver_prototxt = 'examples/mnist/lenet_solver.prototxt'
solver = caffe.SGDSolver(solver_prototxt)
net = solver.net

# print out all the data flow blobs
[(k, v.data.shape) for k, v in net.blobs.items()]
[('data', (64, 1, 28, 28)), ('label', (64,)), ('conv1', (64, 20, 24, 24)), ('pool1', (64, 20, 12, 12)), ('conv2', (64, 50, 8, 8)), ('pool2', (64, 50, 4, 4)), ('ip1', (64, 500)), ('ip2', (64, 10)), ('loss', ())]

# print out all the diff blobs
[(k, v.diff.shape) for k, v in net.blobs.items()]
Out[20]: [('data', (64, 1, 28, 28)), ('labe…