Wednesday, June 12, 2019

[TensorFlow] Build TensorFlow from source with Intel MKL enabled

This post is a quick summary, based on the document "Intel® Math Kernel Library for Deep Learning Networks: Part 1–Overview and Installation", of the steps for building TensorFlow from source with Intel MKL enabled.

For MKL DNN:
# Install Intel MKL & MKL-DNN
$ git clone https://github.com/01org/mkl-dnn.git
$ cd mkl-dnn
$ cd scripts && ./prepare_mkl.sh && cd ..
$ mkdir -p build && cd build && cmake .. && make -j$(nproc)
$ make test
$ sudo make install
For building TensorFlow:

# Build TensorFlow from source with MKL enabled
$ git clone https://github.com/tensorflow/tensorflow.git
$ cd tensorflow && git checkout r1.13
$ ./configure
...
$ bazel build --config=mkl \
  --config=opt \
  --copt=-O3 \
  --cxxopt=-D_GLIBCXX_USE_CXX11_ABI=0 \
  //tensorflow/tools/pip_package:build_pip_package
For installing TensorFlow:
# Install TensorFlow
$ rm -rf /tmp/tensorflow_pkg
$ bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
$ pip3 uninstall -y tensorflow
$ pip3 install /tmp/tensorflow_pkg/tensorflow-1.13.1-cp35-cp35m-linux_x86_64.whl
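
The wheel file name above is the one from my build (TensorFlow 1.13.1, Python 3.5); with another Python version the name will differ. As a small sketch, assuming the wheel was written to /tmp/tensorflow_pkg as in the steps above, a helper can pick up whatever wheel is there. The function name newest_wheel is my own, not part of TensorFlow or pip.

```shell
#!/usr/bin/env bash
# newest_wheel is a hypothetical helper (not part of TensorFlow or pip).
# It prints the most recently modified tensorflow-*.whl in the given directory,
# or nothing if the directory has no such file.
newest_wheel() {
  ls -t "$1"/tensorflow-*.whl 2>/dev/null | head -n 1
}

# Assumption: the package was built into /tmp/tensorflow_pkg as shown above.
WHEEL=$(newest_wheel /tmp/tensorflow_pkg || true)
if [ -n "$WHEEL" ]; then
  # Replace echo with the real command once the path looks right:
  #   pip3 install "$WHEEL"
  echo "would install: $WHEEL"
fi
```

This only prints the path, so it is safe to run before actually installing anything.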

From now on, we can run TensorFlow Python code with Intel MKL.
For performance tuning, please refer to this document:
https://www.tensorflow.org/guide/performance/overview

Tuning MKL for the best performance

This section details the different configurations and environment variables that can be used to tune MKL for optimal performance. Before tweaking the environment variables, make sure the model is using the NCHW (channels_first) data format. MKL is optimized for NCHW, and Intel is working to get near performance parity when using NHWC.
MKL uses the following environment variables to tune performance:
  • KMP_BLOCKTIME - Sets the time, in milliseconds, that a thread should wait, after completing the execution of a parallel region, before sleeping.
  • KMP_AFFINITY - Enables the run-time library to bind threads to physical processing units.
  • KMP_SETTINGS - Enables (true) or disables (false) the printing of OpenMP* run-time library environment variables during program execution.
  • OMP_NUM_THREADS - Specifies the number of threads to use.
More details on the KMP variables are on Intel's site, and on the OMP variables at gnu.org.
While there can be substantial gains from adjusting the environment variables (see the guide linked above for details), the simplified advice is to set inter_op_parallelism_threads equal to the number of physical CPUs and to set the following environment variables:
  • KMP_BLOCKTIME=0
  • KMP_AFFINITY=granularity=fine,verbose,compact,1,0
Example of setting the MKL variables on the command line:
$ KMP_BLOCKTIME=0 KMP_AFFINITY=granularity=fine,verbose,compact,1,0 \
KMP_SETTINGS=1 python your_python_script.py
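
To follow the "number of physical CPUs" advice you first need that number. The following is a rough sketch, assuming a Linux machine with lscpu available: it counts unique (core, socket) pairs and exports the KMP variables from above. The function name count_physical_cores is my own, and setting OMP_NUM_THREADS to the physical core count is my assumption, not something the guide prescribes here.

```shell
#!/usr/bin/env bash
# count_physical_cores is a hypothetical helper: it reads "core,socket" lines
# (the format printed by `lscpu -p=Core,Socket`) on stdin, skips the '#'
# comment lines, and counts the unique pairs.
count_physical_cores() {
  grep -v '^#' | sort -u | wc -l | tr -d ' '
}

if command -v lscpu >/dev/null 2>&1; then
  PHYSICAL_CORES=$(lscpu -p=Core,Socket | count_physical_cores)
fi
# Fall back to 1 if lscpu is missing or printed nothing.
PHYSICAL_CORES=${PHYSICAL_CORES:-1}

# Values taken from the simplified advice above.
export KMP_BLOCKTIME=0
export KMP_AFFINITY=granularity=fine,verbose,compact,1,0
export KMP_SETTINGS=1
# Assumption: one OpenMP thread per physical core.
export OMP_NUM_THREADS="$PHYSICAL_CORES"

echo "physical cores: $PHYSICAL_CORES"
# Then start the script in the same shell, e.g.:
#   python your_python_script.py
```

Note that inter_op_parallelism_threads is a TensorFlow session option, not an environment variable, so the printed number still has to be passed to it inside your Python code.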
P.S.:
To see which CPU instruction sets are supported on your machine:
$ gcc -march=native -dM -E - < /dev/null | grep -E "SSE|AVX" | sort
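
If gcc is not installed, a similar list can be read from /proc/cpuinfo. This is only a sketch, assuming an x86 Linux machine where /proc/cpuinfo has a "flags" line; the function name list_simd_flags is my own.

```shell
#!/usr/bin/env bash
# list_simd_flags is a hypothetical helper: it reads /proc/cpuinfo-style text
# on stdin and prints the unique flag names starting with sse, ssse or avx.
list_simd_flags() {
  grep -E '^flags' | tr ' ' '\n' | grep -E '^(sse|ssse|avx)' | sort -u
}

# Assumption: x86 Linux. On other platforms this prints nothing.
if [ -r /proc/cpuinfo ]; then
  list_simd_flags < /proc/cpuinfo || true
fi
```

The names come out in lower case (for example avx2 or sse4_2), unlike the __AVX2__ style macros printed by gcc.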




