Monday, July 9, 2018

[TensorFlow] How to implement LMDBDataset in tf.data API?

I have finished implementing LMDBDataset in the tf.data API. It may not be a bug-free component, but it is my first attempt at implementing both the C++ and Python sides of a TensorFlow op. The API architecture looks like this:

[Figure: LMDBDataset API architecture diagram]
The whole implemented code is in my fork's TensorFlow repo with branch r1.8:
https://github.com/teyenliu/tensorflow/tree/r1.8

If you want to see what's implemented, please check it out:
https://github.com/teyenliu/tensorflow/commit/3941debe3001d52fe9a6d4048bd679a5a1f0f075

Basically, it can be used in the same way as TFRecordDataset or TextLineDataset.
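For example, a typical TFRecordDataset input pipeline in the r1.8-era tf.data style looks like the sketch below. The filename is a placeholder, and the commented-out LMDBDataset line shows how the fork's new op is intended to drop in (assuming TensorFlow is built from the r1.8 branch linked above):

```python
import tensorflow as tf

# Placeholder filename, not an actual file from this post.
filenames = ["train.tfrecords"]

# Standard tf.data pipeline: shuffle, repeat for N epochs, then batch.
dataset = tf.data.TFRecordDataset(filenames)
dataset = dataset.shuffle(buffer_size=10000)
dataset = dataset.repeat(10)   # number of epochs
dataset = dataset.batch(128)

# The fork's LMDBDataset is meant to be a drop-in replacement, e.g.:
# dataset = tf.data.LMDBDataset(["train.mdb"])
```

The point of the design is that downstream code (iterators, training loops) does not change; only the dataset constructor does.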

By the way, I also provide some samples for those who want to benchmark the performance of TFRecordDataset, LMDBDataset, and others. Please check the following:
https://github.com/teyenliu/tensorflow/tree/r1.8/tensorflow/examples/how_tos/reading_data

convert_to_records_lmdb.py: converts the MNIST data into LMDB format, yielding datapoints.

fully_connected_reader_lmdb.py: trains a fully connected neural net on the MNIST data stored in LMDB; it adds a new argument, perf, for measuring only the performance of the input data pipeline.

Example 1: to train on the MNIST dataset, you may give the following command:

$ python fully_connected_reader_lmdb.py --train_dir ./lmdb_data --num_epochs 10 --batch_size 128 --perf training

Example 2: to check the performance of the data pipeline on the MNIST dataset, you may give the following command:

$ python fully_connected_reader_lmdb.py --train_dir ./lmdb_data --num_epochs 10 --batch_size 128 --perf datapipeline

The performance results show that the TFRecordDataset API is still faster than the others in my speed tests.
