Getting Started with TensorFlow Lite on reTerminal

TensorFlow Lite is a set of tools that enables on-device machine learning by helping developers run their models on mobile, embedded, and IoT devices. It is optimized for on-device machine learning, with a focus on latency, privacy, connectivity, size, and power consumption. The framework supports multiple platforms, including Android and iOS devices, embedded Linux, and microcontrollers, and a variety of languages, such as Java, Swift, Objective-C, C++, and Python. It achieves high performance through hardware acceleration and model optimization, and it provides end-to-end examples for common machine learning tasks, such as image classification, object detection, pose estimation, question answering, and text classification, on multiple platforms.

TensorFlow Lite Runtime Package Installation

The tflite_runtime package is a smaller, simplified Python package that includes the bare minimum code required to run inference with TensorFlow Lite. This package is ideal when all you want to do is execute .tflite models and avoid wasting disk space with the large TensorFlow library.

For best performance, use a 64-bit OS and the corresponding TFLite package with the optimized XNNPACK delegate enabled. You can compile the package from source yourself or install the pre-built binaries provided by Seeed Studio. Alternatively, you can install the latest stable version with pip.

Latest stable version (official builds)

pip3 install --index-url https://google-coral.github.io/py-repo/ tflite_runtime

Performance optimized package for 64-bit OS with XNNPACK enabled

Official pre-built wheels for Python 3.7 on a 64-bit OS with XNNPACK optimizations were not available at the time of writing, so we compiled and shared them ourselves.

wget www.files.seeedstudio.com/ml/TFLite/tflite_runtime-2.6.0-cp37-cp37m-linux_aarch64.whl
pip3 install tflite_runtime-2.6.0-cp37-cp37m-linux_aarch64.whl
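
The wheel above targets one specific Python version and CPU architecture, and both are encoded in its file name. As a minimal sketch, the tags can be compared against the running interpreter before installing. The helper names parse_wheel_name and wheel_fits are illustrative and not part of pip or tflite_runtime, and the check is far simpler than pip's real compatibility logic.

```python
import platform
import sys

def parse_wheel_name(filename):
    """Split a wheel file name into (name, version, python_tag, abi_tag, platform_tag)."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel file: " + filename)
    parts = filename[:-len(".whl")].split("-")
    if len(parts) < 5:
        raise ValueError("unexpected wheel file name: " + filename)
    # The last three fields are the Python, ABI and platform tags.
    return parts[0], parts[1], parts[-3], parts[-2], parts[-1]

def wheel_fits(filename, python_tag, machine):
    """Return True if the wheel's Python tag and platform tag match the given values."""
    _, _, wheel_python, _, wheel_platform = parse_wheel_name(filename)
    return wheel_python == python_tag and wheel_platform.endswith("_" + machine)

WHEEL = "tflite_runtime-2.6.0-cp37-cp37m-linux_aarch64.whl"
this_python = "cp{}{}".format(*sys.version_info[:2])  # e.g. "cp37" on Python 3.7
this_machine = platform.machine()                      # typically "aarch64" on a 64-bit OS
print(WHEEL, "fits this system:", wheel_fits(WHEEL, this_python, this_machine))
```

If the check prints False, the official build installed with pip is the safer choice.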

After the installation is complete, try importing the tflite_runtime package:

pi@raspberrypi:~ $ python3
Python 3.7.3 (default, Jul 25 2020, 13:03:44) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tflite_runtime
>>> tflite_runtime.__version__
'2.6.0'
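
Once the import works, a model can be loaded and run with the Interpreter class from tflite_runtime. The following is a minimal sketch rather than one of the demo scripts. The function names load_interpreter and run_once, the file name model.tflite, and the choice of four threads are assumptions made for illustration.

```python
def load_interpreter(model_path, num_threads=4):
    """Create a TFLite interpreter for model_path and allocate its tensors."""
    # Imported here so that this file can be loaded on machines without tflite_runtime.
    from tflite_runtime.interpreter import Interpreter
    interpreter = Interpreter(model_path=model_path, num_threads=num_threads)
    interpreter.allocate_tensors()
    return interpreter

def run_once(interpreter, input_data):
    """Feed one input tensor, run inference, and return the first output tensor."""
    input_index = interpreter.get_input_details()[0]["index"]
    output_index = interpreter.get_output_details()[0]["index"]
    interpreter.set_tensor(input_index, input_data)
    interpreter.invoke()
    return interpreter.get_tensor(output_index)

# Example usage on the device (assumes a hypothetical "model.tflite" and NumPy installed):
#   import numpy as np
#   interpreter = load_interpreter("model.tflite")
#   details = interpreter.get_input_details()[0]
#   dummy = np.zeros(details["shape"], dtype=details["dtype"])
#   print(run_once(interpreter, dummy).shape)
```

The input must match the shape and dtype reported by get_input_details(), which is why the usage comment builds the dummy input from those fields.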

Examples

The TFLite Converter can convert any TensorFlow model into the .tflite format, provided that the model consists only of operations supported by the TFLite Runtime. The following is a list of demos currently tested on reTerminal; it will be expanded and completed in the future:

Model	Result	Comments
Object Detection		Demo: Vehicle Detection. Jupyter Notebook. Example scripts. alpha 0.25, 224x224: 66.7 FPS (15 ms); alpha 0.5, 224x224: 40 FPS (25 ms); alpha 0.75, 320x320: 14.9 FPS (67 ms); alpha 1.0, 320x320: 10.4 FPS (96 ms)
Image Classification		Demo: Industrial Conveyor Rip Identification. Jupyter Notebook. Example scripts.
Semantic segmentation		Demo: Lung segmentation. Jupyter Notebook. Example scripts.
Face age/gender recognition		Demo: Multi-stage inference: MobileNet YOLOv3 alpha 0.25 -> MobileFaceNet. GitHub repository. Example scripts. ~16-20 FPS (with ARM NN)
Face expression recognition		Demo: Multi-stage inference: MobileNet YOLOv3 alpha 0.25 -> MobileFaceNet. GitHub repository. Example scripts. ~11 FPS
Face anti-spoofing		Demo: Multi-stage inference: MobileNet YOLOv3 alpha 0.25 -> MobileNet v1 alpha 0.25. Jupyter Notebook. Example scripts. ~23 FPS (ARM NN)
Face Recognition		Demo: Multi-stage inference: Ultra Light Face Detector with Landmark Detection -> MobileFaceNet. Jupyter Notebook. Example scripts. ~15 FPS (ARM NN)
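
The FPS figures in the table are the reciprocal of the per-frame latency. The sketch below shows that conversion for the object detection row, plus a simple timing loop that could be wrapped around any inference call. The function names and the warm-up and run counts are illustrative assumptions, not the method used to produce the numbers above.

```python
import time

def latency_ms_to_fps(latency_ms):
    """Convert a per-frame latency in milliseconds to frames per second."""
    return 1000.0 / latency_ms

def measure_latency_ms(fn, runs=50, warmup=5):
    """Call fn repeatedly and return the mean time per call in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up calls are not timed
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) * 1000.0 / runs

# Latencies taken from the object detection row of the table.
for ms in (15, 25, 67, 96):
    print("{} ms -> {:.1f} FPS".format(ms, latency_ms_to_fps(ms)))
```

On the device, fn would typically be a small function that calls the interpreter's invoke() method on a prepared input.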

Further optimization

The FPS and inference results in the Examples table are given for inference with INT8 quantized models in TensorFlow Lite, unless stated otherwise.
Since reTerminal is Raspberry Pi 4 based, it has no additional hardware accelerators for neural network inference, so only standard optimization methods for CPU inference can be applied. The video overview of this topic is presented here:

Below is a brief overview of CPU inference optimization methods:

1) Designing smaller networks. If the goal is simple enough (image classification of < 100 classes or object detection of < 10 classes or similar), a smaller network can achieve acceptable accuracy and run very fast. For example, MobileNet v1 alpha 0.25 YOLOv2 network trained to detect only one class of objects (human faces) achieves 62.5 FPS without any further optimization.

Vanilla TensorFlow Lite FP32 inference:
MobileNetv1 (alpha 0.25) YOLOv2, 1 class, 0.89 MB: 62.5 FPS
MobileNetv1 (alpha 1.0) YOLOv3, 20 classes, 13.1 MB: 7 FPS

2) Quantization. Quantization is the process of reducing the precision of neural network weights, usually from FP32 to INT8. It reduces the model size by 4x and latency by ~60-80% using the default TensorFlow Lite kernels. Accuracy loss can be minimized with quantization-aware training (QAT), which is the process of fine-tuning a network with quantization nodes inserted.

Vanilla TensorFlow Lite INT8 inference:
MobileNetv1 (alpha 0.25) YOLOv2, 1 class, 0.89 MB: 77 FPS
MobileNetv1 (alpha 1.0) YOLOv3, 20 classes, 13.1 MB: 11.5 FPS
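
The arithmetic behind FP32 to INT8 quantization can be sketched in a few lines. The example below uses the affine mapping real = scale * (q - zero_point). The function names and sample values are illustrative assumptions, and this pure-Python version only demonstrates the idea; it is not the TensorFlow Lite converter.

```python
def choose_int8_params(min_val, max_val, qmin=-128, qmax=127):
    """Pick a scale and zero point so that [min_val, max_val] maps onto [qmin, qmax]."""
    # The represented range must contain 0.0 so that zero is exactly representable.
    min_val = min(min_val, 0.0)
    max_val = max(max_val, 0.0)
    if max_val == min_val:
        return 1.0, 0  # all-zero tensor: any scale works
    scale = (max_val - min_val) / (qmax - qmin)
    zero_point = int(round(qmin - min_val / scale))
    return scale, zero_point

def quantize(values, scale, zero_point, qmin=-128, qmax=127):
    """Map floats to integers in [qmin, qmax]: q = round(x / scale) + zero_point."""
    return [max(qmin, min(qmax, int(round(v / scale)) + zero_point)) for v in values]

def dequantize(q_values, scale, zero_point):
    """Map integers back to approximate floats: x = scale * (q - zero_point)."""
    return [scale * (q - zero_point) for q in q_values]

weights = [0.0, 1.0, 2.55]  # sample FP32 values, 4 bytes each
scale, zero_point = choose_int8_params(min(weights), max(weights))
q = quantize(weights, scale, zero_point)  # INT8 values, 1 byte each
print(q, dequantize(q, scale, zero_point))
```

Each weight shrinks from 4 bytes to 1 byte, which is where the 4x size reduction comes from. The rounding error per value is at most half of the scale.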

3) Using optimized kernels. Inference speed can be improved by using frameworks that have CNN kernels optimized for a specific CPU instruction set, e.g. NEON SIMD instructions for ARM. Examples of such frameworks include ARM NN and XNNPACK.

Arm NN SDK is a set of open-source software and tools that enables machine learning workloads on power-efficient devices. The description and provided benchmarks look promising, but the installation procedure on the latest Raspberry Pi OS is painful at the moment: the only proper way to install the latest version of ARM NN is currently to cross-compile it from source. There are binaries available for Debian Bullseye, but Raspberry Pi OS is still based on Debian Buster. The inference test results with benchmark scripts were mixed: for a single model, ARM NN performed worse than even vanilla TensorFlow Lite, but it turned out to be faster in multi-model inference, possibly due to more efficient multi-processing utilization.

ARM NN FP32 inference:
MobileNetv1 (alpha 0.25) YOLOv2, 1 class, 0.89 MB: 83 FPS
MobileNetv1 (alpha 1.0) YOLOv3, 20 classes, 13.1 MB: 7.2 FPS

XNNPACK is a library for accelerating neural network inference on ARM, x86, and WebAssembly architectures in Android, iOS, Windows, Linux, and macOS environments. It is integrated into TensorFlow Lite as a delegate, which is enabled by default for Android builds but needs to be enabled manually for other environments. Thus, if you'd like to use XNNPACK on Raspberry Pi 4, you'll need either to build the TensorFlow Lite Interpreter package from source or to download one of the third-party binaries, such as the one we provide above.

XNNPACK delegate TensorFlow Lite FP32 inference:
MobileNetv1 (alpha 0.25) YOLOv2, 1 class, 0.89 MB: 83 FPS
MobileNetv1 (alpha 1.0) YOLOv3, 20 classes, 13.1 MB: 7.2 FPS
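
To put the figures listed so far side by side, the sketch below computes the speedup of each configuration relative to vanilla FP32 inference for the smaller model (MobileNetv1 alpha 0.25 YOLOv2, 1 class). The FPS values are copied from the results in this section; the dictionary keys and the function name are illustrative.

```python
# FPS for MobileNetv1 (alpha 0.25) YOLOv2, 1 class, as listed in this section.
fps = {
    "Vanilla TFLite FP32": 62.5,
    "Vanilla TFLite INT8": 77.0,
    "ARM NN FP32": 83.0,
    "XNNPACK delegate FP32": 83.0,
}

def speedups_vs(baseline_name, results):
    """Return each configuration's FPS divided by the baseline FPS."""
    baseline = results[baseline_name]
    return {name: value / baseline for name, value in results.items()}

speedups = speedups_vs("Vanilla TFLite FP32", fps)
for name, factor in speedups.items():
    print("{:<24} {:5.1f} FPS  {:.2f}x".format(name, fps[name], factor))
```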

The main problem with optimized kernels is the uneven support for different architectures, NN operators, and precision types across frameworks. For example, INT8 optimized kernels are a work in progress in both ARM NN and XNNPACK. Support for INT8 optimized kernels in XNNPACK was added very recently and seems to bring a modest performance improvement of about 30%, depending on the operators used in the model. https://github.com/google/XNNPACK/issues/999#issuecomment-870791779

Another promising lead is optimized kernels for dynamically quantized models; see the conversation with the developer here: https://github.com/tensorflow/tensorflow/pull/48751#issuecomment-869111116

The developer claims a 3-4x latency improvement, but it is currently limited to a very specific set of models. A PR to allow more convenient usage is in development.

4) Pruning and sparse inference. Pruning is the process of fine-tuning a trained neural network to find weights that do not contribute to correct predictions. This reduces both the size and the latency of the model; the accuracy reduction depends on the sparsity settings. Experimentally, it is possible to achieve up to 80% sparsity with negligible impact on accuracy. See details here: https://ai.googleblog.com/2021/03/accelerating-neural-networks-on-mobile.html and a guide to pruning with TensorFlow here: https://www.tensorflow.org/model_optimization/guide/pruning/pruning_for_on_device_inference Unfortunately, in its current form, only a very limited set of models supports pruning and sparse inference with XNNPACK.
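
The notion of sparsity can be illustrated with one-shot magnitude pruning, which zeroes the weights with the smallest absolute values. This is a deliberately simplified sketch with illustrative function names. The procedure in the linked TensorFlow guide prunes gradually during fine-tuning, which the code below does not do.

```python
def magnitude_prune(weights, sparsity):
    """Return a copy of weights with the smallest-magnitude fraction set to zero."""
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    n_zero = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_zero]:
        pruned[i] = 0.0
    return pruned

def sparsity_of(weights):
    """Fraction of weights that are exactly zero."""
    return sum(1 for w in weights if w == 0.0) / len(weights)

weights = [0.9, -0.05, 0.4, 0.01]
pruned = magnitude_prune(weights, 0.5)
print(pruned, sparsity_of(pruned))
```

Zeroed weights only reduce latency when the inference kernels can skip them, which is why sparse inference support in XNNPACK matters.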

F.A.Q

Q1: My company's policy doesn't allow us to use 3rd party binaries

You can use the official TFLite interpreter package or, alternatively, compile it from source by following the instructions here.

Resources

[Web Page] TensorFlow Lite Official Webpage
[Web Page] TensorFlow Lite Official Documentation
