google-ai-edge/ai-edge-quantizer


A quantizer for advanced developers to quantize converted LiteRT models. It aims to help advanced users achieve optimal performance on resource-demanding models (e.g., GenAI models).

Build status: Unit Tests (Linux), Nightly Release, Nightly Colab

  • Python versions: 3.9, 3.10, 3.11, 3.12
  • Operating systems: Linux, macOS
  • TensorFlow: tf-nightly

Nightly PyPI package:

pip install ai-edge-quantizer-nightly

The quantizer requires two inputs:

  1. An unquantized source LiteRT model (FP32 data type, in the FlatBuffers format with the .tflite extension)
  2. A quantization recipe (details below)

and outputs a quantized LiteRT model that's ready for deployment on edge devices.

In a nutshell, the quantizer works according to the following steps:

  1. Instantiate a Quantizer class. This is the user's entry point to the quantizer's functionality.
  2. Load a desired quantization recipe (details in subsection).
  3. Quantize (and save) the model. This is where most of the quantizer's internal logic works.
from ai_edge_quantizer import quantizer
from ai_edge_quantizer import recipe

qt = quantizer.Quantizer("path/to/input/tflite")
qt.load_quantization_recipe(recipe.dynamic_wi8_afp32())
qt.quantize().export_model("/path/to/output/tflite")

Please see the getting started colab for a quick-start guide covering these three steps, and the selective quantization colab for more details on advanced features.

Please refer to the LiteRT documentation for ways to generate LiteRT models from JAX, PyTorch, and TensorFlow. The input source model must be an FP32 (unquantized) model in the FlatBuffers format with the .tflite extension.

The user specifies, via AI Edge Quantizer's API, a quantization recipe to apply to the source model. The quantization recipe encodes all the information on how a model is to be quantized, such as the number of bits, data type, symmetry, and scope name.

Essentially, a quantization recipe is a collection of commands of the following form:

“Apply Quantization Algorithm X on Operator Y under Scope Z with ConfigN”.

For example:

"Uniformly quantize the FullyConnected op under scope 'dense1/' with INT8 symmetric with Dynamic Quantization".

All unspecified ops are kept as FP32 (unquantized). The scope of an operator in TFLite is defined as the output tensor name of the op, which preserves the hierarchical model information from the source model (e.g., scope in TF). The best way to obtain a scope name is to visualize the model with Model Explorer.
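To make the "Algorithm X on Operator Y under Scope Z with ConfigN" structure concrete, here is a JSON-style sketch of one recipe entry for the FullyConnected example above. The field names (`regex`, `operation`, `algorithm_key`, `op_config`, and so on) are assumptions modeled on the configuration table later in this document; consult OpQuantizationRecipe in recipe_manager.py for the authoritative schema.

```python
import json

# Hypothetical sketch of a single recipe entry; field names are
# assumptions, not the verified schema (see recipe_manager.py).
recipe_entry = {
    "regex": ".*/dense1/.*",                      # Scope Z: match ops by output tensor name
    "operation": "FULLY_CONNECTED",               # Operator Y
    "algorithm_key": "min_max_uniform_quantize",  # Algorithm X (assumed key name)
    "op_config": {                                # ConfigN
        "weight_tensor_config": {
            "num_bits": 8,
            "symmetric": True,
            "granularity": "CHANNELWISE",
            "dtype": "INT",
        },
        "compute_precision": "INTEGER",  # dynamic quantization: integer compute,
        "explicit_dequantize": False,    # activations quantized on the fly
    },
}

# A full recipe is a list of such entries, serializable as JSON.
print(json.dumps([recipe_entry], indent=2))
```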

The simplest way to get started is to use one of the existing recipes in recipe.py. This is demonstrated in the getting started colab example.

Please refer to the LiteRT deployment documentation for ways to deploy a quantized LiteRT model.

There are many ways the user can configure and customize the quantization recipe beyond using a template in recipe.py. For example, the user can configure the recipe to achieve these features:

  • Selective quantization (exclude selected ops from being quantized)
  • Flexible mixed scheme quantization (mixture of different precision, compute precision, scope, op, config, etc)
  • 4-bit weight quantization

The selective quantization colab shows some of these more advanced features.

For specifics of the recipe schema, please refer to OpQuantizationRecipe in recipe_manager.py.

For advanced usage involving mixed quantization, the following APIs may be useful:

  • Use Quantizer.load_quantization_recipe() in quantizer.py to load a custom recipe.
  • Use Quantizer.update_quantization_recipe() in quantizer.py to extend or override specific parts of the recipe.
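Conceptually, a loaded recipe is an ordered list of entries, and updating it appends entries that take precedence for the ops they match. The toy resolver below is illustrative only (it assumes last-match-wins semantics and is not the library's implementation), but it sketches how an appended entry can selectively exclude one scope from an otherwise-global recipe:

```python
import re

# Toy model of recipe resolution (illustrative only; not the library's
# implementation). Each entry says: for ops of type `operation` whose
# scope matches `regex`, use `config`. We assume later entries override
# earlier matches, mirroring how extending a recipe overrides parts of it.
def resolve(recipe_entries, op_name, scope):
    config = None  # unmatched ops stay FP32 (unquantized)
    for entry in recipe_entries:
        if entry["operation"] == op_name and re.fullmatch(entry["regex"], scope):
            config = entry["config"]
    return config

# Base recipe: dynamically quantize every FULLY_CONNECTED op.
base = [{"regex": ".*", "operation": "FULLY_CONNECTED",
         "config": "DYNAMIC_WI8_AFP32"}]

# Selective quantization: additionally exclude ops under scope 'dense1/'.
updated = base + [{"regex": "dense1/.*", "operation": "FULLY_CONNECTED",
                   "config": None}]

print(resolve(updated, "FULLY_CONNECTED", "dense2/out"))   # DYNAMIC_WI8_AFP32
print(resolve(updated, "FULLY_CONNECTED", "dense1/out"))   # None (kept as FP32)
```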

The table below outlines the allowed configurations for available recipes.

| Config | | DYNAMIC_WI8_AFP32 | DYNAMIC_WI4_AFP32 | STATIC_WI8_AI16 | STATIC_WI4_AI16 | STATIC_WI8_AI8 | STATIC_WI4_AI8 | WEIGHTONLY_WI8_AFP32 | WEIGHTONLY_WI4_AFP32 |
|---|---|---|---|---|---|---|---|---|---|
| activation | num_bits | None | None | 16 | 16 | 8 | 8 | None | None |
| | symmetric | None | None | TRUE | TRUE | [TRUE, FALSE] | [TRUE, FALSE] | None | None |
| | granularity | None | None | TENSORWISE | TENSORWISE | TENSORWISE | TENSORWISE | None | None |
| | dtype | None | None | INT | INT | INT | INT | None | None |
| weight | num_bits | 8 | 4 | 8 | 4 | 8 | 4 | 8 | 4 |
| | symmetric | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | [TRUE, FALSE] | [TRUE, FALSE] |
| | granularity | [CHANNELWISE, TENSORWISE] | [CHANNELWISE, TENSORWISE] | [CHANNELWISE, TENSORWISE] | [CHANNELWISE, TENSORWISE] | [CHANNELWISE, TENSORWISE] | [CHANNELWISE, TENSORWISE] | [CHANNELWISE, TENSORWISE] | [CHANNELWISE, TENSORWISE] |
| | dtype | INT | INT | INT | INT | INT | INT | INT | INT |
| explicit_dequantize | | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE | TRUE |
| compute_precision | | INTEGER | INTEGER | INTEGER | INTEGER | INTEGER | INTEGER | FLOAT | FLOAT |
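The recipe names above follow a consistent pattern: a mode (DYNAMIC, STATIC, or WEIGHTONLY), a weight spec (WI8 = int8 weights, WI4 = int4), and an activation spec (AFP32 = float32 activations, AI16/AI8 = int16/int8). A small helper (illustrative only; the pattern is inferred from the table, not an official API) makes the convention explicit:

```python
import re

# Decode a recipe name such as "STATIC_WI4_AI16" into its parts.
# Illustrative only: the naming pattern is inferred from the table above.
def decode(name):
    m = re.fullmatch(r"(DYNAMIC|STATIC|WEIGHTONLY)_W(I\d+)_A(FP\d+|I\d+)", name)
    if m is None:
        raise ValueError(f"unrecognized recipe name: {name}")
    mode, weights, activations = m.groups()

    def spec(s):
        # "I8" -> "int8", "FP32" -> "float32"
        return ("float" if s.startswith("FP") else "int") + re.sub(r"\D", "", s)

    return {"mode": mode, "weights": spec(weights), "activations": spec(activations)}

print(decode("DYNAMIC_WI8_AFP32"))  # {'mode': 'DYNAMIC', 'weights': 'int8', 'activations': 'float32'}
print(decode("STATIC_WI4_AI16"))    # {'mode': 'STATIC', 'weights': 'int4', 'activations': 'int16'}
```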

Operators Supporting Quantization

The following operators support quantization; which of the recipe configurations above apply varies per operator:

  • FULLY_CONNECTED
  • CONV_2D
  • BATCH_MATMUL
  • EMBEDDING_LOOKUP
  • DEPTHWISE_CONV_2D
  • AVERAGE_POOL_2D
  • RESHAPE
  • SOFTMAX
  • TANH
  • TRANSPOSE
  • GELU
  • ADD
  • CONV_2D_TRANSPOSE
  • SUB
  • MUL
  • MEAN
  • RSQRT
  • CONCATENATION
  • STRIDED_SLICE
  • SPLIT
  • LOGISTIC
  • SLICE
  • SELECT_V2
  • SUM

About

AI Edge Quantizer: flexible post training quantization for LiteRT models.
