
[Proposal] Add quantized tensor types.


I know there are already some proposals for adding quantized types to the ONNX standard, but I want to clarify the approach. I myself contribute to an ML framework that supports ONNX, namely Glow, an open-source project: https://github.com/pytorch/glow. The pain point right now is that ONNX is not mature enough in terms of quantization support to allow the description of pre-quantized models. Notes:

  • A quantized type is nothing but an integer type (int8, int16, int32) with the addition of 2 parameters which describe the translation to/from a float32 tensor:
    • scale (float32)
    • offset (normally chosen as int32 to accommodate all int8/int16/int32 quantized types) - this is sometimes also referred to as “zero_point”
    • the correspondence between a float value xf and a quantized value xq is: xf = scale * (xq - offset) (a code sketch of this mapping and of the schemes below follows after this list)
  • These quantized types could be named “int8q”, “int16q”, “int32q”. Of course we can also talk about uint8, uint16 or uint32 quantized types, but these are not friendly to the hardware instructions since most instructions use signed arithmetic. This might be one of the reasons TensorFlow Lite moved its infrastructure from uint8 to int8 quantization. We can also go lower than that, for example 4-bit, 2-bit or 1-bit quantization (binary networks).
  • The quantization scheme does not need to be explicitly described in the tensors/operators. From my experience we can talk about the following quantization schemes:
    • Asymmetric - no assumptions/restrictions on the quantization parameters
    • Symmetric - the offset parameter is 0. This has run-time benefits since the complexity of the operators is diminished (no offset subtraction is required at run time)
    • AsymmetricWithPower2Scale - the scale parameter is a power of 2 (e.g. 0.5, 0.125, 1.0 or 2.0), which also has run-time benefits since some operations resolve to up/down bitwise shifts
    • SymmetricWithPower2Scale - the offset parameter is 0 and the scale parameter is a power of 2
    • Note that different quantization schemes have different trade-offs between arithmetic accuracy and run-time complexity.
  • The thing I don’t like is that some new ONNX operators define quantization support by embedding the quantization information in the operators themselves instead of in the tensor types. Examples are ConvInteger and MatMulInteger, where some of the quantization information is carried in inputs like x_zero_point or w_zero_point. I don’t think this is the right approach, since it means new operators are defined just for the purpose of allowing quantized data (although the meaning of the operators is the same), while the quantized nature should be embedded in the tensor types.
  • In my opinion the operator definitions should remain the same, while new tensor types are added for quantized data. The rest of the implementation details (how the operators translate to run-time implementations) should be left to the backends supporting the ONNX format. An example of a quantized tensor type would be:
    type {
      tensor_type {
        elem_type: xx            // identifier for the quantized types "int8q", "int16q" or "int32q"
        scale: 0.23              // scale parameter
        offset: -10              // offset parameter
        shape {
          dim {
            dim_value: 5
          }
        }
      }
    }
  • Worth mentioning is that a new degree of generalization should allow so-called sub-tensor quantization. The examples above assume all the values of a tensor are quantized using the same parameters (scale and offset). Some applications require quantizing tensors at a smaller granularity, meaning that sub-parts of the tensor are quantized with different parameters for a better encoding of the values. This mechanism is beneficial, for example, for depth-wise convolution, where each slice of the weight tensor has an independent histogram and is therefore better represented with its own scale & offset. As a concrete case, a weight matrix could be quantized row-wise, meaning each row is quantized separately, resulting in a matrix of int8 values but with different quantization parameters for each row (see the sketch below). This more general case could also be included in the proto format by having a quantization axis (or axes) and a vector of scales and offsets.
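
As an illustration only (not part of the proposed proto changes; NumPy and the helper names below are assumptions made for the sketch), the scale/offset mapping and the quantization schemes above could look roughly like this in code:

  import numpy as np

  def quantize(xf, scale, offset, dtype=np.int8):
      # xq = round(xf / scale) + offset, clipped to the integer range of dtype
      info = np.iinfo(dtype)
      xq = np.round(xf / scale) + offset
      return np.clip(xq, info.min, info.max).astype(dtype)

  def dequantize(xq, scale, offset):
      # xf = scale * (xq - offset), the correspondence given above
      return scale * (xq.astype(np.float32) - offset)

  # Asymmetric: no restriction on scale/offset, both derived from the observed range.
  def asymmetric_params(xf, dtype=np.int8):
      info = np.iinfo(dtype)
      lo, hi = min(float(xf.min()), 0.0), max(float(xf.max()), 0.0)   # keep 0.0 representable
      scale = max((hi - lo) / (info.max - info.min), 1e-8)
      offset = int(round(info.min - lo / scale))
      return scale, offset

  # Symmetric: offset fixed to 0, so dequantization is a single multiply.
  def symmetric_params(xf, dtype=np.int8):
      info = np.iinfo(dtype)
      scale = max(float(np.abs(xf).max()) / info.max, 1e-8)
      return scale, 0

  # SymmetricWithPower2Scale: round the scale up to the next power of two so
  # rescaling can be implemented with bitwise shifts at run time.
  def symmetric_pow2_params(xf, dtype=np.int8):
      scale, _ = symmetric_params(xf, dtype)
      return float(2.0 ** np.ceil(np.log2(scale))), 0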
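
Under the same assumptions, a sketch of the row-wise (sub-tensor) case from the last note, using a symmetric scheme per row so that only a vector of scales is needed:

  import numpy as np

  def quantize_per_row(w, dtype=np.int8):
      # One (scale, offset) pair per row of the weight matrix; offset is fixed to 0 (symmetric).
      info = np.iinfo(dtype)
      scales = np.maximum(np.abs(w).max(axis=1, keepdims=True), 1e-8) / info.max   # guard all-zero rows
      wq = np.clip(np.round(w / scales), info.min, info.max).astype(dtype)
      return wq, scales.squeeze(1)          # int8 matrix plus one float32 scale per row

  def dequantize_per_row(wq, scales):
      return wq.astype(np.float32) * scales[:, None]

In the proto sketch above this would amount to a quantization axis plus repeated scale/offset fields instead of the single scalar ones.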

I am very interested in the future of the ONNX format and I would like to talk more about this. What do you think? It would be nice if I could have a technical discussion about this feature with someone. Best regards.

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 8 (4 by maintainers)

Top GitHub Comments

3 reactions
sousoux commented, Dec 26, 2020

We now support ONNX as an input format for our toolchain https://github.com/GreenWaves-Technologies/gap_sdk/tree/master/tools/nntool. Having spent a lot of time getting quantization support working well with TFLite, whether imported or determined by our tool, I would say that the comments above on tensor scale and offset annotation are correct. However, I would go one step further. If ONNX is an export/import format (and not primarily a runner format), then the most important element of quantization is missing: not the ability to directly run the ONNX model quantized, but the ability to determine the statistics of the activations without any training data. If ONNX just added min/max/avg/std information to all variable tensors, that would go 100% of the way towards allowing anyone to implement whatever quantization scheme they wanted (see the sketch below). If ONNX supported that, plus some type of function annotation that at least preserves meta-information about what a collection of operators was decomposed from, then ONNX would (at least for us) become the export format of choice. I’m happy to participate in a working group to flesh this out if anyone is interested. Neither of these points involves changing how any of the ONNX operators work - just the addition of meta-information that helps people producing backends better map onto our optimized operators.
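
A rough sketch of what that would enable, assuming hypothetical per-tensor min/max statistics recorded in the model (the field and helper names are illustrative; no such annotation exists in ONNX today): a backend could derive asymmetric int8 parameters for every activation without any calibration data.

  import numpy as np

  def params_from_stats(t_min, t_max, dtype=np.int8):
      # Derive asymmetric quantization parameters from recorded activation min/max statistics.
      info = np.iinfo(dtype)
      t_min, t_max = min(t_min, 0.0), max(t_max, 0.0)   # keep 0.0 exactly representable
      scale = max((t_max - t_min) / (info.max - info.min), 1e-8)
      offset = int(round(info.min - t_min / scale))
      return scale, offset

  # e.g. params_from_stats(-0.8, 2.4) -> roughly (0.0125, -64)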

0 reactions
stale[bot] commented, Mar 28, 2022

Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
