Encoding BFLOAT16 Constant to ONNX Fails


Bug Report

Is the issue related to model conversion?

no

Describe the bug

BFLOAT16 constants are encoded incorrectly when creating tensor initialization data via ONNX Python support.

This feature was added in v1.11.0 so you need at least that version to reproduce the problem.

I’ve been able to trace the problem to the following code in helper.py. In make_tensor there is this code:

        # floa16/bfloat16 are stored as uint16
        elif (data_type == TensorProto.FLOAT16
                or data_type == TensorProto.BFLOAT16):
            vals = np.array(vals).astype(np.float16).view(dtype=np.uint16).flatten().tolist()

Notice that the code converts the value to np.float16 before writing it out; this is wrong for BFLOAT16, which does not share FLOAT16’s bit layout (BFLOAT16 keeps float32’s 8-bit exponent with a 7-bit mantissa). The result is that this code writes out data in FLOAT16 format when the declared type is BFLOAT16.
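A quick NumPy check (a minimal sketch, not part of the original report) makes the mismatch visible: the float16 bit pattern of 1.0 is what the broken initializer ends up containing, while the bfloat16 bit pattern is what it should contain.

import numpy as np

# Buggy path: 1.0 routed through float16 gives 0x3C00 (15360),
# which matches the incorrect initializer data shown below.
f16_bits = np.array([1.0], dtype=np.float16).view(np.uint16)[0]

# bfloat16 keeps float32's 8-bit exponent and truncates the mantissa,
# so its bit pattern for 1.0 is the upper half of 0x3F800000, i.e. 0x3F80 (16256).
bf16_bits = np.array([1.0], dtype=np.float32).view(np.uint32)[0] >> 16

print(hex(f16_bits), hex(bf16_bits))  # 0x3c00 0x3f80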

System information

  • OS Platform and Distribution (e.g. Linux Ubuntu 16.04): N/A
  • ONNX version (e.g. 1.7): 1.11.0
  • Python version: N/A
  • GCC/Compiler version (if compiling from source): N/A
  • CMake version: N/A
  • Protobuf version: N/A
  • Visual Studio version (if applicable): N/A

Reproduction instructions

The following Python program reproduces the error when run with ONNX v1.11.0.

The program prints the incorrect initialization data and writes out the bad graph to disk (as test.onnx).

import onnx
from onnx import helper, TensorProto

type =  TensorProto.BFLOAT16
dims = [2,2]
input = helper.make_tensor_value_info('input', type, dims, False)

c0 = helper.make_tensor_value_info('c0', type, dims, False)
initializer = helper.make_tensor(c0.name, type, dims, [1.0, 2.0, 3.0, 4.0])

# dims: 2              
# dims: 2              
# data_type: 16        
# int32_data: 15360    | CORRECT_VALUE: 16256 (0x3F80)
# int32_data: 16384    | CORRECT_VALUE: 16384 (0x4000)
# int32_data: 16896    | CORRECT_VALUE: 16448 (0x4040)
# int32_data: 17408    | CORRECT_VALUE: 16512 (0x4080)
# name: "c0"           
print(initializer)

output = helper.make_tensor_value_info('output', type, dims)
add = helper.make_node('Add', [input.name, c0.name], [output.name], 'Add')

graph = helper.make_graph([add], 'test', [input], [output], initializer=[initializer])
model = helper.make_model(graph)

onnx.save(model, 'test.onnx')

Expected behavior

BFLOAT16 should be encoded correctly and the program should print:

dims: 2              
dims: 2              
data_type: 16        
int32_data: 16256
int32_data: 16384
int32_data: 16448
int32_data: 16512
name: "c0"           

Notes

There was no FLOAT16 or BFLOAT16 support in make_tensor prior to v1.11.0. Support for both was added in v1.11.0, but the BFLOAT16 encoding has a bug.


Top GitHub Comments

1 reaction
manbearian commented, May 9, 2022

Back after disappearing last week while I closed the issue on my end that exposed this ONNX issue.

@jcwchen, I think it’s possible to create a BFLOAT16 constant here; I’m optimistic that I will have time this week to verify and code it up. The tricky part will probably be getting the rounding correct.

If this drags on, if others hit this issue, or if you have a new release imminent that can take a change here, I suggest issuing an error on BFLOAT16 and recommending folks use the raw encoding feature to write out BFLOAT16 (that was the workaround I used on my side).
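For illustration, a minimal sketch of that raw-encoding workaround, assuming make_tensor’s raw=True path (which stores the supplied bytes verbatim as raw_data) and the same truncation-based conversion as above:

import numpy as np
from onnx import helper, TensorProto

# Encode the bfloat16 bit patterns ourselves (upper 16 bits of each float32)
# and pass them as raw little-endian bytes, bypassing the buggy
# float16 conversion path in make_tensor.
vals = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)
bf16_bits = (vals.view(np.uint32) >> 16).astype('<u2')  # little-endian uint16

initializer = helper.make_tensor('c0', TensorProto.BFLOAT16, [2, 2],
                                 bf16_bits.tobytes(), raw=True)
print(initializer)  # raw_data now holds the little-endian uint16 bit patterns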

0 reactions
manbearian commented, May 10, 2022

A possible fix has been published here: https://github.com/onnx/onnx/pull/4193
