Encoding BFLOAT16 Constant to ONNX Fails


Bug Report

Is the issue related to model conversion?

no

Describe the bug

BFLOAT16 constants are encoded incorrectly when creating tensor initialization data via ONNX Python support.

This feature was added in v1.11.0 so you need at least that version to reproduce the problem.

I’ve been able to trace the problem to the following code in helper.py. In make_tensor there is this code:

        # floa16/bfloat16 are stored as uint16
        elif (data_type == TensorProto.FLOAT16
                or data_type == TensorProto.BFLOAT16):
            vals = np.array(vals).astype(np.float16).view(dtype=np.uint16).flatten().tolist()

Notice that the code converts the value to np.float16 before writing it out; this is wrong for BFLOAT16, which does not share FLOAT16’s bit layout (BFLOAT16 keeps float32’s 8-bit exponent with a 7-bit mantissa). The result is that this code writes out data in FLOAT16 format when the declared type is BFLOAT16.
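A quick NumPy check (a minimal sketch, not part of the original report) makes the mismatch visible: the float16 bit pattern of 1.0 is what the broken initializer ends up containing, while the bfloat16 bit pattern is what it should contain.

import numpy as np

# Buggy path: 1.0 routed through float16 gives 0x3C00 (15360),
# which matches the incorrect initializer data shown below.
f16_bits = np.array([1.0], dtype=np.float16).view(np.uint16)[0]

# bfloat16 keeps float32's 8-bit exponent and truncates the mantissa,
# so its bit pattern for 1.0 is the upper half of 0x3F800000, i.e. 0x3F80 (16256).
bf16_bits = np.array([1.0], dtype=np.float32).view(np.uint32)[0] >> 16

print(hex(f16_bits), hex(bf16_bits))  # 0x3c00 0x3f80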

System information

  • OS Platform and Distribution (e.g. Linux Ubuntu 16.04): N/A
  • ONNX version (e.g. 1.7): 1.11.0
  • Python version: N/A
  • GCC/Compiler version (if compiling from source): N/A
  • CMake version: N/A
  • Protobuf version: N/A
  • Visual Studio version (if applicable): N/A

Reproduction instructions

The following Python program reproduces the error when run with ONNX v1.11.0.

The program prints the incorrect initialization data and writes out the bad graph to disk (as test.onnx).

import onnx
from onnx import helper, TensorProto

type =  TensorProto.BFLOAT16
dims = [2,2]
input = helper.make_tensor_value_info('input', type, dims, False)

c0 = helper.make_tensor_value_info('c0', type, dims, False)
initializer = helper.make_tensor(c0.name, type, dims, [1.0, 2.0, 3.0, 4.0])

# dims: 2              
# dims: 2              
# data_type: 16        
# int32_data: 15360    | CORRECT_VALUE: 16256 (0x3F80)
# int32_data: 16384    | CORRECT_VALUE: 16384 (0x4000)
# int32_data: 16896    | CORRECT_VALUE: 16448 (0x4040)
# int32_data: 17408    | CORRECT_VALUE: 16512 (0x4080)
# name: "c0"           
print(initializer)

output = helper.make_tensor_value_info('output', type, dims)
add = helper.make_node('Add', [input.name, c0.name], [output.name], 'Add')

graph = helper.make_graph([add], 'test', [input], [output], initializer=[initializer])
model = helper.make_model(graph)

onnx.save(model, 'test.onnx')

Expected behavior

BFLOAT16 should be encoded correctly and the program should print:

dims: 2              
dims: 2              
data_type: 16        
int32_data: 16256
int32_data: 16384
int32_data: 16448
int32_data: 16512
name: "c0"           

Notes

There was no FLOAT16 or BFLOAT16 support in make_tensor prior to v1.11.0. Support for both was added in v1.11.0, but the BFLOAT16 encoding has a bug.


Top GitHub Comments

1 reaction
manbearian commented, May 9, 2022

Back after disappearing last week while I closed the issue on my end that exposed this ONNX issue.

@jcwchen, I think it’s possible to create a BFLOAT16 constant here; I’m optimistic that I will have time this week to verify and code it up. The tricky part will probably be getting the rounding correct.

If this drags on, if others hit this issue, or if you have a new release imminent that can take a change here, I suggest issuing an error on BFLOAT16 and recommending folks use the raw encoding feature to write out BFLOAT16 (that was the workaround I used on my side).
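For illustration, a minimal sketch of that raw-encoding workaround, assuming make_tensor’s raw=True path (which stores the supplied bytes verbatim as raw_data) and the same truncation-based conversion as above:

import numpy as np
from onnx import helper, TensorProto

# Encode the bfloat16 bit patterns ourselves (upper 16 bits of each float32)
# and pass them as raw little-endian bytes, bypassing the buggy
# float16 conversion path in make_tensor.
vals = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)
bf16_bits = (vals.view(np.uint32) >> 16).astype('<u2')  # little-endian uint16

initializer = helper.make_tensor('c0', TensorProto.BFLOAT16, [2, 2],
                                 bf16_bits.tobytes(), raw=True)
print(initializer)  # raw_data now holds the little-endian uint16 bit patterns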

0 reactions
manbearian commented, May 10, 2022

A possible fix has been published here: https://github.com/onnx/onnx/pull/4193
