ONNX tests failing on master

🐛 Bug

It seems that the ONNX tests are failing today on the latest master, and the problem is probably related to changes upstream.

This was originally spotted on an unrelated PR but, to confirm, we reran the tests on the previous day's passing master and they failed with the following errors:

======================================================================
ERROR: test_faster_rcnn (__main__.ONNXExporterTester)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test/test_onnx.py", line 376, in test_faster_rcnn
    tolerate_small_mismatch=True)
  File "test/test_onnx.py", line 53, in run_model
    self.ort_validate(onnx_io, test_inputs, test_ouputs, tolerate_small_mismatch)
  File "test/test_onnx.py", line 72, in ort_validate
    ort_outs = ort_session.run(None, ort_inputs)
  File "/home/circleci/.local/lib/python3.7/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 124, in run
    return self._sess.run(output_names, input_feed, run_options)
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running ReduceMax node. Name:'ReduceMax_1833' Status Message: /onnxruntime_src/onnxruntime/core/providers/cpu/reduction/reduction_ops.cc:487 void onnxruntime::CommonReduce(onnxruntime::OpKernelContext*, std::vector<long int>, int64_t, onnxruntime::ResultsNoTransposePrepareForReduce&, bool) [with T = float; AGG = onnxruntime::ReduceAggregatorMax<float, float>; int64_t = long int] keepdims_ was false. Can't reduce on dim with value of 0 if 'keepdims' is false. Invalid output shape would be produced. input_shape:{0,4}


======================================================================
ERROR: test_keypoint_rcnn (__main__.ONNXExporterTester)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test/test_onnx.py", line 477, in test_keypoint_rcnn
    tolerate_small_mismatch=True)
  File "test/test_onnx.py", line 53, in run_model
    self.ort_validate(onnx_io, test_inputs, test_ouputs, tolerate_small_mismatch)
  File "test/test_onnx.py", line 72, in ort_validate
    ort_outs = ort_session.run(None, ort_inputs)
  File "/home/circleci/.local/lib/python3.7/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 124, in run
    return self._sess.run(output_names, input_feed, run_options)
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running ReduceMax node. Name:'ReduceMax_1833' Status Message: /onnxruntime_src/onnxruntime/core/providers/cpu/reduction/reduction_ops.cc:487 void onnxruntime::CommonReduce(onnxruntime::OpKernelContext*, std::vector<long int>, int64_t, onnxruntime::ResultsNoTransposePrepareForReduce&, bool) [with T = float; AGG = onnxruntime::ReduceAggregatorMax<float, float>; int64_t = long int] keepdims_ was false. Can't reduce on dim with value of 0 if 'keepdims' is false. Invalid output shape would be produced. input_shape:{0,4}


======================================================================
ERROR: test_mask_rcnn (__main__.ONNXExporterTester)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test/test_onnx.py", line 429, in test_mask_rcnn
    tolerate_small_mismatch=True)
  File "test/test_onnx.py", line 53, in run_model
    self.ort_validate(onnx_io, test_inputs, test_ouputs, tolerate_small_mismatch)
  File "test/test_onnx.py", line 72, in ort_validate
    ort_outs = ort_session.run(None, ort_inputs)
  File "/home/circleci/.local/lib/python3.7/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 124, in run
    return self._sess.run(output_names, input_feed, run_options)
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running ReduceMax node. Name:'ReduceMax_1833' Status Message: /onnxruntime_src/onnxruntime/core/providers/cpu/reduction/reduction_ops.cc:487 void onnxruntime::CommonReduce(onnxruntime::OpKernelContext*, std::vector<long int>, int64_t, onnxruntime::ResultsNoTransposePrepareForReduce&, bool) [with T = float; AGG = onnxruntime::ReduceAggregatorMax<float, float>; int64_t = long int] keepdims_ was false. Can't reduce on dim with value of 0 if 'keepdims' is false. Invalid output shape would be produced. input_shape:{0,4}
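For context, all three traces show the same failure mode: an ONNX ReduceMax node with keepdims=0 receives an empty tensor (input_shape {0,4}, i.e. zero predicted boxes), which onnxruntime rejects. The sketch below is a hypothetical minimal repro, not taken from the issue; the module, input names, and shapes are illustrative assumptions:

```python
# Hypothetical minimal repro (illustrative, not from the issue): an op that
# lowers to an ONNX ReduceMax node with keepdims=0 over a dimension that can
# be empty at runtime.
import io

import torch
import onnxruntime


class MaxOverBoxes(torch.nn.Module):
    def forward(self, boxes):
        # torch.max(..., dim=0) exports its values via ReduceMax with keepdims=0
        return boxes.max(dim=0).values


f = io.BytesIO()
# Export succeeds because tracing uses a non-empty input; the failure only
# surfaces at runtime when the dynamic dimension is 0.
torch.onnx.export(
    MaxOverBoxes(), torch.rand(3, 4), f,
    input_names=["boxes"],
    dynamic_axes={"boxes": {0: "num_boxes"}},  # allow an empty first dim
    opset_version=11,
)

session = onnxruntime.InferenceSession(f.getvalue(),
                                       providers=["CPUExecutionProvider"])
# Feeding a [0, 4] tensor, as happens when a detection head predicts no boxes,
# raises the same RUNTIME_EXCEPTION as in the traces above:
# "Can't reduce on dim with value of 0 if 'keepdims' is false."
session.run(None, {"boxes": torch.zeros(0, 4).numpy()})
```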

cc @neginraoof, @spandantiwari, @jiafatom

Top GitHub Comments

jiafatom commented on Jan 15, 2021 (2 reactions)

@datumbox I have a PR to fix this issue upstream: https://github.com/pytorch/pytorch/pull/50582. I imported the above three torchvision tests into the pytorch test suite; they pass locally, and the torchvision tests look "mostly" good (see Detail [A] below).

The PR needs some time to get merged. Under the current policy with Facebook, we merge to the pytorch branch when we have ~10 PRs in a batch, so we estimate this PR may be merged in around 10-14 days. That means torchvision test_onnx will still be red during this time. Do you have any comments on this? Thanks.

Detail [A]: When I ran the torchvision tests against this PR, test_faster_rcnn and test_mask_rcnn passed, but test_keypoint_rcnn failed on a single data point out of 561:
With rtol=0.001 and atol=1e-05, found 1 element(s) (out of 561) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 2.6005320250988007e-05 (-0.014647360891103745 vs. -0.014621355570852757), which occurred at index (29, 4).

The observed difference corresponds to roughly rtol=0.0017 and atol=2.7e-5, slightly larger than the bound of rtol=0.001 and atol=1e-05 (see the sketch below). I feel this is acceptable; we can relax the error bar to unblock the torchvision unit tests. Further analysis is a separate issue.
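To make the numbers concrete, here is a small sketch of the closeness check involved. The actual helper lives in test/test_onnx.py; assuming it uses the standard allclose-style margin, |expected - actual| <= atol + rtol * |actual|, the single outlier above fails the current bound but passes a slightly relaxed one:

```python
# Illustrative arithmetic only; the two values are taken from the failure
# message above.
expected = -0.014647360891103745  # torch output at index (29, 4)
actual = -0.014621355570852757    # onnxruntime output

diff = abs(expected - actual)  # ~2.60e-05


def margin(rtol, atol):
    # standard allclose-style margin: atol + rtol * |actual|
    return atol + rtol * abs(actual)


print(diff <= margin(rtol=1e-03, atol=1e-05))  # False: the current bound fails
print(diff <= margin(rtol=1e-03, atol=3e-05))  # True: a slightly relaxed atol passes
```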

datumbox commented on Jan 18, 2021 (1 reaction)

@jiafatom Thanks for looking into it.

We are currently completing the work of including FasterRCNN with a MobileNetV3 backbone (#3253). Given that this bug affects the tests of the *rcnn models, it is hard to confirm that the new model will be ONNX compatible. I wonder if your team could land the PR sooner as an exception for this use case?
