ONNX tests failing on master

🐛 Bug

It seems that the ONNX tests are failing today on the latest master, and the problem is probably related to changes upstream.

This was originally spotted on an unrelated PR but, to confirm, we reran the tests on the previous day's passing master and they failed with the following errors:

======================================================================
ERROR: test_faster_rcnn (__main__.ONNXExporterTester)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test/test_onnx.py", line 376, in test_faster_rcnn
    tolerate_small_mismatch=True)
  File "test/test_onnx.py", line 53, in run_model
    self.ort_validate(onnx_io, test_inputs, test_ouputs, tolerate_small_mismatch)
  File "test/test_onnx.py", line 72, in ort_validate
    ort_outs = ort_session.run(None, ort_inputs)
  File "/home/circleci/.local/lib/python3.7/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 124, in run
    return self._sess.run(output_names, input_feed, run_options)
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running ReduceMax node. Name:'ReduceMax_1833' Status Message: /onnxruntime_src/onnxruntime/core/providers/cpu/reduction/reduction_ops.cc:487 void onnxruntime::CommonReduce(onnxruntime::OpKernelContext*, std::vector<long int>, int64_t, onnxruntime::ResultsNoTransposePrepareForReduce&, bool) [with T = float; AGG = onnxruntime::ReduceAggregatorMax<float, float>; int64_t = long int] keepdims_ was false. Can't reduce on dim with value of 0 if 'keepdims' is false. Invalid output shape would be produced. input_shape:{0,4}


======================================================================
ERROR: test_keypoint_rcnn (__main__.ONNXExporterTester)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test/test_onnx.py", line 477, in test_keypoint_rcnn
    tolerate_small_mismatch=True)
  File "test/test_onnx.py", line 53, in run_model
    self.ort_validate(onnx_io, test_inputs, test_ouputs, tolerate_small_mismatch)
  File "test/test_onnx.py", line 72, in ort_validate
    ort_outs = ort_session.run(None, ort_inputs)
  File "/home/circleci/.local/lib/python3.7/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 124, in run
    return self._sess.run(output_names, input_feed, run_options)
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running ReduceMax node. Name:'ReduceMax_1833' Status Message: /onnxruntime_src/onnxruntime/core/providers/cpu/reduction/reduction_ops.cc:487 void onnxruntime::CommonReduce(onnxruntime::OpKernelContext*, std::vector<long int>, int64_t, onnxruntime::ResultsNoTransposePrepareForReduce&, bool) [with T = float; AGG = onnxruntime::ReduceAggregatorMax<float, float>; int64_t = long int] keepdims_ was false. Can't reduce on dim with value of 0 if 'keepdims' is false. Invalid output shape would be produced. input_shape:{0,4}


======================================================================
ERROR: test_mask_rcnn (__main__.ONNXExporterTester)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test/test_onnx.py", line 429, in test_mask_rcnn
    tolerate_small_mismatch=True)
  File "test/test_onnx.py", line 53, in run_model
    self.ort_validate(onnx_io, test_inputs, test_ouputs, tolerate_small_mismatch)
  File "test/test_onnx.py", line 72, in ort_validate
    ort_outs = ort_session.run(None, ort_inputs)
  File "/home/circleci/.local/lib/python3.7/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 124, in run
    return self._sess.run(output_names, input_feed, run_options)
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running ReduceMax node. Name:'ReduceMax_1833' Status Message: /onnxruntime_src/onnxruntime/core/providers/cpu/reduction/reduction_ops.cc:487 void onnxruntime::CommonReduce(onnxruntime::OpKernelContext*, std::vector<long int>, int64_t, onnxruntime::ResultsNoTransposePrepareForReduce&, bool) [with T = float; AGG = onnxruntime::ReduceAggregatorMax<float, float>; int64_t = long int] keepdims_ was false. Can't reduce on dim with value of 0 if 'keepdims' is false. Invalid output shape would be produced. input_shape:{0,4}
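For context, all three traces show the same failure mode: an ONNX ReduceMax node with keepdims=0 receives an empty tensor (input_shape {0,4}, i.e. zero predicted boxes), which onnxruntime rejects. The sketch below is a hypothetical minimal repro, not taken from the issue; the module, input names, and shapes are illustrative assumptions:

```python
# Hypothetical minimal repro (illustrative, not from the issue): an op that
# lowers to an ONNX ReduceMax node with keepdims=0 over a dimension that can
# be empty at runtime.
import io

import torch
import onnxruntime


class MaxOverBoxes(torch.nn.Module):
    def forward(self, boxes):
        # torch.max(..., dim=0) exports its values via ReduceMax with keepdims=0
        return boxes.max(dim=0).values


f = io.BytesIO()
# Export succeeds because tracing uses a non-empty input; the failure only
# surfaces at runtime when the dynamic dimension is 0.
torch.onnx.export(
    MaxOverBoxes(), torch.rand(3, 4), f,
    input_names=["boxes"],
    dynamic_axes={"boxes": {0: "num_boxes"}},  # allow an empty first dim
    opset_version=11,
)

session = onnxruntime.InferenceSession(f.getvalue(),
                                       providers=["CPUExecutionProvider"])
# Feeding a [0, 4] tensor, as happens when a detection head predicts no boxes,
# raises the same RUNTIME_EXCEPTION as in the traces above:
# "Can't reduce on dim with value of 0 if 'keepdims' is false."
session.run(None, {"boxes": torch.zeros(0, 4).numpy()})
```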

cc @neginraoof, @spandantiwari, @jiafatom

Top GitHub Comments

jiafatom commented on Jan 15, 2021 (2 reactions)

@datumbox I have a PR to fix this issue upstream: https://github.com/pytorch/pytorch/pull/50582. I imported the above three torchvision tests into the pytorch test suite; they pass locally, and the torchvision tests look "mostly" good (see Detail [A] below).

The PR needs some time to get merged. Under the current policy with Facebook, we merge to the pytorch branch when we have ~10 PRs in a batch, so we estimate this PR may be merged in around 10-14 days. That means torchvision test_onnx will still be red during this time. Do you have any comments on this? Thanks.

Detail [A]: When I ran the torchvision tests against this PR, test_faster_rcnn and test_mask_rcnn passed, but test_keypoint_rcnn failed on a single data point out of 561:
With rtol=0.001 and atol=1e-05, found 1 element(s) (out of 561) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 2.6005320250988007e-05 (-0.014647360891103745 vs. -0.014621355570852757), which occurred at index (29, 4).

The observed difference corresponds to roughly rtol=0.0017 and atol=2.7e-5, slightly larger than the bound of rtol=0.001 and atol=1e-05 (see the sketch below). I feel this is acceptable; we can relax the error bar to unblock the torchvision unit tests. Further analysis is a separate issue.
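To make the numbers concrete, here is a small sketch of the closeness check involved. The actual helper lives in test/test_onnx.py; assuming it uses the standard allclose-style margin, |expected - actual| <= atol + rtol * |actual|, the single outlier above fails the current bound but passes a slightly relaxed one:

```python
# Illustrative arithmetic only; the two values are taken from the failure
# message above.
expected = -0.014647360891103745  # torch output at index (29, 4)
actual = -0.014621355570852757    # onnxruntime output

diff = abs(expected - actual)  # ~2.60e-05


def margin(rtol, atol):
    # standard allclose-style margin: atol + rtol * |actual|
    return atol + rtol * abs(actual)


print(diff <= margin(rtol=1e-03, atol=1e-05))  # False: the current bound fails
print(diff <= margin(rtol=1e-03, atol=3e-05))  # True: a slightly relaxed atol passes
```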

datumbox commented on Jan 18, 2021 (1 reaction)

@jiafatom Thanks for looking into it.

We are currently completing the work of including FasterRCNN with a MobileNetV3 backbone (#3253). Given that this bug affects the tests of the *rcnn models, it is hard to confirm that the new model will be ONNX compatible. I wonder if your team could land the PR sooner as an exception for this use case?
