Performance can be improved by setting input strides = output strides for Clip in DirectMLX
I am investigating the performance of MobileNet V2 from TFLite models ("nhwc" layout) and MobileNet V2 from ONNX models ("nchw" layout) on an implementation built with the DirectML and DirectMLX APIs.
I find that the nhwc MobileNetV2 model has many Clip operators following Conv2d, and these Clips cost a lot of inference time. My guess is that each Clip performs a memory copy and is not optimized away at the compilation stage.
I have a workaround for this problem: set Clip's output strides to be the same as its input strides by changing this line in DirectMLX.h to TensorDesc outputTensor = inputTensor. With that change, the Clip is optimized as if it were fused into the Conv2d, and the inference time drops significantly, becoming the same as the nchw MobileNetV2's.
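For reference, here is a minimal before/after sketch of that one-line change (paraphrased from the element-wise builder pattern in DirectMLX.h; the exact surrounding code may differ):

```cpp
// Inside the Clip expression builder in DirectMLX.h (paraphrased sketch).
TensorDesc inputTensor = input.Impl()->GetOutputDesc();

// Before: the output desc is rebuilt from the graph's tensor policy, which can
// assign default packed strides even when the input carries nhwc strides,
// forcing DirectML to materialize a layout copy for the Clip:
//
//   TensorDesc outputTensor(inputTensor.dataType, inputTensor.sizes,
//                           builder->GetTensorPolicy());

// After (proposed): copy the input desc wholesale so the output strides match
// the input strides, and Clip stays a pure element-wise op in the same layout:
TensorDesc outputTensor = inputTensor;
```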
When building the nhwc MobileNetV2 model, we need to append an Identity after each Conv2d to transpose the output tensor from the default nchw layout to nhwc, and then transpose that tensor from nhwc back to nchw as the next Conv2d's input. I suppose the Identity and Reinterpret could be optimized away by DML in a pattern like:
Conv0->Identity(nchw->nhwc)->Reinterpret strides(nhwc->nchw)->Conv1
just like the transpose sinking in the OpenVINO backend.
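For concreteness, here is a rough DirectMLX sketch of that Conv0->Identity->Reinterpret->Conv1 round-trip (all shapes, stride values, and variable names are illustrative assumptions, not the actual model-builder code):

```cpp
// Assumed 4D activation with logical nchw sizes {N, C, H, W}; `input`,
// `filter0`, and `filter1` are placeholder expressions.
dml::Expression conv0 = dml::Convolution(input, filter0);  // packed nchw output

auto s = conv0.GetOutputDesc().sizes;
uint32_t N = s[0], C = s[1], H = s[2], W = s[3];

// View the packed nchw buffer in nhwc order, then let Identity copy it into a
// packed nhwc buffer: this is the "Identity(nchw->nhwc)" transpose step.
dml::Expression nhwcView = dml::Reinterpret(
    conv0, { N, H, W, C }, dml::TensorDimensions{ C * H * W, W, 1, H * W });
dml::Expression nhwcOut = dml::Identity(nhwcView);

// Re-view the packed nhwc buffer with nchw sizes and nhwc strides so the next
// Conv2d can consume it: the "Reinterpret strides(nhwc->nchw)" step.
dml::Expression nchwView = dml::Reinterpret(
    nhwcOut, { N, C, H, W }, dml::TensorDimensions{ H * W * C, 1, W * C, C });
dml::Expression conv1 = dml::Convolution(nchwView, filter1);
```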
I guess that this Identity and Reinterpret sinking may be blocked when a Clip sits in between, as in:
Conv0->Identity(nchw->nhwc)->Clip->Reinterpret strides(nhwc->nchw)->Conv1
I verified that if I remove the Identity and run
Conv0->Reinterpret strides(nchw->nhwc)->Clip(input strides = output strides)->Reinterpret strides(nhwc->nchw)->Conv1
instead, the inference time is much lower than before.
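Sketched in the same illustrative terms (assumed names carried over from the sketch above; the 0-6 range stands in for MobileNet's ReLU6-style clamp), the faster variant looks like:

```cpp
// Same placeholders as above. No Identity, so no transpose copy is created;
// everything below is a strided view over conv0's packed nchw buffer.
dml::Expression nhwcView = dml::Reinterpret(
    conv0, { N, H, W, C }, dml::TensorDimensions{ C * H * W, W, 1, H * W });

// With the proposed fix, Clip's output desc equals its input desc (matching
// strides), so it remains a zero-copy element-wise op over the view.
dml::Expression clipped = dml::Clip(nhwcView, 0.0f, 6.0f);

// Undo the view for the next Conv2d: same buffer, back to packed nchw strides.
dml::Expression conv1 = dml::Convolution(
    dml::Reinterpret(clipped, { N, C, H, W },
                     dml::TensorDimensions{ C * H * W, H * W, W, 1 }),
    filter1);
```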
So in conclusion, I suggest setting Clip's output strides to be the same as its input strides by changing this line in DirectMLX.h to TensorDesc outputTensor = inputTensor.
Top GitHub Comments
Yes, close it, thanks! @fdwr @adtsai
Thanks for your detailed explanation and suggestions, very helpful! @adtsai