Performance can be improved by setting input strides = output strides for Clip in DirectMLX
I am investigating the performance of MobileNet V2 from TFLite models ("nhwc" layout) and MobileNet V2 from ONNX models ("nchw" layout) on an implementation built with the DirectML and DirectMLX APIs.
I find that the nhwc MobileNetV2 model has many Clip operators following Conv2d, and these Clips cost a lot of inference time. My guess is that each Clip performs a memory copy and is not optimized away at the compilation stage.
I have a workaround for this problem: set Clip's output strides to be the same as its input strides by changing this line in DirectMLX.h to TensorDesc outputTensor = inputTensor. With that change, the Clip is optimized as if it were fused into the Conv2d, and the inference time drops significantly, becoming the same as the nchw MobileNetV2's.
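For reference, here is a minimal before/after sketch of that one-line change (paraphrased from the element-wise builder pattern in DirectMLX.h; the exact surrounding code may differ):

```cpp
// Inside the Clip expression builder in DirectMLX.h (paraphrased sketch).
TensorDesc inputTensor = input.Impl()->GetOutputDesc();

// Before: the output desc is rebuilt from the graph's tensor policy, which can
// assign default packed strides even when the input carries nhwc strides,
// forcing DirectML to materialize a layout copy for the Clip:
//
//   TensorDesc outputTensor(inputTensor.dataType, inputTensor.sizes,
//                           builder->GetTensorPolicy());

// After (proposed): copy the input desc wholesale so the output strides match
// the input strides, and Clip stays a pure element-wise op in the same layout:
TensorDesc outputTensor = inputTensor;
```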
When building the nhwc MobileNetV2 model, we need to append an Identity after each Conv2d to transpose the output tensor from the default nchw layout to nhwc, and then transpose that tensor from nhwc back to nchw as the next Conv2d's input. I suppose the Identity and Reinterpret could be optimized away by DML in a pattern like:
Conv0->Identity(nchw->nhwc)->Reinterpret strides(nhwc->nchw)->Conv1
just like the transpose sinking in the OpenVINO backend.
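For concreteness, here is a rough DirectMLX sketch of that Conv0->Identity->Reinterpret->Conv1 round-trip (all shapes, stride values, and variable names are illustrative assumptions, not the actual model-builder code):

```cpp
// Assumed 4D activation with logical nchw sizes {N, C, H, W}; `input`,
// `filter0`, and `filter1` are placeholder expressions.
dml::Expression conv0 = dml::Convolution(input, filter0);  // packed nchw output

auto s = conv0.GetOutputDesc().sizes;
uint32_t N = s[0], C = s[1], H = s[2], W = s[3];

// View the packed nchw buffer in nhwc order, then let Identity copy it into a
// packed nhwc buffer: this is the "Identity(nchw->nhwc)" transpose step.
dml::Expression nhwcView = dml::Reinterpret(
    conv0, { N, H, W, C }, dml::TensorDimensions{ C * H * W, W, 1, H * W });
dml::Expression nhwcOut = dml::Identity(nhwcView);

// Re-view the packed nhwc buffer with nchw sizes and nhwc strides so the next
// Conv2d can consume it: the "Reinterpret strides(nhwc->nchw)" step.
dml::Expression nchwView = dml::Reinterpret(
    nhwcOut, { N, C, H, W }, dml::TensorDimensions{ H * W * C, 1, W * C, C });
dml::Expression conv1 = dml::Convolution(nchwView, filter1);
```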
I guess that this Identity and Reinterpret sinking may be blocked when a Clip sits in between, as in:
Conv0->Identity(nchw->nhwc)->Clip->Reinterpret strides(nhwc->nchw)->Conv1
I verified that if I remove the Identity and run
Conv0->Reinterpret strides(nchw->nhwc)->Clip(input strides = output strides)->Reinterpret strides(nhwc->nchw)->Conv1
instead, the inference time is much lower than before.
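Sketched in the same illustrative terms (assumed names carried over from the sketch above; the 0-6 range stands in for MobileNet's ReLU6-style clamp), the faster variant looks like:

```cpp
// Same placeholders as above. No Identity, so no transpose copy is created;
// everything below is a strided view over conv0's packed nchw buffer.
dml::Expression nhwcView = dml::Reinterpret(
    conv0, { N, H, W, C }, dml::TensorDimensions{ C * H * W, W, 1, H * W });

// With the proposed fix, Clip's output desc equals its input desc (matching
// strides), so it remains a zero-copy element-wise op over the view.
dml::Expression clipped = dml::Clip(nhwcView, 0.0f, 6.0f);

// Undo the view for the next Conv2d: same buffer, back to packed nchw strides.
dml::Expression conv1 = dml::Convolution(
    dml::Reinterpret(clipped, { N, C, H, W },
                     dml::TensorDimensions{ C * H * W, H * W, W, 1 }),
    filter1);
```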
So in conclusion, I suggest setting Clip's output strides to be the same as its input strides by changing this line in DirectMLX.h to TensorDesc outputTensor = inputTensor.
Top GitHub Comments
Yes, close it, thanks! @fdwr @adtsai
Thanks for your detailed explanation and suggestions, very helpful! @adtsai