
Libnd4j: conv2d op (and MKL-DNN-enabled conv2d) slower than DL4J implementation


Edit: Windows 10, 8-core 5960X CPU

Here’s a simple benchmark comparing DL4J ConvolutionLayer implementation (no MKL-DNN) with conv2d op - with and without MKL-DNN enabled: https://gist.github.com/AlexDBlack/31d2d2ce5fdb04e6dcbc0b80e6187f88

Updated 28/02/19:

DL4J ConvolutionLayer: average 7.79 ms
conv2d op (use mkl=false): average 19.04 ms
conv2d op (use mkl=true): average 689.96 ms
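
For reference, a minimal sketch of the benchmark pattern used here (not the gist itself): one DL4J ConvolutionLayer forward pass timed against the conv2d custom op invoked through the generic DynamicCustomOp builder. The class name, shapes, kernel size and iteration counts are illustrative, and the integer-argument order for conv2d (kH, kW, sH, sW, pH, pW, dH, dW, sameMode, format) is assumed from the libnd4j op declaration - check the gist for the exact configuration benchmarked.

import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
import org.deeplearning4j.nn.conf.inputs.InputType;
import org.deeplearning4j.nn.conf.layers.ConvolutionLayer;
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.api.ops.DynamicCustomOp;
import org.nd4j.linalg.factory.Nd4j;

public class Conv2dBenchSketch {
    public static void main(String[] args) {
        int mb = 8, c = 32, hw = 64, k = 2;
        INDArray input = Nd4j.rand(new int[]{mb, c, hw, hw});     // NCHW input

        // DL4J ConvolutionLayer path (single conv layer, forward pass only)
        MultiLayerNetwork net = new MultiLayerNetwork(new NeuralNetConfiguration.Builder()
                .list()
                .layer(new ConvolutionLayer.Builder().kernelSize(k, k).stride(1, 1).nOut(c).build())
                .setInputType(InputType.convolutional(hw, hw, c))
                .build());
        net.init();

        // conv2d custom op path; weights assumed in [kH, kW, iC, oC] layout
        INDArray w = Nd4j.rand(new int[]{k, k, c, c});
        INDArray b = Nd4j.rand(new int[]{c});
        INDArray out = Nd4j.createUninitialized(new int[]{mb, c, hw - k + 1, hw - k + 1});
        DynamicCustomOp conv = DynamicCustomOp.builder("conv2d")
                .addInputs(input, w, b)
                .addOutputs(out)
                // kH, kW, sH, sW, pH, pW, dH, dW, sameMode (0 = valid), format (0 assumed = NCHW)
                .addIntegerArguments(k, k, 1, 1, 0, 0, 1, 1, 0, 0)
                .build();

        int warmup = 20, iters = 100;
        for (int i = 0; i < warmup; i++) { net.output(input); Nd4j.getExecutioner().exec(conv); }

        long t0 = System.nanoTime();
        for (int i = 0; i < iters; i++) net.output(input);
        long t1 = System.nanoTime();
        for (int i = 0; i < iters; i++) Nd4j.getExecutioner().exec(conv);
        long t2 = System.nanoTime();

        System.out.println("DL4J ConvolutionLayer: average " + (t1 - t0) / 1e6 / iters + " ms");
        System.out.println("conv2d op:             average " + (t2 - t1) / 1e6 / iters + " ms");
    }
}

Whether MKL-DNN is used for the op path depends on the backend build and its configuration, not on anything in this sketch.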


Edit (updated 28/02/19): Subsampling/pooling test + results: https://gist.github.com/AlexDBlack/b1f5f32e80b631321fe9936814fd8534

max pooling
DL4J SubsamplingLayer: average 1.1 ms
maxpool2d op (use mkl=false): average 1.09 ms
maxpool2d op (use mkl=true): average 16.86 ms
-----------------
avg pooling
DL4J SubsamplingLayer: average 0.85 ms
avgpool2d op (use mkl=false): average 3.43 ms
avgpool2d op (use mkl=true): average 14.37 ms
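
The pooling (and batch norm / LRN) tests presumably follow the same pattern, swapping the DL4J layer and the custom op name. A minimal sketch of the DL4J side for max pooling (kernel/stride values are illustrative; the maxpool2d and avgpool2d ops would be invoked through DynamicCustomOp by name, as with conv2d above, with their own integer arguments):

import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
import org.deeplearning4j.nn.conf.inputs.InputType;
import org.deeplearning4j.nn.conf.layers.SubsamplingLayer;
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;

public class PoolingBenchSketch {
    public static void main(String[] args) {
        INDArray input = Nd4j.rand(new int[]{8, 32, 64, 64});

        // Max pooling; use SubsamplingLayer.PoolingType.AVG for the average-pooling comparison
        MultiLayerNetwork net = new MultiLayerNetwork(new NeuralNetConfiguration.Builder()
                .list()
                .layer(new SubsamplingLayer.Builder(SubsamplingLayer.PoolingType.MAX)
                        .kernelSize(2, 2).stride(2, 2).build())
                .setInputType(InputType.convolutional(64, 64, 32))
                .build());
        net.init();

        for (int i = 0; i < 20; i++) net.output(input);           // warm-up
        int iters = 100;
        long t0 = System.nanoTime();
        for (int i = 0; i < iters; i++) net.output(input);
        System.out.println("DL4J SubsamplingLayer: average " + (System.nanoTime() - t0) / 1e6 / iters + " ms");
    }
}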


Batch norm forward pass test + results: https://gist.github.com/AlexDBlack/e46cf50de14252ac0d43e7a813d6a045

Updated 28/02/19: batchnorm is now faster for both DL4J and libnd4j than in the earlier results.

DL4J BatchNormalization: average 4.97 ms
batchnorm_new op (use mkl=false): average 2.33 ms
batchnorm_new op (use mkl=true): average 2.83 ms


Edit 28/02/19: LRN results: https://gist.github.com/AlexDBlack/88ab2529a73166b9955c28e8f83a61ef

DL4J LRN: average 34.52 ms
lrn op (use mkl=false): average 14.67 ms
lrn op (use mkl=true): average 5.08 ms


27/02/19: DL4J LSTM vs. lstmBlock op (note: no MKL-DNN support yet):

DL4J LSTM layer: average 11.33 ms
lstmBlock op: average 6.53 ms

https://gist.github.com/AlexDBlack/8d01ee6e9c42ecf8fd8f988d16698bf6



28/02/19: Softmax op:

Legacy softmax op: average 2.225 ms
softmax custom op: average 0.321 ms

https://gist.github.com/AlexDBlack/88a02e91a9b8e9e93f8da5ce0901d3f6
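
A minimal sketch of how the custom softmax op can be timed against a legacy-path call; the shape, iteration counts, and the use of Transforms.softmax as a stand-in for the legacy softmax op are assumptions - the gist above is authoritative for what was actually benchmarked:

import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.api.ops.DynamicCustomOp;
import org.nd4j.linalg.factory.Nd4j;
import org.nd4j.linalg.ops.transforms.Transforms;

public class SoftmaxBenchSketch {
    public static void main(String[] args) {
        INDArray x = Nd4j.rand(new int[]{32, 1000});              // illustrative shape
        INDArray out = Nd4j.createUninitialized(new int[]{32, 1000});

        // Custom op invoked by name via the generic DynamicCustomOp builder
        // (softmax over the last dimension by default)
        DynamicCustomOp softmax = DynamicCustomOp.builder("softmax")
                .addInputs(x)
                .addOutputs(out)
                .build();

        int iters = 1000;
        for (int i = 0; i < 100; i++) { Transforms.softmax(x); Nd4j.getExecutioner().exec(softmax); }  // warm-up

        long t0 = System.nanoTime();
        for (int i = 0; i < iters; i++) Transforms.softmax(x);    // legacy path (assumed; depends on ND4J version)
        long t1 = System.nanoTime();
        for (int i = 0; i < iters; i++) Nd4j.getExecutioner().exec(softmax);
        long t2 = System.nanoTime();

        System.out.println("Legacy softmax:    average " + (t1 - t0) / 1e6 / iters + " ms");
        System.out.println("softmax custom op: average " + (t2 - t1) / 1e6 / iters + " ms");
    }
}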

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 18 (17 by maintainers)

Top GitHub Comments

2 reactions
saudet commented, Mar 4, 2019

I found what the problem was: we need to add reorder() operations manually to the MKL-DNN streams, or MKL-DNN falls back on reference (non-JIT) implementations of the other operations. For now, conv2d and conv2d_bp are done in sa_mkldnn, and I get this kind of output with https://gist.github.com/AlexDBlack/31d2d2ce5fdb04e6dcbc0b80e6187f88:

DL4J ConvolutionLayer: average 9.05 ms
conv2d op (use mkl=false): average 7.96 ms
conv2d op (use mkl=true): average 3.12 ms

It does appear, though, that the nd4j::graph::Context still gets recreated on each call to Nd4j.exec(op). If I hack it to use only one static stream, I get values below 2.5 ms.

To make sure we’re executing with JIT, we can set the MKLDNN_VERBOSE environment variable to 1. We should then see messages containing “jit”, like these:

mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_nChw8c out:f32_blocked,num:1,8x32x64x64,0.538086
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_blocked out:f32_nChw8c,num:1,8x32x64x64,0.624023
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_blocked out:f32_OIhw8i8o,num:1,32x32x2x2,0.0620117
mkldnn_verbose,exec,convolution,jit:avx2,forward_training,fsrc:nChw8c fwei:OIhw8i8o fbia:x fdst:nChw8c,alg:convolution_direct,mb8_g1ic32oc32_ih64oh64kh2sh1dh0ph0_iw64ow64kw2sw1dw0pw0,1.15381

If we see “ref” instead, those will be slow.

0 reactions
lock[bot] commented, Apr 5, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
