Libnd4j: conv2d op (and MKL-DNN-enabled conv2d) slower than DL4J implementation
Edit: Windows 10, 8-core Intel i7-5960X CPU
Here’s a simple benchmark comparing the DL4J ConvolutionLayer implementation (no MKL-DNN) against the conv2d custom op, with and without MKL-DNN enabled: https://gist.github.com/AlexDBlack/31d2d2ce5fdb04e6dcbc0b80e6187f88 (a minimal sketch of the timing approach is shown after the results below).
Updated 28/02/19:
DL4J ConvolutionLayer: average 7.79 ms
conv2d op (use mkl=false): average 19.04 ms
conv2d op (use mkl=true): average 689.96 ms
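For reference, a minimal sketch of what such a custom-op timing loop can look like. The shapes, run counts, and the `Conv2dBench` class name are illustrative rather than the gist’s exact configuration; the integer-argument order (kernel, stride, padding, dilation, same-mode flag, data-format flag) follows my understanding of the libnd4j conv2d op and should be checked against the gist:

```java
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.api.ops.DynamicCustomOp;
import org.nd4j.linalg.factory.Nd4j;

public class Conv2dBench {
    public static void main(String[] args) {
        int mb = 32, iC = 3, oC = 16, h = 64, w = 64, kH = 3, kW = 3;
        INDArray in  = Nd4j.rand(new int[]{mb, iC, h, w});     // NCHW input
        INDArray wgt = Nd4j.rand(new int[]{kH, kW, iC, oC});   // conv2d weight layout: [kH, kW, iC, oC]
        INDArray out = Nd4j.create(mb, oC, h, w);              // same-mode, stride 1 -> unchanged spatial dims

        DynamicCustomOp op = DynamicCustomOp.builder("conv2d")
                .addInputs(in, wgt)
                .addOutputs(out)
                // iArgs: kH, kW, sH, sW, pH, pW, dH, dW, sameMode=1, 0=NCHW
                .addIntegerArguments(kH, kW, 1, 1, 0, 0, 1, 1, 1, 0)
                .build();

        for (int i = 0; i < 20; i++) Nd4j.exec(op);            // warm-up before timing
        int runs = 100;
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) Nd4j.exec(op);
        System.out.printf("conv2d average: %.2f ms%n", (System.nanoTime() - start) / 1e6 / runs);
    }
}
```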
Edit (updated 28/02/19): subsampling/pooling test + results: https://gist.github.com/AlexDBlack/b1f5f32e80b631321fe9936814fd8534 (a maxpool2d sketch follows the results below)
max pooling
DL4J SubsamplingLayer: average 1.1 ms
maxpool2d op (use mkl=false): average 1.09 ms
maxpool2d op (use mkl=true): average 16.86 ms
-----------------
avg pooling
DL4J SubsamplingLayer: average 0.85 ms
avgpool2d op (use mkl=false): average 3.43 ms
avgpool2d op (use mkl=true): average 14.37 ms
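A maxpool2d call can be timed the same way as the conv2d sketch above; pooling takes no weights, and the integer-argument layout below (kernel, stride, padding, dilation, same-mode flag, extraParam0, data-format flag) is my reading of the libnd4j pooling ops, so it should be verified against the gist:

```java
// Reuse the in/mb/iC/h/w variables from the conv2d sketch above.
INDArray pooled = Nd4j.create(mb, iC, h / 2, w / 2);   // 2x2 kernel, stride 2, no padding
DynamicCustomOp pool = DynamicCustomOp.builder("maxpool2d")
        .addInputs(in)
        .addOutputs(pooled)
        // iArgs: kH, kW, sH, sW, pH, pW, dH, dW, sameMode=0, extraParam0, 0=NCHW
        .addIntegerArguments(2, 2, 2, 2, 0, 0, 1, 1, 0, 0, 0)
        .build();
Nd4j.exec(pool);
```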
Batch norm forward pass test + results: https://gist.github.com/AlexDBlack/e46cf50de14252ac0d43e7a813d6a045
Updated 28/02/19: batchnorm is now faster for both DL4J and libnd4j than in earlier runs.
DL4J BatchNormalization: average 4.97 ms
batchnorm_new op (use mkl=false): average 2.33 ms
batchnorm_new op (use mkl=true): average 2.83 ms
Edit 28/02/19: LRN results https://gist.github.com/AlexDBlack/88ab2529a73166b9955c28e8f83a61ef
DL4J LRN: average 34.52 ms
lrn op (use mkl=false): average 14.67 ms
lrn op (use mkl=true): average 5.08 ms
27/02/19: DL4J LSTM vs. lstmBlock op (note: no MKL-DNN support yet):
DL4J LSTM layer: average 11.33 ms
lstmBlock op: average 6.53 ms
https://gist.github.com/AlexDBlack/8d01ee6e9c42ecf8fd8f988d16698bf6
28/02/19: Softmax op:
Legacy softmax op: average 2.225 ms
softmax custom op: average 0.321 ms
https://gist.github.com/AlexDBlack/88a02e91a9b8e9e93f8da5ce0901d3f6
I found what the problem was. We need to add reorder() operations manually to MKL-DNN streams; otherwise MKL-DNN falls back to reference (non-JIT) implementations for the other operations. For now, conv2d and conv2d_bp are done in sa_mkldnn, and I get this kind of output with https://gist.github.com/AlexDBlack/31d2d2ce5fdb04e6dcbc0b80e6187f88:

It appears, though, that the `nd4j::graph::Context` still gets recreated on each call to `Nd4j.exec(op)`. If I hack it to use only one static stream, I get values below 2.5 ms.

To make sure we’re executing with JIT, we can set the `MKLDNN_VERBOSE` environment variable to 1 and look for messages containing “jit”; if we see “ref” instead, those operations will be slow.
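One way to run with the variable set from Java itself is to relaunch the benchmark in a child JVM (a sketch; `Conv2dBench` is the hypothetical benchmark class above):

```java
public class VerboseRunner {
    public static void main(String[] args) throws Exception {
        // Relaunch the benchmark with MKL-DNN verbose logging enabled.
        ProcessBuilder pb = new ProcessBuilder(
                "java", "-cp", System.getProperty("java.class.path"), "Conv2dBench");
        pb.environment().put("MKLDNN_VERBOSE", "1"); // one log line per primitive execution
        pb.inheritIO();                              // forward the child's stdout/stderr
        pb.start().waitFor();
    }
}
```

In the verbose output, the implementation field of each line names the kernel that actually ran; entries tagged jit:* indicate JIT kernels, while ref:* indicates the slow reference path.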