[Performance] GSpMM regression with FP16
I spent some time profiling the GAT example with AMP from https://docs.dgl.ai/en/0.9.x/guide/mixed_precision.html to find out why we don't obtain a performance gain from FP16. I observed a regression in both the forward and backward phases.
Performance on A100
|          | AMP    | FP32   |
| -------- | ------ | ------ |
| forward  | 254 ms | 160 ms |
| backward | 482 ms | 369 ms |
| step     | 738 ms | 529 ms |
Take the forward phase for example: there are three `EdgeSoftmax`-`GSpMM` pairs (profiler timelines for the AMP and FP32 runs omitted).

Elapsed times (ms) of the three `EdgeSoftmax`-`GSpMM` pairs:
|              | AMP    | FP32   |
| ------------ | ------ | ------ |
| EdgeSoftmax1 | 20.896 | 22.055 |
| GSpMM1       | 73.851 | 31.890 |
| EdgeSoftmax2 | 20.865 | 22.080 |
| GSpMM2       | 73.797 | 31.882 |
| EdgeSoftmax3 | 20.869 | 22.154 |
| GSpMM3       | 27.528 | 13.422 |
We can see that `EdgeSoftmax` is a bit faster with AMP, while it is `GSpMM` that causes the regression. Digging a bit deeper: `GSpMM` with FP32 inputs goes through the `cusparse::csrmm` code path, while with FP16 inputs it falls back to `dgl::aten::cuda::SpMMCsrKernel`. I'll look into the FP16 performance issue on the cuSPARSE side.
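For intuition about why the fallback path can be slower, here is a minimal sketch of a scalar CSR SpMM kernel in the style such a fallback typically takes. This is NOT DGL's actual `SpMMCsrKernel`, only an illustration of the pattern: each thread gathers one output element, values are loaded as `__half` but accumulated in FP32, and no Tensor Core instructions are involved, so FP16 brings no math speedup on this path.

```cuda
#include <cuda_fp16.h>

// Naive scalar CSR SpMM: C = A @ B with A sparse (m x k, CSR) and B, C dense
// row-major (k x n and m x n). One thread computes one element of C.
// Illustrative only -- not dgl::aten::cuda::SpMMCsrKernel.
__global__ void naive_csr_spmm_fp16(
    const int* __restrict__ indptr,   // CSR row offsets, length m + 1
    const int* __restrict__ indices,  // CSR column indices, length nnz
    const __half* __restrict__ vals,  // CSR values, length nnz
    const __half* __restrict__ B,     // dense input, k x n
    __half* __restrict__ C,           // dense output, m x n
    int m, int n) {
  int row = blockIdx.y * blockDim.y + threadIdx.y;
  int col = blockIdx.x * blockDim.x + threadIdx.x;
  if (row >= m || col >= n) return;
  float acc = 0.f;  // accumulate in FP32; FP16 is only a storage format here
  for (int e = indptr[row]; e < indptr[row + 1]; ++e) {
    acc += __half2float(vals[e]) * __half2float(B[indices[e] * n + col]);
  }
  C[row * n + col] = __float2half(acc);
}
// Example launch: dim3 block(32, 8); dim3 grid((n + 31) / 32, (m + 7) / 8);
// naive_csr_spmm_fp16<<<grid, block>>>(indptr, indices, vals, B, C, m, n);
```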
ENV
- CUDA_VERSION: 11.7.0.022
- CUDA_DRIVER_VERSION: 515.43.04
- CUSPARSE_VERSION: 11.7.3.50
- GPU: NVIDIA A100 80GB PCIe
- PyTorch: 1.12.0a0+8a1a93a
@yaox12 Thank you for sharing your code. I tested it on my side (A5000, CUDA 11.7), but I observed different results (results omitted).
I also forced the code to use `cusparseSpMM` for FP16. Note that with FP16 inputs, I have to use `CUDA_R_32F` for the computation datatype `computeType` when calling `cusparseSpMM`, according to https://docs.nvidia.com/cuda/cusparse/index.html#cusparse-generic-function-spmm. Still tracking it internally.
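For concreteness, here is a minimal sketch of that forced FP16 path through the cuSPARSE generic API (assumptions: 32-bit indices, row-major dense matrices, device buffers set up by the caller, error checking omitted). The matrix descriptors store `CUDA_R_16F`, while `computeType` is `CUDA_R_32F` and `alpha`/`beta` match it, as the linked docs require:

```cuda
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <cusparse.h>

// Sketch of C = A @ B with FP16 storage via the cuSPARSE generic API.
// A is m x k in CSR (d_indptr/d_indices/d_vals), B is k x n, C is m x n,
// both dense row-major. All pointers are device memory owned by the caller.
void spmm_csr_fp16(cusparseHandle_t handle, int64_t m, int64_t k, int64_t n,
                   int64_t nnz, int* d_indptr, int* d_indices, __half* d_vals,
                   __half* d_B, __half* d_C) {
  cusparseSpMatDescr_t matA;
  cusparseDnMatDescr_t matB, matC;
  cusparseCreateCsr(&matA, m, k, nnz, d_indptr, d_indices, d_vals,
                    CUSPARSE_INDEX_32I, CUSPARSE_INDEX_32I,
                    CUSPARSE_INDEX_BASE_ZERO, CUDA_R_16F);
  cusparseCreateDnMat(&matB, k, n, /*ld=*/n, d_B, CUDA_R_16F,
                      CUSPARSE_ORDER_ROW);
  cusparseCreateDnMat(&matC, m, n, /*ld=*/n, d_C, CUDA_R_16F,
                      CUSPARSE_ORDER_ROW);

  // With FP16 matrices, computeType must be CUDA_R_32F, and alpha/beta must
  // match the computeType, i.e. be float rather than __half.
  float alpha = 1.f, beta = 0.f;
  size_t bufferSize = 0;
  void* dBuffer = nullptr;
  cusparseSpMM_bufferSize(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                          CUSPARSE_OPERATION_NON_TRANSPOSE, &alpha, matA, matB,
                          &beta, matC, CUDA_R_32F, CUSPARSE_SPMM_ALG_DEFAULT,
                          &bufferSize);
  cudaMalloc(&dBuffer, bufferSize);
  cusparseSpMM(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
               CUSPARSE_OPERATION_NON_TRANSPOSE, &alpha, matA, matB, &beta,
               matC, CUDA_R_32F, CUSPARSE_SPMM_ALG_DEFAULT, dBuffer);
  cudaFree(dBuffer);
  cusparseDestroySpMat(matA);
  cusparseDestroyDnMat(matB);
  cusparseDestroyDnMat(matC);
}
```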