
[Performance] GSpMM regression with FP16


I spent some time profiling the GAT example with AMP from https://docs.dgl.ai/en/0.9.x/guide/mixed_precision.html to understand why we don't see a performance gain from FP16. I observed a regression in both the forward and backward phases.

Performance on A100

             AMP      FP32
  forward    254 ms   160 ms
  backward   482 ms   369 ms
  step       738 ms   529 ms

Taking the forward phase as an example, there are three EdgeSoftmax-GSpMM pairs.

  (Profiler timeline screenshots for the AMP and FP32 runs omitted.)

Elapsed times (ms) of the three EdgeSoftmax-GSpMM pairs:

                 AMP      FP32
  EdgeSoftmax1   20.896   22.055
  GSpMM1         73.851   31.890
  EdgeSoftmax2   20.865   22.080
  GSpMM2         73.797   31.882
  EdgeSoftmax3   20.869   22.154
  GSpMM3         27.528   13.422
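To quantify the pattern in the table above, here is a quick ratio computation over the measured times (pure Python; the numbers are copied from the profile):

```python
# AMP and FP32 elapsed times (ms) per kernel, from the A100 profile above.
amp = {"EdgeSoftmax1": 20.896, "GSpMM1": 73.851,
       "EdgeSoftmax2": 20.865, "GSpMM2": 73.797,
       "EdgeSoftmax3": 20.869, "GSpMM3": 27.528}
fp32 = {"EdgeSoftmax1": 22.055, "GSpMM1": 31.890,
        "EdgeSoftmax2": 22.080, "GSpMM2": 31.882,
        "EdgeSoftmax3": 22.154, "GSpMM3": 13.422}

# AMP/FP32 ratio per kernel: < 1 means AMP is faster, > 1 is a regression.
ratios = {k: amp[k] / fp32[k] for k in amp}
for k, r in ratios.items():
    print(f"{k}: {r:.2f}x")
```

Every EdgeSoftmax runs at roughly 0.95x under AMP, while each GSpMM takes more than twice as long.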

We can see that EdgeSoftmax is slightly faster with AMP, and that it is GSpMM that causes the regression.

Digging a bit deeper: with FP32 inputs, GSpMM goes through the cusparse::csrmm code path, while with FP16 inputs it falls back to dgl::aten::cuda::SpMMCsrKernel. I'll look into the performance of cuSPARSE with FP16.

ENV

  • CUDA_VERSION: 11.7.0.022
  • CUDA_DRIVER_VERSION: 515.43.04
  • CUSPARSE_VERSION: 11.7.3.50
  • GPU: NVIDIA A100 80GB PCIe
  • PyTorch: 1.12.0a0+8a1a93a

Issue Analytics

  • State: open
  • Created: a year ago
  • Comments: 10 (4 by maintainers)

Top GitHub Comments

chang-l commented, Aug 10, 2022 (1 reaction)

@yaox12 Thank you for sharing your code. I tested it on my side (A5000, cuda11.7), but I observed different results:

  feat dim   fp32 (ms)   fp16 (ms)   fp16 speed-up
  32         17.6        18.75       0.939
  64         21.35       22.1        0.966
  128        40.46       28.7        1.410
  256        78.24       55.59       1.407
  512        156.18      100.48      1.554
  602        203.78      140.86      1.447
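A quick sanity check on the measurements above: recomputing the speed-up column from the raw timings shows that, on this A5000 run, FP16 only starts to pay off at a feature dimension of 128 (a plausible crossover for tensor-core-bound kernels, though the exact threshold is hardware-dependent):

```python
# (feat dim, fp32 ms, fp16 ms) from the A5000 measurements above.
rows = [(32, 17.6, 18.75), (64, 21.35, 22.1), (128, 40.46, 28.7),
        (256, 78.24, 55.59), (512, 156.18, 100.48), (602, 203.78, 140.86)]

speedups = {dim: f32 / f16 for dim, f32, f16 in rows}
# Smallest feature dimension at which FP16 beats FP32.
crossover = min(d for d, s in speedups.items() if s > 1.0)
print(crossover)  # 128
```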

I also forced the code to use cusparseSpMM for FP16. Note that with FP16 inputs, I had to set the compute type (computeType) to CUDA_R_32F when calling cusparseSpMM, according to: https://docs.nvidia.com/cuda/cusparse/index.html#cusparse-generic-function-spmm
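The CUDA_R_32F requirement is about accumulation precision: with a pure FP16 accumulator, long reduction chains (like the per-row sums in SpMM) lose low-order bits quickly. A minimal NumPy illustration of the effect (this assumes NumPy is available; it emulates FP16 accumulation in software, not the actual cuSPARSE kernel):

```python
import numpy as np

# 10,000 FP16 values of ~0.1; the exact sum is ~999.76.
vals = np.full(10_000, 0.1, dtype=np.float16)

# Accumulate in FP16: once the running sum reaches 256, the FP16 spacing
# there (0.25) is more than twice the addend, so each add rounds back to
# the previous value and the sum stops growing.
acc16 = np.float16(0.0)
for v in vals:
    acc16 = np.float16(acc16 + v)

# Accumulate in FP32 (what computeType = CUDA_R_32F gives): stays accurate.
acc32 = np.float32(0.0)
for v in vals:
    acc32 += np.float32(v)

print(acc16, acc32)
```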

yaox12 commented, Oct 24, 2022 (0 reactions)

Still tracking it internally.
