
[Performance] GSpMM regression with FP16


I spent some time profiling the GAT example with AMP from https://docs.dgl.ai/en/0.9.x/guide/mixed_precision.html to understand why we don't see a performance gain from FP16. I observed a regression in both the forward and backward phases.

Performance on A100

             AMP      FP32
  forward    254 ms   160 ms
  backward   482 ms   369 ms
  step       738 ms   529 ms

Taking the forward phase as an example, there are three EdgeSoftmax-GSpMM pairs.

  (Profiler timeline screenshots for the AMP and FP32 runs omitted.)

Elapsed times (ms) of the three EdgeSoftmax-GSpMM pairs:

                 AMP      FP32
  EdgeSoftmax1   20.896   22.055
  GSpMM1         73.851   31.890
  EdgeSoftmax2   20.865   22.080
  GSpMM2         73.797   31.882
  EdgeSoftmax3   20.869   22.154
  GSpMM3         27.528   13.422
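To quantify the pattern in the table above, here is a quick ratio computation over the measured times (pure Python; the numbers are copied from the profile):

```python
# AMP and FP32 elapsed times (ms) per kernel, from the A100 profile above.
amp = {"EdgeSoftmax1": 20.896, "GSpMM1": 73.851,
       "EdgeSoftmax2": 20.865, "GSpMM2": 73.797,
       "EdgeSoftmax3": 20.869, "GSpMM3": 27.528}
fp32 = {"EdgeSoftmax1": 22.055, "GSpMM1": 31.890,
        "EdgeSoftmax2": 22.080, "GSpMM2": 31.882,
        "EdgeSoftmax3": 22.154, "GSpMM3": 13.422}

# AMP/FP32 ratio per kernel: < 1 means AMP is faster, > 1 is a regression.
ratios = {k: amp[k] / fp32[k] for k in amp}
for k, r in ratios.items():
    print(f"{k}: {r:.2f}x")
```

Every EdgeSoftmax runs at roughly 0.95x under AMP, while each GSpMM takes more than twice as long.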

We can see that EdgeSoftmax is slightly faster with AMP, and that it is GSpMM that causes the regression.

Digging a bit deeper: with FP32 inputs, GSpMM goes through the cusparse::csrmm code path, while with FP16 inputs it falls back to dgl::aten::cuda::SpMMCsrKernel. I'll look into the performance of cuSPARSE with FP16.

ENV

  • CUDA_VERSION: 11.7.0.022
  • CUDA_DRIVER_VERSION: 515.43.04
  • CUSPARSE_VERSION: 11.7.3.50
  • GPU: NVIDIA A100 80GB PCIe
  • PyTorch: 1.12.0a0+8a1a93a

Issue Analytics

  • State: open
  • Created: a year ago
  • Comments: 10 (4 by maintainers)

Top GitHub Comments

chang-l commented, Aug 10, 2022 (1 reaction)

@yaox12 Thank you for sharing your code. I tested it on my side (A5000, cuda11.7), but I observed different results:

  feat dim   fp32 (ms)   fp16 (ms)   fp16 speed-up
  32         17.6        18.75       0.939
  64         21.35       22.1        0.966
  128        40.46       28.7        1.410
  256        78.24       55.59       1.407
  512        156.18      100.48      1.554
  602        203.78      140.86      1.447
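A quick sanity check on the measurements above: recomputing the speed-up column from the raw timings shows that, on this A5000 run, FP16 only starts to pay off at a feature dimension of 128 (a plausible crossover for tensor-core-bound kernels, though the exact threshold is hardware-dependent):

```python
# (feat dim, fp32 ms, fp16 ms) from the A5000 measurements above.
rows = [(32, 17.6, 18.75), (64, 21.35, 22.1), (128, 40.46, 28.7),
        (256, 78.24, 55.59), (512, 156.18, 100.48), (602, 203.78, 140.86)]

speedups = {dim: f32 / f16 for dim, f32, f16 in rows}
# Smallest feature dimension at which FP16 beats FP32.
crossover = min(d for d, s in speedups.items() if s > 1.0)
print(crossover)  # 128
```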

I also forced the code to use cusparseSpMM for FP16. Note that with FP16 inputs, I had to set the compute type (computeType) to CUDA_R_32F when calling cusparseSpMM, according to: https://docs.nvidia.com/cuda/cusparse/index.html#cusparse-generic-function-spmm
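The CUDA_R_32F requirement is about accumulation precision: with a pure FP16 accumulator, long reduction chains (like the per-row sums in SpMM) lose low-order bits quickly. A minimal NumPy illustration of the effect (this assumes NumPy is available; it emulates FP16 accumulation in software, not the actual cuSPARSE kernel):

```python
import numpy as np

# 10,000 FP16 values of ~0.1; the exact sum is ~999.76.
vals = np.full(10_000, 0.1, dtype=np.float16)

# Accumulate in FP16: once the running sum reaches 256, the FP16 spacing
# there (0.25) is more than twice the addend, so each add rounds back to
# the previous value and the sum stops growing.
acc16 = np.float16(0.0)
for v in vals:
    acc16 = np.float16(acc16 + v)

# Accumulate in FP32 (what computeType = CUDA_R_32F gives): stays accurate.
acc32 = np.float32(0.0)
for v in vals:
    acc32 += np.float32(v)

print(acc16, acc32)
```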

yaox12 commented, Oct 24, 2022 (0 reactions)

Still tracking it internally.
