CUDA compilation error with Ctx Length>2000

Hello, I am trying out RWKV with an audio modality. When I set T_MAX much larger than 1000 (10000 in the log below), the CUDA extension fails to compile with this error:

Emitting ninja build file /root/.cache/torch_extensions/py39_cu116/timex/build.ninja...
Building extension module timex...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/2] /usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=timex -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1013\" -isystem /root/anaconda3/envs/surya-env/lib/python3.9/site-packages/torch/include -isystem /root/anaconda3/envs/surya-env/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/surya-env/lib/python3.9/site-packages/torch/include/TH -isystem /root/anaconda3/envs/surya-env/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /root/anaconda3/envs/surya-env/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' --use_fast_math --extra-device-vectorization -DTmax=10000 -DBF=8 -DBB=2 -std=c++14 -c cuda/timex_cuda.cu -o timex_cuda.cuda.o 
FAILED: timex_cuda.cuda.o 
/usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=timex -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1013\" -isystem /root/anaconda3/envs/surya-env/lib/python3.9/site-packages/torch/include -isystem /root/anaconda3/envs/surya-env/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/surya-env/lib/python3.9/site-packages/torch/include/TH -isystem /root/anaconda3/envs/surya-env/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /root/anaconda3/envs/surya-env/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' --use_fast_math --extra-device-vectorization -DTmax=10000 -DBF=8 -DBB=2 -std=c++14 -c cuda/timex_cuda.cu -o timex_cuda.cuda.o 
ptxas error   : Entry function '_Z15kernel_backwardIfEvPKT_S2_S2_PS0_S3_iii' uses too much shared data (0x30d40 bytes, 0xc000 max)
ptxas error   : Entry function '_Z14kernel_forwardIfEvPKT_S2_PS0_S0_iii' uses too much shared data (0x57e40 bytes, 0xc000 max)
ninja: build stopped: subcommand failed.
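
A rough reading of the two ptxas errors, offered as a sketch and not as a statement about the kernel source: 0xc000 is 49152 bytes (48 KiB), and the two requests decode to 200000 and 360000 bytes. Dividing by Tmax=10000 and by 4 bytes per element (the `IfE` in the mangled names marks the float instantiation) gives 5 and 9 floats per timestep. That would be consistent with shared buffers sized (1 + 2*BB) * Tmax and (1 + BF) * Tmax for BB=2 and BF=8, but that mapping is an inference from the numbers only. The helper names below are hypothetical.

```python
# Values copied from the build log above.
LIMIT = 0xC000       # shared-memory limit per block reported by ptxas
BACKWARD = 0x30D40   # bytes requested by kernel_backward
FORWARD = 0x57E40    # bytes requested by kernel_forward
T_MAX = 10000        # from -DTmax=10000 on the nvcc command line
FLOAT_BYTES = 4      # float instantiation ("IfE" in the mangled names)


def floats_per_timestep(total_bytes, t_max=T_MAX, elem_bytes=FLOAT_BYTES):
    """How many elements of shared memory the kernel asks for per timestep."""
    return total_bytes / (t_max * elem_bytes)


def largest_t_max(per_step, limit=LIMIT, elem_bytes=FLOAT_BYTES):
    """Largest Tmax that fits in `limit`, ASSUMING usage grows linearly with Tmax."""
    return limit // (per_step * elem_bytes)


if __name__ == "__main__":
    for name, requested in (("forward", FORWARD), ("backward", BACKWARD)):
        per_step = floats_per_timestep(requested)
        print(name, requested, "bytes,", per_step, "floats per timestep,",
              "max Tmax under the linear assumption:", largest_t_max(int(per_step)))
```

If the linear assumption holds, the forward kernel is the tighter constraint: with BF=8 unchanged, Tmax would have to stay near 1365 or below for the static shared memory to fit in 48 KiB.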

GPU: A100, VRAM: 42GB, CUDA 11.6

I am okay if training takes a bit longer, but I need this to work. I don't know any CUDA. Can you suggest some workarounds?

Thanks for the incredible work btw!

Top GitHub Comments

BlinkDL commented, Jul 9, 2022

> @BlinkDL Can you please point out where we need to make a change to the code to reduce the tensor element from 4 bytes to 2 bytes? Thanks a lot!

And the current design will overflow under FP16 😃 Wait for my new kernels.
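
To illustrate what "overflow under FP16" means in general, here is a minimal sketch using only the Python standard library. The largest finite half-precision value is 65504, so a moderately large intermediate such as exp(12) cannot be stored. The thread does not say which quantity in the kernel overflows; using an exponential here is an assumption made for illustration, and `fits_in_fp16` is a hypothetical helper.

```python
import math
import struct

FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def fits_in_fp16(x):
    """Return True if x can be packed as a finite 2-byte float."""
    try:
        struct.pack("<e", x)  # "e" is the half-precision format code
        return True
    except OverflowError:
        return False


if __name__ == "__main__":
    # Illustrative values only; not taken from the RWKV kernel.
    for exponent in (11, 12):
        value = math.exp(exponent)
        print(f"exp({exponent}) = {value:.1f}, fits in FP16: {fits_in_fp16(value)}")
```

Switching to 2-byte elements would halve the shared-memory footprint, but values like these would no longer be representable, which is one way a 4-byte design can break when narrowed.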

BlinkDL commented, Aug 20, 2022

Now the new RWKV-4 can compile ctxlen=4096 kernels 😃
