CUDA kernel error for BaguaStrategy with algorithm="async"
🐛 Bug
To Reproduce
import os

import torch
from torch.utils.data import DataLoader, Dataset

from pytorch_lightning import LightningModule, Trainer
from pytorch_lightning.strategies import BaguaStrategy


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return {"loss": loss}

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


def run():
    train_data = DataLoader(RandomDataset(32, 64), batch_size=2)
    model = BoringModel()
    trainer = Trainer(
        default_root_dir=os.getcwd(),
        limit_train_batches=1,
        num_sanity_val_steps=0,
        max_epochs=1,
        enable_model_summary=False,
        accelerator="gpu",
        devices=2,
        strategy=BaguaStrategy(algorithm="async"),
    )
    trainer.fit(model, train_dataloaders=train_data)


if __name__ == "__main__":
    run()
algorithm="gradient_all_reduce": no error
algorithm="decentralized": no error
algorithm="async":
Failed: Cuda error kernels/bagua_kernels.cu:628 'no kernel image is available for execution on the device'
algorithm="bytegrad":
Failed: Cuda error kernels/bagua_kernels.cu:285 'invalid device function'
algorithm="low_precision_decentralized":
Failed: Cuda error kernels/bagua_kernels.cu:597 'no kernel image is available for execution on the device'
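Both messages are generic CUDA runtime errors that usually mean the loaded binary contains no code compiled for the GPU's architecture. As a rough diagnostic sketch (not part of the original report), the snippet below compares each GPU's compute capability with the architectures the installed PyTorch build was compiled for. The helper name `arch_is_covered` is made up for illustration, and the exact-match rule is a simplification: binaries for a lower minor version of the same major architecture, or embedded PTX, can also run. It inspects PyTorch's build only, not Bagua's kernels, so it can rule out one cause rather than confirm the other.

```python
from typing import Iterable, Tuple


def arch_is_covered(capability: Tuple[int, int], arch_list: Iterable[str]) -> bool:
    """Return True if an exact sm_XY entry for `capability` appears in `arch_list`.

    Hypothetical helper: exact matching is a simplification of CUDA's
    real compatibility rules.
    """
    major, minor = capability
    return f"sm_{major}{minor}" in set(arch_list)


if __name__ == "__main__":
    try:
        import torch  # imported lazily so the helper works without PyTorch
    except ImportError:
        print("PyTorch is not installed; nothing to check.")
    else:
        # Architectures the installed PyTorch binary was compiled for.
        archs = torch.cuda.get_arch_list()
        for idx in range(torch.cuda.device_count()):
            cap = torch.cuda.get_device_capability(idx)
            name = torch.cuda.get_device_name(idx)
            print(f"{name}: capability {cap}, covered by PyTorch build: "
                  f"{arch_is_covered(cap, archs)}")
```

An RTX 3090 reports compute capability (8, 6), so the entry to look for is `sm_86`.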
Expected behavior
No error.
Environment
* CUDA:
    - GPU:
        - NVIDIA GeForce RTX 3090
        - NVIDIA GeForce RTX 3090
        - NVIDIA GeForce RTX 3090
        - NVIDIA GeForce RTX 3090
        - NVIDIA GeForce RTX 3090
        - NVIDIA GeForce RTX 3090
        - NVIDIA GeForce RTX 3090
        - NVIDIA GeForce RTX 3090
    - available: True
    - version: 11.3
* Packages:
    - numpy: 1.21.2
    - pyTorch_debug: False
    - pyTorch_version: 1.11.0
    - pytorch-lightning: 1.7.0dev
    - tqdm: 4.62.3
* System:
    - OS: Linux
    - architecture:
        - 64bit
        - ELF
    - processor: x86_64
    - python: 3.9.7
    - version: #64-Ubuntu SMP Wed Dec 9 08:16:25 UTC 2020
Additional context
Installed bagua-cuda111
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Oct_12_20:09:46_PDT_2020
Cuda compilation tools, release 11.1, V11.1.105
Build cuda_11.1.TC455_06.29190527_0
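Note that nvcc reports release 11.1 here, while the environment section lists CUDA version 11.3. A small sketch for pulling the release number out of `nvcc --version` output so the two can be compared in a script; the helper names `nvcc_release` and `same_release` are hypothetical. A mismatch is not necessarily an error in itself, since PyTorch binaries ship their own CUDA runtime, but it may matter when picking a pre-built bagua package or building from source.

```python
import re
from typing import Optional


def nvcc_release(nvcc_output: str) -> Optional[str]:
    """Extract 'MAJOR.MINOR' from the text printed by `nvcc --version`."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    return f"{match.group(1)}.{match.group(2)}" if match else None


def same_release(a: str, b: str) -> bool:
    """Compare two CUDA version strings on major and minor only."""
    return a.split(".")[:2] == b.split(".")[:2]


if __name__ == "__main__":
    # Line copied from the nvcc output in this report.
    sample = "Cuda compilation tools, release 11.1, V11.1.105"
    toolkit = nvcc_release(sample)
    # "11.3" is the CUDA version listed in the environment section.
    print("nvcc release:", toolkit)
    print("matches environment CUDA 11.3:", same_release(toolkit, "11.3"))
```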
@wangraying Thank you for your advice ^_^. I failed to build bagua on my local machine (Ubuntu 22.04), but it works in Docker. My Dockerfile is posted below (for anyone who needs it).

@quancs It seems you are using CUDA 11.6 on your working node and PyTorch. bagua-cuda113 is compiled under CUDA 11.3. We currently do not support pre-compiled packages for CUDA 11.6. You may install bagua manually by following the tutorials here.
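The package names in this thread (bagua-cuda111, bagua-cuda113) appear to encode the CUDA version they were compiled under. As a minimal sketch of the check the maintainer describes, the snippet below derives the expected package name from a CUDA version string and looks it up in a caller-supplied list. The helper names are hypothetical, and the sample list contains only the two names mentioned in this issue; it is not a complete list of published packages.

```python
from typing import Iterable, Optional


def bagua_package_name(cuda_version: str) -> str:
    """Map a CUDA version such as '11.3' to a name like 'bagua-cuda113'.

    Assumes the naming pattern seen in this thread holds in general.
    """
    major, minor = cuda_version.split(".")[:2]
    return f"bagua-cuda{major}{minor}"


def matching_prebuilt(cuda_version: str, prebuilt: Iterable[str]) -> Optional[str]:
    """Return the pre-built package matching `cuda_version`, or None."""
    name = bagua_package_name(cuda_version)
    return name if name in set(prebuilt) else None


if __name__ == "__main__":
    # Only the names mentioned in this issue; not an exhaustive list.
    known = ["bagua-cuda111", "bagua-cuda113"]
    try:
        import torch  # imported lazily so the helpers work without PyTorch
        cuda = torch.version.cuda  # None for CPU-only builds
    except ImportError:
        cuda = None
    if cuda is None:
        print("No CUDA-enabled PyTorch build found.")
    else:
        match = matching_prebuilt(cuda, known)
        print(f"PyTorch CUDA {cuda}:", match or "no match, build bagua manually")
```

With CUDA 11.6, as in the reply above, the lookup returns no match, which corresponds to the advice to install bagua manually.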