CUDA kernel error for BaguaStrategy with algorithm="async"

🐛 Bug

To Reproduce

import os

import torch
from torch.utils.data import DataLoader, Dataset

from pytorch_lightning import LightningModule, Trainer
from pytorch_lightning.strategies import BaguaStrategy


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return {"loss": loss}

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


def run():
    train_data = DataLoader(RandomDataset(32, 64), batch_size=2)

    model = BoringModel()
    trainer = Trainer(
        default_root_dir=os.getcwd(),
        limit_train_batches=1,
        num_sanity_val_steps=0,
        max_epochs=1,
        enable_model_summary=False,
        accelerator="gpu",
        devices=2,
        strategy=BaguaStrategy(algorithm="async")
    )
    trainer.fit(model, train_dataloaders=train_data)


if __name__ == "__main__":
    run()

algorithm="gradient_all_reduce": no error

algorithm="decentralized": no error

algorithm="async":

Failed: Cuda error kernels/bagua_kernels.cu:628 'no kernel image is available for execution on the device

algorithm="bytegrad":

Failed: Cuda error kernels/bagua_kernels.cu:285 'invalid device function'

algorithm="low_precision_decentralized":

Failed: Cuda error kernels/bagua_kernels.cu:597 'no kernel image is available for execution on the device'

Expected behavior

No error.

Environment

* CUDA:
        - GPU:
                - NVIDIA GeForce RTX 3090
                - NVIDIA GeForce RTX 3090
                - NVIDIA GeForce RTX 3090
                - NVIDIA GeForce RTX 3090
                - NVIDIA GeForce RTX 3090
                - NVIDIA GeForce RTX 3090
                - NVIDIA GeForce RTX 3090
                - NVIDIA GeForce RTX 3090
        - available:         True
        - version:           11.3
* Packages:
        - numpy:             1.21.2
        - pyTorch_debug:     False
        - pyTorch_version:   1.11.0
        - pytorch-lightning: 1.7.0dev
        - tqdm:              4.62.3
* System:
        - OS:                Linux
        - architecture:
                - 64bit
                - ELF
        - processor:         x86_64
        - python:            3.9.7
        - version:           #64-Ubuntu SMP Wed Dec 9 08:16:25 UTC 2020

Additional context

Installed bagua-cuda111

nvcc --version 

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Oct_12_20:09:46_PDT_2020
Cuda compilation tools, release 11.1, V11.1.105
Build cuda_11.1.TC455_06.29190527_0
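
The toolkit release reported above (11.1) differs from the CUDA version listed under Environment (11.3), so it can be useful to extract the release number programmatically when comparing installations. This is a minimal sketch that assumes the output format shown above; `parse_nvcc_release` is a hypothetical helper and is not part of Bagua or PyTorch.

```python
import re
from typing import Optional


def parse_nvcc_release(nvcc_output: str) -> Optional[str]:
    """Return the CUDA release (e.g. "11.1") from `nvcc --version` output, or None."""
    # Assumes a line of the form "Cuda compilation tools, release 11.1, V11.1.105".
    match = re.search(r"release\s+(\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


# Sample text copied from the report above.
SAMPLE = """nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Oct_12_20:09:46_PDT_2020
Cuda compilation tools, release 11.1, V11.1.105
Build cuda_11.1.TC455_06.29190527_0
"""

print(parse_nvcc_release(SAMPLE))
```

On a machine with the toolkit installed, the input could come from `subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout`.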

cc @awaelchli @wangraying @akihironitta

Top GitHub Comments

quancs commented, Jul 11, 2022

@wangraying Thank you for your advice ^_^. I failed to build Bagua on my local machine (Ubuntu 22.04), but the build works in Docker. My Dockerfile is posted below, for anyone who needs it.

FROM pytorch/pytorch:1.12.0-cuda11.3-cudnn8-devel
RUN apt update && apt install gcc curl -y
# the requirements of my project
RUN pip install jsonargparse[signatures,urls] pesq torchmetrics[audio] omegaconf pytorch-lightning rich soundfile pandas torchdata mypy yapf

# config bagua
# if githubusercontent is unavailable (e.g. China), download it first. Then copy it to the dockerfile folder
# COPY ./install.sh /root
# RUN bash /root/install.sh
RUN curl -Ls https://raw.githubusercontent.com/BaguaSys/bagua/master/install.sh | bash
RUN python -c "import bagua_core;bagua_core.install_deps()"

wangraying commented, Jul 11, 2022

@quancs It seems you are using CUDA 11.6 for both your working node and PyTorch. bagua-cuda113 is compiled against CUDA 11.3. We currently do not provide pre-compiled packages for CUDA 11.6.

You can install Bagua manually by following the tutorials here.
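
A mismatch like the one described in this comment can be checked before installing. The following is a minimal sketch, not an official tool. It assumes the package suffix encodes the CUDA version as the major version followed by a single-digit minor version, as in `bagua-cuda111` and `bagua-cuda113` from this thread. That naming rule and both helper names are assumptions for illustration, not documented Bagua behavior.

```python
import re
from typing import Optional


def package_cuda_version(package_name: str) -> Optional[str]:
    """Decode e.g. "bagua-cuda113" -> "11.3" (assumed rule: last digit is the minor version)."""
    match = re.fullmatch(r"bagua-cuda(\d+)(\d)", package_name)
    if match is None:
        return None
    return f"{int(match.group(1))}.{match.group(2)}"


def matches_runtime(package_name: str, runtime_cuda: str) -> bool:
    """True if the package's assumed CUDA version equals the runtime's major.minor."""
    # Drop any patch component, so "11.3.1" is compared as "11.3".
    major_minor = ".".join(runtime_cuda.split(".")[:2])
    return package_cuda_version(package_name) == major_minor


# The combination from the comment above: a CUDA 11.3 package on a CUDA 11.6 setup.
print(matches_runtime("bagua-cuda113", "11.6"))  # False
print(matches_runtime("bagua-cuda113", "11.3"))  # True
```

With PyTorch installed, `torch.version.cuda` reports the CUDA version PyTorch was built against and is one possible source for the `runtime_cuda` argument.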
