CUDA error: no kernel image is available for execution on the device
Hi, mate
I had an issue when running a customised script on my own training dataset.
The environment is built from the Dockerfile provided in the repo; the only change is that I added an SSH server to the Dockerfile. I think I may be missing something, because it crashes at the ROI alignment step.
The details are attached below:
To Reproduce
Code I wrote (pretty much copied from the Colab notebook):
import os

from detectron2.engine import DefaultTrainer
from detectron2.config import get_cfg
cfg = get_cfg()
cfg.merge_from_file("/detectron2_repo/configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
cfg.DATASETS.TRAIN = ("dental/train",)
cfg.DATASETS.TEST = ("dental/eval",) # no metrics implemented for this dataset
cfg.DATALOADER.NUM_WORKERS = 1
cfg.MODEL.WEIGHTS = "detectron2://COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x/137849600/model_final_f10217.pkl"
# initialize from model zoo
cfg.SOLVER.IMS_PER_BATCH = 2
cfg.SOLVER.BASE_LR = 0.00025
cfg.SOLVER.MAX_ITER = 300 # 300 iterations seems good enough, but you can certainly train longer
cfg.MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE = 128 # faster, and good enough for this toy dataset
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1 # copied from the balloon tutorial; my dental dataset actually has 12 classes
os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
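(For context: the "dental/train" / "dental/eval" names above are registered as COCO-format datasets earlier in my pipeline, which is not shown here. A rough sketch of that registration, with placeholder paths standing in for the real annotation file and image root, looks like this:)

from detectron2.data.datasets import register_coco_instances

# Placeholder paths; the real annotation JSON and image root live under /root/dentalpoc/.
register_coco_instances("dental/train", {}, "/path/to/dental_train.json", "/path/to/images")
register_coco_instances("dental/eval", {}, "/path/to/dental_eval.json", "/path/to/images")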
Error log:
Failed to load OpenCL runtime
Category ids in annotations are not in [1, #categories]! We'll apply a mapping for you.
Metadata(evaluator_type='coco', image_root='/root/dentalpoc/data/raw', json_file='/root/dentalpoc/data/coco_format/@ 2019-11-06 04.37.49 UTC/dental_train.json', name='dental/train', thing_classes=['decay', 'debris', 'restoration', 'filling', 'other issue', 'staining', 'gingivitis', 'plaque', 'gum', 'wear', 'brokentooth', 'gumrecession'], thing_dataset_id_to_contiguous_id={0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8, 9: 9, 10: 10, 11: 11})
Config '/detectron2_repo/configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml' has no VERSION. Assuming it to be compatible with latest v2.
Category ids in annotations are not in [1, #categories]! We'll apply a mapping for you.
Traceback (most recent call last):
File "/root/DT/src/pipelines/train/DT.py", line 78, in <module>
trainer.train()
File "/detectron2_repo/detectron2/engine/defaults.py", line 350, in train
super().train(self.start_iter, self.max_iter)
File "/detectron2_repo/detectron2/engine/train_loop.py", line 132, in train
self.run_step()
File "/detectron2_repo/detectron2/engine/train_loop.py", line 212, in run_step
loss_dict = self.model(data)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/detectron2_repo/detectron2/modeling/meta_arch/rcnn.py", line 88, in forward
_, detector_losses = self.roi_heads(images, features, proposals, gt_instances)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/detectron2_repo/detectron2/modeling/roi_heads/roi_heads.py", line 561, in forward
losses = self._forward_box(features_list, proposals)
File "/detectron2_repo/detectron2/modeling/roi_heads/roi_heads.py", line 615, in _forward_box
box_features = self.box_pooler(features, [x.proposal_boxes for x in proposals])
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/detectron2_repo/detectron2/modeling/poolers.py", line 208, in forward
output[inds] = pooler(x_level, pooler_fmt_boxes_level)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/detectron2_repo/detectron2/layers/roi_align.py", line 95, in forward
input, rois, self.output_size, self.spatial_scale, self.sampling_ratio, self.aligned
File "/detectron2_repo/detectron2/layers/roi_align.py", line 20, in forward
input, roi, spatial_scale, output_size[0], output_size[1], sampling_ratio, aligned
RuntimeError: CUDA error: no kernel image is available for execution on the device (ROIAlign_forward_cuda at /detectron2_repo/detectron2/layers/csrc/ROIAlign/ROIAlign_cuda.cu:361)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7f3b3fce5813 in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
frame #1: detectron2::ROIAlign_forward_cuda(at::Tensor const&, at::Tensor const&, float, int, int, int, bool) + 0xa24 (0x7f3b3e44e556 in /detectron2_repo/detectron2/_C.cpython-36m-x86_64-linux-gnu.so)
frame #2: detectron2::ROIAlign_forward(at::Tensor const&, at::Tensor const&, float, int, int, int, bool) + 0xb6 (0x7f3b3e3d21b6 in /detectron2_repo/detectron2/_C.cpython-36m-x86_64-linux-gnu.so)
frame #3: <unknown function> + 0x42cfb (0x7f3b3e3e3cfb in /detectron2_repo/detectron2/_C.cpython-36m-x86_64-linux-gnu.so)
frame #4: <unknown function> + 0x3bfe0 (0x7f3b3e3dcfe0 in /detectron2_repo/detectron2/_C.cpython-36m-x86_64-linux-gnu.so)
frame #5: /usr/bin/python3() [0x50abc5]
frame #6: _PyEval_EvalFrameDefault + 0x449 (0x50c549 in /usr/bin/python3)
frame #7: /usr/bin/python3() [0x5081d5]
frame #8: /usr/bin/python3() [0x58952b]
frame #9: PyObject_Call + 0x3e (0x5a04ce in /usr/bin/python3)
frame #10: THPFunction_apply(_object*, _object*) + 0xa4f (0x7f3b8b37e4af in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #11: /usr/bin/python3() [0x50a84f]
frame #12: _PyEval_EvalFrameDefault + 0x449 (0x50c549 in /usr/bin/python3)
frame #13: _PyFunction_FastCallDict + 0xf5 (0x5093e5 in /usr/bin/python3)
frame #14: /usr/bin/python3() [0x5951c1]
frame #15: PyObject_Call + 0x3e (0x5a04ce in /usr/bin/python3)
frame #16: _PyEval_EvalFrameDefault + 0x17f5 (0x50d8f5 in /usr/bin/python3)
frame #17: /usr/bin/python3() [0x5081d5]
frame #18: _PyFunction_FastCallDict + 0x2e2 (0x5095d2 in /usr/bin/python3)
frame #19: /usr/bin/python3() [0x5951c1]
frame #20: /usr/bin/python3() [0x54ac01]
frame #21: _PyObject_FastCallKeywords + 0x19c (0x5aa69c in /usr/bin/python3)
frame #22: /usr/bin/python3() [0x50ab53]
frame #23: _PyEval_EvalFrameDefault + 0x449 (0x50c549 in /usr/bin/python3)
frame #24: _PyFunction_FastCallDict + 0xf5 (0x5093e5 in /usr/bin/python3)
frame #25: /usr/bin/python3() [0x5951c1]
frame #26: PyObject_Call + 0x3e (0x5a04ce in /usr/bin/python3)
frame #27: _PyEval_EvalFrameDefault + 0x17f5 (0x50d8f5 in /usr/bin/python3)
frame #28: /usr/bin/python3() [0x5081d5]
frame #29: _PyFunction_FastCallDict + 0x2e2 (0x5095d2 in /usr/bin/python3)
frame #30: /usr/bin/python3() [0x5951c1]
frame #31: /usr/bin/python3() [0x54ac01]
frame #32: _PyObject_FastCallKeywords + 0x19c (0x5aa69c in /usr/bin/python3)
frame #33: /usr/bin/python3() [0x50ab53]
frame #34: _PyEval_EvalFrameDefault + 0x449 (0x50c549 in /usr/bin/python3)
frame #35: /usr/bin/python3() [0x509ce8]
frame #36: /usr/bin/python3() [0x50aa1d]
frame #37: _PyEval_EvalFrameDefault + 0x449 (0x50c549 in /usr/bin/python3)
frame #38: /usr/bin/python3() [0x5081d5]
frame #39: _PyFunction_FastCallDict + 0x2e2 (0x5095d2 in /usr/bin/python3)
frame #40: /usr/bin/python3() [0x5951c1]
frame #41: PyObject_Call + 0x3e (0x5a04ce in /usr/bin/python3)
frame #42: _PyEval_EvalFrameDefault + 0x17f5 (0x50d8f5 in /usr/bin/python3)
frame #43: /usr/bin/python3() [0x5081d5]
frame #44: _PyFunction_FastCallDict + 0x2e2 (0x5095d2 in /usr/bin/python3)
frame #45: /usr/bin/python3() [0x5951c1]
frame #46: /usr/bin/python3() [0x54ac01]
frame #47: _PyObject_FastCallKeywords + 0x19c (0x5aa69c in /usr/bin/python3)
frame #48: /usr/bin/python3() [0x50ab53]
frame #49: _PyEval_EvalFrameDefault + 0x449 (0x50c549 in /usr/bin/python3)
frame #50: /usr/bin/python3() [0x5081d5]
frame #51: _PyFunction_FastCallDict + 0x2e2 (0x5095d2 in /usr/bin/python3)
frame #52: /usr/bin/python3() [0x5951c1]
frame #53: PyObject_Call + 0x3e (0x5a04ce in /usr/bin/python3)
frame #54: _PyEval_EvalFrameDefault + 0x17f5 (0x50d8f5 in /usr/bin/python3)
frame #55: /usr/bin/python3() [0x5081d5]
frame #56: _PyFunction_FastCallDict + 0x2e2 (0x5095d2 in /usr/bin/python3)
frame #57: /usr/bin/python3() [0x5951c1]
frame #58: /usr/bin/python3() [0x54ac01]
frame #59: _PyObject_FastCallKeywords + 0x19c (0x5aa69c in /usr/bin/python3)
frame #60: /usr/bin/python3() [0x50ab53]
frame #61: _PyEval_EvalFrameDefault + 0x449 (0x50c549 in /usr/bin/python3)
frame #62: /usr/bin/python3() [0x509ce8]
frame #63: /usr/bin/python3() [0x50aa1d]
Process finished with exit code 1
Environment
A Docker container running on an AWS Deep Learning AMI instance.
My Dockerfile is a modified version of the Dockerfile provided in the repo:
FROM nvidia/cuda:10.1-cudnn7-devel
# To use this Dockerfile:
# 1. `nvidia-docker build -t detectron2:v0 .`
# 2. `nvidia-docker run -it --name detectron2 detectron2:v0`
################### env and args #################
ENV DEBIAN_FRONTEND noninteractive
ARG user
ARG password
################# following are from detectron official repo ############
RUN apt-get update && apt-get install -y \
libpng-dev libjpeg-dev python3-opencv ca-certificates \
python3-dev build-essential pkg-config git curl wget automake libtool && \
rm -rf /var/lib/apt/lists/*
RUN curl -fSsL -O https://bootstrap.pypa.io/get-pip.py && \
python3 get-pip.py && \
rm get-pip.py
# install dependencies
# See https://pytorch.org/ for other options if you use a different version of CUDA
# old version pytorch
# RUN pip install torch==1.2.0 torchvision==0.4.0 -f https://download.pytorch.org/whl/torch_stable.html
RUN pip install torch torchvision cython \
'git+https://github.com/facebookresearch/fvcore'
RUN pip install 'git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI'
# install detectron2
RUN git clone https://github.com/facebookresearch/detectron2 /detectron2_repo
ENV FORCE_CUDA="1"
ENV TORCH_CUDA_ARCH_LIST="Maxwell;Maxwell+Tegra;Pascal;Volta;Turing"
RUN pip install -e /detectron2_repo
# install openssh server
RUN apt-get update && apt-get install -y openssh-server
RUN mkdir /var/run/sshd
RUN echo "$user:$password" | chpasswd
RUN sed -i 's/PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config
RUN echo "PermitRootLogin yes" >> /etc/ssh/sshd_config
#RUN echo "prohibit-password/PermitRootLogin yes" >> /etc/ssh/sshd_config
#RUN echo "PubkeyAuthentication yes" >> /etc/ssh/sshd_config
# SSH login fix. Otherwise user is kicked off after login
RUN sed 's@session\s*required\s*pam_loginuid.so@session optional pam_loginuid.so@g' -i /etc/pam.d/sshd
ENV NOTVISIBLE "in users profile"
RUN echo "export VISIBLE=now" >> /etc/profile
RUN apt-get update && apt-get install -y tmux
#install extra requirements
COPY requirements.txt requirements.txt
RUN pip3 install -r requirements.txt
# ready to go!
WORKDIR /detectron2_repo
EXPOSE 22
EXPOSE 6006
EXPOSE 8888
EXPOSE 5000
CMD ["/usr/sbin/sshd", "-D"]
Top GitHub Comments
@ruodingt your Dockerfile specifies:
ENV TORCH_CUDA_ARCH_LIST="Maxwell;Maxwell+Tegra;Pascal;Volta;Turing"
However, your graphics card is: GPU 0: Tesla K80,
which is Kepler. Did you try to specify something like: ENV TORCH_CUDA_ARCH_LIST="Kepler;Kepler+Tesla;Maxwell;Maxwell+Tegra;Pascal;Volta;Turing"?
@vkhalidov @ppwwyyxx Thank you. It works after I changed the ENV.
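For anyone who hits the same error: before rebuilding, it helps to confirm which compute capability the GPU actually reports, so that TORCH_CUDA_ARCH_LIST covers it. A minimal check, assuming PyTorch is installed with CUDA support:

import torch

# Print the compute capability of every visible GPU.
# A Tesla K80 reports (3, 7), i.e. Kepler, which the original arch list did not cover.
for i in range(torch.cuda.device_count()):
    name = torch.cuda.get_device_name(i)
    major, minor = torch.cuda.get_device_capability(i)
    print(f"GPU {i}: {name}, compute capability {major}.{minor}")

With the architecture confirmed, the Dockerfile line can be extended to ENV TORCH_CUDA_ARCH_LIST="Kepler;Kepler+Tesla;Maxwell;Maxwell+Tegra;Pascal;Volta;Turing" and detectron2 reinstalled so its CUDA kernels are compiled for that architecture.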