CUDA error: no kernel image is available for execution on the device
Hi, mate
I had an issue when running a customised script on my own training dataset.
The environment is built from the Dockerfile provided in the repo; the only change is that I added an SSH server to the Dockerfile. I think I may be missing something, because it crashes at the ROI alignment step.
The details are attached below:
To Reproduce
Code I wrote (pretty much copied from the Colab notebook):
import os

from detectron2.engine import DefaultTrainer
from detectron2.config import get_cfg
cfg = get_cfg()
cfg.merge_from_file("/detectron2_repo/configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
cfg.DATASETS.TRAIN = ("dental/train",)
cfg.DATASETS.TEST = ("dental/eval",) # no metrics implemented for this dataset
cfg.DATALOADER.NUM_WORKERS = 1
cfg.MODEL.WEIGHTS = "detectron2://COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x/137849600/model_final_f10217.pkl"
# initialize from model zoo
cfg.SOLVER.IMS_PER_BATCH = 2
cfg.SOLVER.BASE_LR = 0.00025
cfg.SOLVER.MAX_ITER = 300 # 300 iterations seems good enough, but you can certainly train longer
cfg.MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE = 128 # faster, and good enough for this toy dataset
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1 # copied from the balloon tutorial; my dental dataset actually has 12 classes
os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
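(For context: the "dental/train" / "dental/eval" names above are registered as COCO-format datasets earlier in my pipeline, which is not shown here. A rough sketch of that registration, with placeholder paths standing in for the real annotation file and image root, looks like this:)

from detectron2.data.datasets import register_coco_instances

# Placeholder paths; the real annotation JSON and image root live under /root/dentalpoc/.
register_coco_instances("dental/train", {}, "/path/to/dental_train.json", "/path/to/images")
register_coco_instances("dental/eval", {}, "/path/to/dental_eval.json", "/path/to/images")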
Error log:
Failed to load OpenCL runtime
Category ids in annotations are not in [1, #categories]! We'll apply a mapping for you.
Metadata(evaluator_type='coco', image_root='/root/dentalpoc/data/raw', json_file='/root/dentalpoc/data/coco_format/@ 2019-11-06 04.37.49 UTC/dental_train.json', name='dental/train', thing_classes=['decay', 'debris', 'restoration', 'filling', 'other issue', 'staining', 'gingivitis', 'plaque', 'gum', 'wear', 'brokentooth', 'gumrecession'], thing_dataset_id_to_contiguous_id={0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8, 9: 9, 10: 10, 11: 11})
Config '/detectron2_repo/configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml' has no VERSION. Assuming it to be compatible with latest v2.
Category ids in annotations are not in [1, #categories]! We'll apply a mapping for you.
Traceback (most recent call last):
File "/root/DT/src/pipelines/train/DT.py", line 78, in <module>
trainer.train()
File "/detectron2_repo/detectron2/engine/defaults.py", line 350, in train
super().train(self.start_iter, self.max_iter)
File "/detectron2_repo/detectron2/engine/train_loop.py", line 132, in train
self.run_step()
File "/detectron2_repo/detectron2/engine/train_loop.py", line 212, in run_step
loss_dict = self.model(data)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/detectron2_repo/detectron2/modeling/meta_arch/rcnn.py", line 88, in forward
_, detector_losses = self.roi_heads(images, features, proposals, gt_instances)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/detectron2_repo/detectron2/modeling/roi_heads/roi_heads.py", line 561, in forward
losses = self._forward_box(features_list, proposals)
File "/detectron2_repo/detectron2/modeling/roi_heads/roi_heads.py", line 615, in _forward_box
box_features = self.box_pooler(features, [x.proposal_boxes for x in proposals])
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/detectron2_repo/detectron2/modeling/poolers.py", line 208, in forward
output[inds] = pooler(x_level, pooler_fmt_boxes_level)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/detectron2_repo/detectron2/layers/roi_align.py", line 95, in forward
input, rois, self.output_size, self.spatial_scale, self.sampling_ratio, self.aligned
File "/detectron2_repo/detectron2/layers/roi_align.py", line 20, in forward
input, roi, spatial_scale, output_size[0], output_size[1], sampling_ratio, aligned
RuntimeError: CUDA error: no kernel image is available for execution on the device (ROIAlign_forward_cuda at /detectron2_repo/detectron2/layers/csrc/ROIAlign/ROIAlign_cuda.cu:361)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7f3b3fce5813 in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
frame #1: detectron2::ROIAlign_forward_cuda(at::Tensor const&, at::Tensor const&, float, int, int, int, bool) + 0xa24 (0x7f3b3e44e556 in /detectron2_repo/detectron2/_C.cpython-36m-x86_64-linux-gnu.so)
frame #2: detectron2::ROIAlign_forward(at::Tensor const&, at::Tensor const&, float, int, int, int, bool) + 0xb6 (0x7f3b3e3d21b6 in /detectron2_repo/detectron2/_C.cpython-36m-x86_64-linux-gnu.so)
frame #3: <unknown function> + 0x42cfb (0x7f3b3e3e3cfb in /detectron2_repo/detectron2/_C.cpython-36m-x86_64-linux-gnu.so)
frame #4: <unknown function> + 0x3bfe0 (0x7f3b3e3dcfe0 in /detectron2_repo/detectron2/_C.cpython-36m-x86_64-linux-gnu.so)
frame #5: /usr/bin/python3() [0x50abc5]
frame #6: _PyEval_EvalFrameDefault + 0x449 (0x50c549 in /usr/bin/python3)
frame #7: /usr/bin/python3() [0x5081d5]
frame #8: /usr/bin/python3() [0x58952b]
frame #9: PyObject_Call + 0x3e (0x5a04ce in /usr/bin/python3)
frame #10: THPFunction_apply(_object*, _object*) + 0xa4f (0x7f3b8b37e4af in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #11: /usr/bin/python3() [0x50a84f]
frame #12: _PyEval_EvalFrameDefault + 0x449 (0x50c549 in /usr/bin/python3)
frame #13: _PyFunction_FastCallDict + 0xf5 (0x5093e5 in /usr/bin/python3)
frame #14: /usr/bin/python3() [0x5951c1]
frame #15: PyObject_Call + 0x3e (0x5a04ce in /usr/bin/python3)
frame #16: _PyEval_EvalFrameDefault + 0x17f5 (0x50d8f5 in /usr/bin/python3)
frame #17: /usr/bin/python3() [0x5081d5]
frame #18: _PyFunction_FastCallDict + 0x2e2 (0x5095d2 in /usr/bin/python3)
frame #19: /usr/bin/python3() [0x5951c1]
frame #20: /usr/bin/python3() [0x54ac01]
frame #21: _PyObject_FastCallKeywords + 0x19c (0x5aa69c in /usr/bin/python3)
frame #22: /usr/bin/python3() [0x50ab53]
frame #23: _PyEval_EvalFrameDefault + 0x449 (0x50c549 in /usr/bin/python3)
frame #24: _PyFunction_FastCallDict + 0xf5 (0x5093e5 in /usr/bin/python3)
frame #25: /usr/bin/python3() [0x5951c1]
frame #26: PyObject_Call + 0x3e (0x5a04ce in /usr/bin/python3)
frame #27: _PyEval_EvalFrameDefault + 0x17f5 (0x50d8f5 in /usr/bin/python3)
frame #28: /usr/bin/python3() [0x5081d5]
frame #29: _PyFunction_FastCallDict + 0x2e2 (0x5095d2 in /usr/bin/python3)
frame #30: /usr/bin/python3() [0x5951c1]
frame #31: /usr/bin/python3() [0x54ac01]
frame #32: _PyObject_FastCallKeywords + 0x19c (0x5aa69c in /usr/bin/python3)
frame #33: /usr/bin/python3() [0x50ab53]
frame #34: _PyEval_EvalFrameDefault + 0x449 (0x50c549 in /usr/bin/python3)
frame #35: /usr/bin/python3() [0x509ce8]
frame #36: /usr/bin/python3() [0x50aa1d]
frame #37: _PyEval_EvalFrameDefault + 0x449 (0x50c549 in /usr/bin/python3)
frame #38: /usr/bin/python3() [0x5081d5]
frame #39: _PyFunction_FastCallDict + 0x2e2 (0x5095d2 in /usr/bin/python3)
frame #40: /usr/bin/python3() [0x5951c1]
frame #41: PyObject_Call + 0x3e (0x5a04ce in /usr/bin/python3)
frame #42: _PyEval_EvalFrameDefault + 0x17f5 (0x50d8f5 in /usr/bin/python3)
frame #43: /usr/bin/python3() [0x5081d5]
frame #44: _PyFunction_FastCallDict + 0x2e2 (0x5095d2 in /usr/bin/python3)
frame #45: /usr/bin/python3() [0x5951c1]
frame #46: /usr/bin/python3() [0x54ac01]
frame #47: _PyObject_FastCallKeywords + 0x19c (0x5aa69c in /usr/bin/python3)
frame #48: /usr/bin/python3() [0x50ab53]
frame #49: _PyEval_EvalFrameDefault + 0x449 (0x50c549 in /usr/bin/python3)
frame #50: /usr/bin/python3() [0x5081d5]
frame #51: _PyFunction_FastCallDict + 0x2e2 (0x5095d2 in /usr/bin/python3)
frame #52: /usr/bin/python3() [0x5951c1]
frame #53: PyObject_Call + 0x3e (0x5a04ce in /usr/bin/python3)
frame #54: _PyEval_EvalFrameDefault + 0x17f5 (0x50d8f5 in /usr/bin/python3)
frame #55: /usr/bin/python3() [0x5081d5]
frame #56: _PyFunction_FastCallDict + 0x2e2 (0x5095d2 in /usr/bin/python3)
frame #57: /usr/bin/python3() [0x5951c1]
frame #58: /usr/bin/python3() [0x54ac01]
frame #59: _PyObject_FastCallKeywords + 0x19c (0x5aa69c in /usr/bin/python3)
frame #60: /usr/bin/python3() [0x50ab53]
frame #61: _PyEval_EvalFrameDefault + 0x449 (0x50c549 in /usr/bin/python3)
frame #62: /usr/bin/python3() [0x509ce8]
frame #63: /usr/bin/python3() [0x50aa1d]
Process finished with exit code 1
Environment
A Docker container running on an AWS Deep Learning AMI instance.
My Dockerfile is a modified version of the Dockerfile provided in the repo:
FROM nvidia/cuda:10.1-cudnn7-devel
# To use this Dockerfile:
# 1. `nvidia-docker build -t detectron2:v0 .`
# 2. `nvidia-docker run -it --name detectron2 detectron2:v0`
################### env and args #################
ENV DEBIAN_FRONTEND noninteractive
ARG user
ARG password
################# following are from detectron official repo ############
RUN apt-get update && apt-get install -y \
libpng-dev libjpeg-dev python3-opencv ca-certificates \
python3-dev build-essential pkg-config git curl wget automake libtool && \
rm -rf /var/lib/apt/lists/*
RUN curl -fSsL -O https://bootstrap.pypa.io/get-pip.py && \
python3 get-pip.py && \
rm get-pip.py
# install dependencies
# See https://pytorch.org/ for other options if you use a different version of CUDA
# old version pytorch
# RUN pip install torch==1.2.0 torchvision==0.4.0 -f https://download.pytorch.org/whl/torch_stable.html
RUN pip install torch torchvision cython \
'git+https://github.com/facebookresearch/fvcore'
RUN pip install 'git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI'
# install detectron2
RUN git clone https://github.com/facebookresearch/detectron2 /detectron2_repo
ENV FORCE_CUDA="1"
ENV TORCH_CUDA_ARCH_LIST="Maxwell;Maxwell+Tegra;Pascal;Volta;Turing"
RUN pip install -e /detectron2_repo
# install openssh server
RUN apt-get update && apt-get install -y openssh-server
RUN mkdir /var/run/sshd
RUN echo "$user:$password" | chpasswd
RUN sed -i 's/PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config
RUN echo "PermitRootLogin yes" >> /etc/ssh/sshd_config
#RUN echo "prohibit-password/PermitRootLogin yes" >> /etc/ssh/sshd_config
#RUN echo "PubkeyAuthentication yes" >> /etc/ssh/sshd_config
# SSH login fix. Otherwise user is kicked off after login
RUN sed 's@session\s*required\s*pam_loginuid.so@session optional pam_loginuid.so@g' -i /etc/pam.d/sshd
ENV NOTVISIBLE "in users profile"
RUN echo "export VISIBLE=now" >> /etc/profile
RUN apt-get update && apt-get install -y tmux
#install extra requirements
COPY requirements.txt requirements.txt
RUN pip3 install -r requirements.txt
# ready to go!
WORKDIR /detectron2_repo
EXPOSE 22
EXPOSE 6006
EXPOSE 8888
EXPOSE 5000
CMD ["/usr/sbin/sshd", "-D"]
Top GitHub Comments
@ruodingt your Dockerfile specifies:
ENV TORCH_CUDA_ARCH_LIST="Maxwell;Maxwell+Tegra;Pascal;Volta;Turing"
However, your graphics card is: GPU 0: Tesla K80,
which is Kepler. Did you try to specify something like: ENV TORCH_CUDA_ARCH_LIST="Kepler;Kepler+Tesla;Maxwell;Maxwell+Tegra;Pascal;Volta;Turing"?
@vkhalidov @ppwwyyxx Thank you. It works after I changed the ENV.
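For anyone who hits the same error: before rebuilding, it helps to confirm which compute capability the GPU actually reports, so that TORCH_CUDA_ARCH_LIST covers it. A minimal check, assuming PyTorch is installed with CUDA support:

import torch

# Print the compute capability of every visible GPU.
# A Tesla K80 reports (3, 7), i.e. Kepler, which the original arch list did not cover.
for i in range(torch.cuda.device_count()):
    name = torch.cuda.get_device_name(i)
    major, minor = torch.cuda.get_device_capability(i)
    print(f"GPU {i}: {name}, compute capability {major}.{minor}")

With the architecture confirmed, the Dockerfile line can be extended to ENV TORCH_CUDA_ARCH_LIST="Kepler;Kepler+Tesla;Maxwell;Maxwell+Tegra;Pascal;Volta;Turing" and detectron2 reinstalled so its CUDA kernels are compiled for that architecture.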