AssertionError: Torch not compiled with CUDA enabled - CPU/LINUX TRAINING ERROR
Hello,
I recently created a venv and downloaded pytorch in the following way (cpu only):
pip install torch==1.5.1+cpu torchvision==0.6.1+cpu -f https://download.pytorch.org/whl/torch_stable.html
Then I installed the pre-built detectron2 for Linux and CPU with the following (all other prerequisites are also installed):
python -m pip install detectron2 -f https://dl.fbaipublicfiles.com/detectron2/wheels/cpu/torch1.5/index.html
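A quick way to confirm which PyTorch build ended up in the venv is sketched below. This is only an illustrative check, not part of the original report; the helper name `describe_torch_build` is hypothetical, and it returns None if torch is not importable.

```python
import importlib.util


def describe_torch_build():
    """Return a dict describing the installed torch build, or None if torch is absent."""
    if importlib.util.find_spec("torch") is None:
        return None
    import torch

    return {
        "version": torch.__version__,  # e.g. "1.5.1+cpu" for the CPU wheel
        "cuda_build": torch.version.cuda,  # None when torch was built without CUDA
        "cuda_available": torch.cuda.is_available(),
    }


info = describe_torch_build()
print(info)
```

On a CPU-only wheel, `cuda_build` should be None and `cuda_available` should be False, so any code path that calls into `torch.cuda` unconditionally can be expected to fail.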
Instructions To Reproduce the Issue:
I am training on a custom dataset, and the trainer.train() line is seeing the following error:
AssertionError: Torch not compiled with CUDA enabled
- Here is my code to get there:
# Some basic setup:
# Setup detectron2 logger
import detectron2
from detectron2.utils.logger import setup_logger
setup_logger()
# import some common libraries
import numpy as np
import cv2
import os
import random
from matplotlib import pyplot as plt
# import some common detectron2 utilities
from detectron2 import model_zoo
from detectron2.engine import DefaultPredictor
from detectron2.config import get_cfg
from detectron2.utils.visualizer import Visualizer
from detectron2.data import MetadataCatalog
from detectron2.structures import BoxMode
from detectron2.data.datasets import register_coco_instances
from detectron2.data.catalog import DatasetCatalog
from detectron2.engine import HookBase
register_coco_instances("boat_train", {}, "/home/Documents/train/instances.json", "/home/Documents/train")
register_coco_instances("boat_val", {}, "/home/Documents/val/instances.json", "/home/Documents/val")
from detectron2.engine import DefaultTrainer
from detectron2.engine import TrainerBase
#Specify Model yaml & weights to grab
cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_50_C4_1x.yaml"))
cfg.MODEL.WEIGHTS = "/home/svidelock/source/Detectron/model_final_721ade.pkl" # Let training initialize from model zoo
#Specify DIR for output, if not specified will create "output" DIR
# cfg.OUTPUT_DIR = '/home/svidelock/source/Detectron/HyperParamDetectron/output2/'
#Specify Datasets
cfg.DATASETS.TRAIN = ("boat_train",) #names of the registered datasets used for training
cfg.DATASETS.TEST = ("boat_val",) #validation set
#Hyperparams
cfg.SOLVER.IMS_PER_BATCH = 2 #means that in 1 iteration the model sees 2 images
cfg.SOLVER.BASE_LR = 0.02 #learning rate
#Some other configurable items
cfg.DATALOADER.NUM_WORKERS = 2 # depends on hardware config ...
# cfg.SOLVER.WARMUP_ITERS = 1000 #number of warmup iterations before the base learning rate is reached
# cfg.SOLVER.STEPS = (1000, 1500) #iteration numbers at which the learning rate is decayed
# cfg.SOLVER.GAMMA = 0.001 # factor the learning rate is multiplied by at each step in STEPS
cfg.SOLVER.MAX_ITER = 500 # Model will stop after this many iterations
cfg.MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE = 128 #look into
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1 # only has one class (boat)
#specify if CPU Training
cfg.MODEL.DEVICE = 'cpu' # CPU training
#Checkpoint/ValidationSet Params
cfg.TEST.EVAL_PERIOD = 20 # Tests validation set every 20 iterations
cfg.SOLVER.CHECKPOINT_PERIOD = cfg.TEST.EVAL_PERIOD #saves a checkpoint model each time we validate
os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
- full logs you observed:
[07/17 13:16:51 d2.engine.train_loop]: Starting training from iteration 0
ERROR [07/17 13:17:05 d2.engine.train_loop]: Exception during training:
Traceback (most recent call last):
File "/home/svidelock/unineunet-test2/lib/python3.6/site-packages/detectron2/engine/train_loop.py", line 130, in train
self.run_step()
File "/home/svidelock/unineunet-test2/lib/python3.6/site-packages/detectron2/engine/train_loop.py", line 227, in run_step
with torch.cuda.stream(torch.cuda.Stream()):
File "/home/svidelock/unineunet-test2/lib/python3.6/site-packages/torch/cuda/streams.py", line 21, in __new__
with torch.cuda.device(device):
File "/home/svidelock/unineunet-test2/lib/python3.6/site-packages/torch/cuda/__init__.py", line 201, in __init__
self.idx = _get_device_index(device, optional=True)
File "/home/svidelock/unineunet-test2/lib/python3.6/site-packages/torch/cuda/_utils.py", line 31, in _get_device_index
return torch.cuda.current_device()
File "/home/svidelock/unineunet-test2/lib/python3.6/site-packages/torch/cuda/__init__.py", line 330, in current_device
_lazy_init()
File "/home/svidelock/unineunet-test2/lib/python3.6/site-packages/torch/cuda/__init__.py", line 149, in _lazy_init
_check_driver()
File "/home/svidelock/unineunet-test2/lib/python3.6/site-packages/torch/cuda/__init__.py", line 47, in _check_driver
raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled
[07/17 13:17:05 d2.engine.hooks]: Total training time: 0:00:13 (0:00:00 on hooks)
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-7-7c9d1293789c> in <module>
4 # trainer.register_hooks([early_stoping])
5 trainer.resume_or_load(resume=False)
----> 6 trainer.train()
~/unineunet-test2/lib/python3.6/site-packages/detectron2/engine/defaults.py in train(self)
396 OrderedDict of results, if evaluation is enabled. Otherwise None.
397 """
--> 398 super().train(self.start_iter, self.max_iter)
399 if len(self.cfg.TEST.EXPECTED_RESULTS) and comm.is_main_process():
400 assert hasattr(
~/unineunet-test2/lib/python3.6/site-packages/detectron2/engine/train_loop.py in train(self, start_iter, max_iter)
128 for self.iter in range(start_iter, max_iter):
129 self.before_step()
--> 130 self.run_step()
131 self.after_step()
132 except Exception:
~/unineunet-test2/lib/python3.6/site-packages/detectron2/engine/train_loop.py in run_step(self)
225
226 # use a new stream so the ops don't wait for DDP
--> 227 with torch.cuda.stream(torch.cuda.Stream()):
228 metrics_dict = loss_dict
229 metrics_dict["data_time"] = data_time
~/unineunet-test2/lib/python3.6/site-packages/torch/cuda/streams.py in __new__(cls, device, priority, **kwargs)
19
20 def __new__(cls, device=None, priority=0, **kwargs):
---> 21 with torch.cuda.device(device):
22 return super(Stream, cls).__new__(cls, priority=priority, **kwargs)
23
~/unineunet-test2/lib/python3.6/site-packages/torch/cuda/__init__.py in __init__(self, device)
199
200 def __init__(self, device):
--> 201 self.idx = _get_device_index(device, optional=True)
202 self.prev_idx = -1
203
~/unineunet-test2/lib/python3.6/site-packages/torch/cuda/_utils.py in _get_device_index(device, optional)
29 if optional:
30 # default cuda device index
---> 31 return torch.cuda.current_device()
32 else:
33 raise ValueError('Expected a cuda device with a specified index '
~/unineunet-test2/lib/python3.6/site-packages/torch/cuda/__init__.py in current_device()
328 def current_device():
329 r"""Returns the index of a currently selected device."""
--> 330 _lazy_init()
331 return torch._C._cuda_getDevice()
332
~/unineunet-test2/lib/python3.6/site-packages/torch/cuda/__init__.py in _lazy_init()
147 raise RuntimeError(
148 "Cannot re-initialize CUDA in forked subprocess. " + msg)
--> 149 _check_driver()
150 if _cudart is None:
151 raise AssertionError(
~/unineunet-test2/lib/python3.6/site-packages/torch/cuda/__init__.py in _check_driver()
45 def _check_driver():
46 if not hasattr(torch._C, '_cuda_isDriverSufficient'):
---> 47 raise AssertionError("Torch not compiled with CUDA enabled")
48 if not torch._C._cuda_isDriverSufficient():
49 if torch._C._cuda_getDriverVersion() == 0:
AssertionError: Torch not compiled with CUDA enabled
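The traceback shows run_step entering `torch.cuda.stream(torch.cuda.Stream())` regardless of the configured device, which is where a CPU-only build raises. A minimal sketch of one way such a call could be guarded follows. It is an assumption for illustration, not the library's actual code or fix; `null_context` and `metrics_stream` are hypothetical helper names.

```python
import contextlib


@contextlib.contextmanager
def null_context():
    """No-op context manager (contextlib.nullcontext requires Python 3.7+)."""
    yield None


def metrics_stream(device_type):
    """Return a side CUDA stream context on GPU, or a no-op context on CPU."""
    if device_type == "cuda":
        import torch  # imported lazily so the CPU path never touches torch.cuda

        return torch.cuda.stream(torch.cuda.Stream())
    return null_context()


with metrics_stream("cpu") as stream:
    # On CPU this body runs directly; no CUDA call is made.
    metrics = {"data_time": 0.0}
```

The idea is simply that the device type decides whether a CUDA stream is created at all, so a CPU run never reaches `torch.cuda._lazy_init()`.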
Expected behavior:
The model should train. I created a virtual environment the same way about a week ago and had no issues, but when I create a new virtual environment (with all CPU installs, and with cpu specified in the config), I receive the above error.
Environment:
sys.platform linux
Python 3.6.9 (default, Apr 18 2020, 01:56:04) [GCC 8.4.0]
numpy 1.19.0
detectron2 0.2 @/home/svidelock/unineunet-test2/lib/python3.6/site-packages/detectron2
Compiler GCC 7.3
CUDA compiler not available
DETECTRON2_ENV_MODULE <not set>
PyTorch 1.5.1+cpu @/home/svidelock/unineunet-test2/lib/python3.6/site-packages/torch
PyTorch debug build False
GPU available False
Pillow 7.2.0
torchvision 0.6.1+cpu @/home/svidelock/unineunet-test2/lib/python3.6/site-packages/torchvision
fvcore 0.1.1.post20200716
cv2 4.3.0
--------------------- ----------------------------------------------------------------------------------
PyTorch built with:
- GCC 7.3
- C++ Version: 201402
- Intel(R) Math Kernel Library Version 2019.0.5 Product Build 20190808 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v0.21.1 (Git Hash 7d2fd500bc78936d1d648ca713b901012f470dbc)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- NNPACK is enabled
- CPU capability usage: AVX2
- Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_INTERNAL_THREADPOOL_IMPL -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=0, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=OFF, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF,
You can solve this issue by passing in the MODEL.DEVICE parameter for cfg.
For example:
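The example from the reply is not preserved on this page. As a rough sketch of how a device value for cfg.MODEL.DEVICE might be chosen, assuming the usual "cpu" / "cuda" strings (the helper `resolve_device` is hypothetical and not part of detectron2):

```python
import importlib.util


def resolve_device(requested="cuda"):
    """Return "cuda" only if torch is importable and reports a usable GPU, else "cpu"."""
    if requested == "cpu":
        return "cpu"
    if importlib.util.find_spec("torch") is None:
        return "cpu"
    import torch

    return "cuda" if torch.cuda.is_available() else "cpu"


device = resolve_device("cpu")
print(device)

# With a detectron2 config object this would be applied as:
# cfg.MODEL.DEVICE = device
```

Note that the script in the question already sets cfg.MODEL.DEVICE to 'cpu', so this setting alone may not be sufficient for the detectron2 0.2 wheel shown in the logs.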
@Smikha good job, it works