CUDA error caused by negative tensor value
Hi, thanks for your nice code again! But I got a weird error when running your code. The error output is below:
THCudaCheck FAIL file=/opt/conda/conTHC/generic/THCTensorCopy.c line=20 error=59 : device-side assert triggered
Traceback (most recent call last):
  File "train.py", line 229, in <module>
    main()
  File "/data1/jayzjwang/opt/anaconda3/envs/deeplab/lib/python3.5/site-packages/click/core.py", line
    return self.main(*args, **kwargs)
  File "/data1/jayzjwang/opt/anaconda3/envs/deeplab/lib/python3.5/site-packages/click/core.py", line
    rv = self.invoke(ctx)
  File "/data1/jayzjwang/opt/anaconda3/envs/deeplab/lib/python3.5/site-packages/click/core.py", line
    return ctx.invoke(self.callback, **ctx.params)
  File "/data1/jayzjwang/opt/anaconda3/envs/deeplab/lib/python3.5/site-packages/click/core.py", line
    return callback(*args, **kwargs)
  File "train.py", line 183, in main
    target_ = target_.to(device)
RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_TensorCopy.c:20
Exception ignored in: <bound method _DataLoaderIter.__del__ of <torch.utils.data.dataloader._DataLoad
Traceback (most recent call last):
  File "/data1/jayzjwang/opt/anaconda3/envs/deeplab/lib/python3.5/site-packages/torch/utils/data/data
    self._shutdown_workers()
  File "/data1/jayzjwang/opt/anaconda3/envs/deeplab/lib/python3.5/site-packages/torch/utils/data/data
    self.worker_result_queue.get()
  File "/data1/jayzjwang/opt/anaconda3/envs/deeplab/lib/python3.5/multiprocessing/queues.py", line 33
    return ForkingPickler.loads(res)
  File "/data1/jayzjwang/opt/anaconda3/envs/deeplab/lib/python3.5/site-packages/torch/multiprocessinge_fd
    fd = df.detach()
  File "/data1/jayzjwang/opt/anaconda3/envs/deeplab/lib/python3.5/multiprocessing/resource_sharer.py"
    with _resource_sharer.get_connection(self._id) as conn:
  File "/data1/jayzjwang/opt/anaconda3/envs/deeplab/lib/python3.5/multiprocessing/resource_sharer.py"
    c = Client(address, authkey=process.current_process().authkey)
  File "/data1/jayzjwang/opt/anaconda3/envs/deeplab/lib/python3.5/multiprocessing/connection.py", lin
    c = SocketClient(address)
  File "/data1/jayzjwang/opt/anaconda3/envs/deeplab/lib/python3.5/multiprocessing/connection.py", lin
    s.connect(address)
ConnectionRefusedError: [Errno 111] Connection refused
According to the issue torch/cutorch#708, it may be caused by negative tensor values, which appear when ignore_label is set to -1 while preprocessing the label map. After I set the ignore label to 255 (I made minor changes to your code to run it on voc12), it works fine.
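A minimal sketch of that workaround, for anyone hitting the same assert. It assumes the label map ends up in a loss such as `torch.nn.CrossEntropyLoss`, which is not confirmed in this thread. The names `NUM_CLASSES`, `IGNORE_LABEL` and `check_labels` are hypothetical and are not taken from the repository. It runs on the CPU, so any problem surfaces as an ordinary Python error rather than a device-side assert.

```python
import torch
import torch.nn as nn

NUM_CLASSES = 21    # assumption: 20 VOC12 classes plus background
IGNORE_LABEL = 255  # the value the reporter switched to


def check_labels(target, num_classes=NUM_CLASSES, ignore_label=IGNORE_LABEL):
    """Return label values that are neither a valid class nor the ignore label."""
    values = torch.unique(target)  # sorted unique values
    out_of_range = (values < 0) | (values >= num_classes)
    return values[out_of_range & (values != ignore_label)].tolist()


# Toy label map of shape (N=1, H=2, W=3) that still uses -1 as the ignore value.
target = torch.tensor([[[0, 1, -1], [20, -1, 3]]], dtype=torch.long)
print(check_labels(target))  # the -1 entries are reported as invalid

# Remap -1 to the ignore label before the tensor reaches the loss.
remapped = target.clone()
remapped[remapped == -1] = IGNORE_LABEL

# The loss must be told about the same ignore value.
logits = torch.randn(1, NUM_CLASSES, 2, 3)
criterion = nn.CrossEntropyLoss(ignore_index=IGNORE_LABEL)
loss = criterion(logits, remapped)
```

Running a check like this on one batch before training starts is cheaper than decoding a CUDA assert, which is often reported at a later, unrelated line because GPU kernels run asynchronously.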
Issue Analytics
- State:
- Created 5 years ago
- Comments:6 (3 by maintainers)
Top GitHub Comments
Anyway, I fixed config/voc12.yaml. Thank you for the information about synchronous batch normalization! I didn't know that. I think that's crucial but, to be honest, I have no idea how to solve your problem now. Let me close this issue because the BN problem is out of scope. I will try to fix the problem; any PRs are welcome.
Why can the tensors not contain any negative values?
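One possible answer, offered as a sketch rather than a confirmed diagnosis of this issue: in recent PyTorch releases a negative target is accepted as long as it equals the loss's `ignore_index` (the default is -100). A negative value that does not match `ignore_index` is treated as an out-of-range class index. The snippet below assumes `torch.nn.CrossEntropyLoss` and runs on the CPU, where the mismatch raises a regular exception.

```python
import torch
import torch.nn as nn

logits = torch.randn(4, 3)            # 4 samples, 3 classes
target = torch.tensor([0, 2, -1, 1])  # -1 marks a sample to ignore

# -1 is declared as the ignore value, so that sample is skipped.
loss = nn.CrossEntropyLoss(ignore_index=-1)(logits, target)

# Here the loss ignores 255, so -1 is read as a class index and rejected.
try:
    nn.CrossEntropyLoss(ignore_index=255)(logits, target)
    mismatch_error = None
except (IndexError, RuntimeError) as exc:
    mismatch_error = exc

print(loss.item(), type(mismatch_error).__name__)
```

So the value in the label map and the `ignore_index` passed to the loss have to agree; either both use -1 or both use 255.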