cuda error caused by negative tensor value

Hi, thanks for your nice code again! But I got a weird error when running your code. The error info is below:

THCudaCheck FAIL file=/opt/conda/conTHC/generic/THCTensorCopy.c line=20 error=59 : device-side assert triggered
Traceback (most recent call last):
  File "train.py", line 229, in <module>
    main()
  File "/data1/jayzjwang/opt/anaconda3/envs/deeplab/lib/python3.5/site-packages/click/core.py", line
    return self.main(*args, **kwargs)
  File "/data1/jayzjwang/opt/anaconda3/envs/deeplab/lib/python3.5/site-packages/click/core.py", line
    rv = self.invoke(ctx)
  File "/data1/jayzjwang/opt/anaconda3/envs/deeplab/lib/python3.5/site-packages/click/core.py", line
    return ctx.invoke(self.callback, **ctx.params)
  File "/data1/jayzjwang/opt/anaconda3/envs/deeplab/lib/python3.5/site-packages/click/core.py", line
    return callback(*args, **kwargs)
  File "train.py", line 183, in main
    target_ = target_.to(device)
RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_TensorCopy.c:20
Exception ignored in: <bound method _DataLoaderIter.__del__ of <torch.utils.data.dataloader._DataLoad
Traceback (most recent call last):
  File "/data1/jayzjwang/opt/anaconda3/envs/deeplab/lib/python3.5/site-packages/torch/utils/data/data
    self._shutdown_workers()
  File "/data1/jayzjwang/opt/anaconda3/envs/deeplab/lib/python3.5/site-packages/torch/utils/data/data
    self.worker_result_queue.get()
  File "/data1/jayzjwang/opt/anaconda3/envs/deeplab/lib/python3.5/multiprocessing/queues.py", line 33
    return ForkingPickler.loads(res)
  File "/data1/jayzjwang/opt/anaconda3/envs/deeplab/lib/python3.5/site-packages/torch/multiprocessinge_fd
    fd = df.detach()
  File "/data1/jayzjwang/opt/anaconda3/envs/deeplab/lib/python3.5/multiprocessing/resource_sharer.py"
    with _resource_sharer.get_connection(self._id) as conn:
  File "/data1/jayzjwang/opt/anaconda3/envs/deeplab/lib/python3.5/multiprocessing/resource_sharer.py"
    c = Client(address, authkey=process.current_process().authkey)
  File "/data1/jayzjwang/opt/anaconda3/envs/deeplab/lib/python3.5/multiprocessing/connection.py", lin
    c = SocketClient(address)
  File "/data1/jayzjwang/opt/anaconda3/envs/deeplab/lib/python3.5/multiprocessing/connection.py", lin
    s.connect(address)
ConnectionRefusedError: [Errno 111] Connection refused

It may be caused by negative values in the label tensor, which appear when ignore_label is set to -1 while preprocessing the label map (see the issue torch/cutorch#708). After I set the ignore label to 255 (I made minor changes to your code to run it on voc12), it works fine.
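A check like the following, run on the CPU before the labels are moved to the GPU, could catch this kind of problem early. It is only a minimal sketch: find_invalid_labels and remap_label are hypothetical helpers, not part of the repository, and they work on plain nested lists so that the example has no dependencies. The 21 classes (indices 0 to 20) and the ignore label 255 are taken from this thread.

```python
from collections import Counter


def find_invalid_labels(label_map, num_classes, ignore_label):
    """Count values that are neither a class index nor the ignore label."""
    invalid = Counter()
    for row in label_map:
        for value in row:
            if value != ignore_label and not (0 <= value < num_classes):
                invalid[value] += 1
    return dict(invalid)


def remap_label(label_map, old, new):
    """Return a copy of label_map with every `old` value replaced by `new`."""
    return [[new if value == old else value for value in row] for row in label_map]


# Toy 2x3 label map that still uses -1 as the ignore label.
label_map = [[0, 20, -1], [3, -1, 7]]

# With classes 0..20 and ignore label 255, the two -1 entries are reported.
print(find_invalid_labels(label_map, num_classes=21, ignore_label=255))

# After remapping -1 to 255, nothing is reported.
fixed = remap_label(label_map, old=-1, new=255)
print(find_invalid_labels(fixed, num_classes=21, ignore_label=255))
```

With real label maps stored as NumPy arrays or tensors, the same check would normally be written with vectorized comparisons instead of Python loops.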

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

1 reaction
kazuto1011 commented, Jun 28, 2018

Anyway, I fixed config/voc12.yaml. Thank you for the information about synchronous batch normalization! I didn’t know that. I think that’s crucial but, to be honest, have no idea how to solve your problem now. Let me close this issue because the bn problem is out of scope. I will try to fix the problem, any PRs are welcome.

0 reactions
LindaSt commented, Dec 2, 2019

I have no idea why target_ = target_.to(device) caused the runtime error. Did the target tensor contain a negative value? My guess is that the label 255 in “target_” may be rejected by the loss criterion (the next line), which assumes indices 0 to 20 for the classes and another index, -1, for the “ignore label”. And I think it can be solved by changing the “ignore label” from -1 to 255, as you said. Did you modify your code during training? (I have doubts about the stack trace, sorry.)

Why can the tensors not contain any negative values?
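One way to look at this question: the target values are used as class indices by the loss, so every value has to be either a valid class index or the single value the criterion was told to skip. PyTorch's torch.nn.CrossEntropyLoss takes an ignore_index argument for this purpose. The sketch below illustrates the idea in plain Python. It is not the PyTorch implementation; the cross_entropy function, its ValueError, and the 0.0 returned when every target is ignored are choices made for this example only.

```python
import math


def cross_entropy(logits, targets, ignore_index):
    """Mean negative log-likelihood over targets that are not ignore_index.

    logits: list of per-sample score lists (one score per class).
    targets: list of integer class indices, same length as logits.
    """
    total, counted = 0.0, 0
    for scores, target in zip(logits, targets):
        if target == ignore_index:
            continue  # ignored samples contribute nothing to the loss
        if not 0 <= target < len(scores):
            # On a GPU, an out-of-range index like this is the kind of
            # input that can end in a device-side assert.
            raise ValueError(f"target {target} is outside [0, {len(scores) - 1}]")
        # Log-sum-exp with the maximum subtracted for numerical stability.
        peak = max(scores)
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_norm - scores[target]
        counted += 1
    return total / counted if counted else 0.0


logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

# The second sample carries the ignore label 255 and is skipped.
print(cross_entropy(logits, [0, 255], ignore_index=255))

# A leftover -1 matches neither a class index nor the ignore label.
try:
    cross_entropy(logits, [0, -1], ignore_index=255)
except ValueError as err:
    print("rejected:", err)
```

The point of the sketch is that the ignore value in the label map and the one given to the criterion have to agree; either -1 or 255 can work as long as both sides use the same value.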

