Multi GPU training error
Hi, while using multiple GPUs for training I get this error:
File "/workspace/TResNet/src/models/tresnet/layers/anti_aliasing.py", line 40, in __call__ return F.conv2d(input_pad, self.filt, stride=2, padding=0, groups=input.shape) RuntimeError: Assertion `THCTensor_(checkGPU)(state, 3, input, output, weight)' failed. Some of weight/gradient/input tensors are located on different GPUs. Please move them to a single one. at /tmp/pip-r eq-build-cms73_uj/aten/src/THCUNN/generic/SpatialDepthwiseConvolution.cu:19
However, single-GPU training with CUDA_VISIBLE_DEVICES=0 set before my training script works fine; I can see the losses going down over iterations.
Can you help with this?
Top GitHub Comments
I added an option, --remove_aa_jit. Run with it; it should work for you.
As I said before, TResNet fully supports multi-GPU training; I trained on ImageNet with 8x V100. Your script is not well designed in terms of distributed training: models should be defined after(!) you call `torch.cuda.set_device(rank)`, not before. If you insist on the opposite order, use the --remove_aa_jit flag.
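To illustrate that ordering, here is a minimal sketch assuming a standard torch.distributed launch; `build_model` is a hypothetical placeholder, not part of the repo:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # Rank comes from the launcher (e.g. torchrun / torch.distributed.launch).
    rank = int(os.environ.get('LOCAL_RANK', 0))
    dist.init_process_group(backend='nccl')

    # Bind this process to its GPU FIRST ...
    torch.cuda.set_device(rank)

    # ... and only THEN build the model, so tensors the model creates at
    # construction time (such as the anti-aliasing filter) land on that GPU.
    model = build_model().cuda()        # build_model() is hypothetical
    model = DDP(model, device_ids=[rank])

if __name__ == '__main__':
    main()
```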
I also added a general tips section for working with Inplace-ABN: https://github.com/mrT23/TResNet/blob/master/INPLACE_ABN_TIPS.md

All the best
I tried modifying the JIT Downsample to read the rank from os.environ, and got this:
```
RuntimeError: attribute lookup is not defined on python value of type '_Environ':
  File "/workspace/TResNet/src/models/tresnet/layers/anti_aliasing.py", line 35
    filt = (a[:, None] * a[None, :]).clone().detach()
    filt = filt / torch.sum(filt)
    self.filt = filt[None, None, :, :].repeat((self.channels, 1, 1, 1)).cuda(device=int(os.environ.get('RANK', 0))).half()
                                                                                        ~~~~~~~~~~~~~~ <--- HERE
```
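That failure is TorchScript refusing to compile the os.environ lookup: inside scripted code, `os.environ` is an opaque Python `_Environ` object. One workaround sketch (an assumption, not the repo's code): once `torch.cuda.set_device(rank)` has been called as described above, a bare `.cuda()` already targets the right GPU, so no environment lookup is needed in the scripted path.

```python
import torch

# Hypothetical rework of the filter setup from anti_aliasing.py. Because
# torch.cuda.set_device(rank) has already been called in this process,
# .cuda() with no device argument picks the current GPU, and TorchScript
# never has to compile an os.environ access.
def make_filter(channels: int) -> torch.Tensor:
    a = torch.tensor([1., 2., 1.])
    filt = (a[:, None] * a[None, :]).clone().detach()
    filt = filt / torch.sum(filt)
    return filt[None, None, :, :].repeat((channels, 1, 1, 1)).cuda().half()
```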
I also tried modifying the non-JIT Downsample to account for RANK, but that gave me the same original error:

```
Some of weight/gradient/input tensors are located on different GPUs. Please move them to a single one. at /tmp/pip-req-build-cms73_uj/aten/src/THCUNN/generic/SpatialDepthwiseConvolution.cu:19
```
Do you have any suggestions for writing a custom grad function that accounts for multiple GPUs?
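For the non-JIT path, the usual cause of this error is creating `self.filt` with `.cuda()` at construction time, which pins it to one GPU; registering it as a module buffer instead lets `.to(device)` and DataParallel replication move it with the model, and no custom autograd function is needed, since the stock conv2d backward runs per device. A minimal sketch of that approach (filter coefficients assumed; not the repo's exact code):

```python
import torch
import torch.nn.functional as F

class Downsample(torch.nn.Module):
    """Anti-aliasing blur + stride-2 downsample, with the filter kept as a
    registered buffer so each replica gets its own copy on the right GPU."""
    def __init__(self, channels: int):
        super().__init__()
        a = torch.tensor([1., 2., 1.])   # assumed binomial blur coefficients
        filt = a[:, None] * a[None, :]
        filt = filt / filt.sum()
        # register_buffer, not .cuda(): device placement is then handled by
        # .to(device) / replica creation, never hard-coded at construction.
        self.register_buffer('filt', filt[None, None, :, :].repeat(channels, 1, 1, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = F.pad(x, (1, 1, 1, 1), mode='reflect')
        # Depthwise convolution: one filter copy per input channel.
        return F.conv2d(x, self.filt, stride=2, padding=0, groups=x.shape[1])
```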