Using a global in utils.make_positions() causes issues in multi-GPU setups
I am importing fairseq models into the allennlp framework and using them there. I ran into an issue in a multi-GPU setting.
```python
def make_positions(tensor, padding_idx, left_pad, onnx_trace=False):
    """Replace non-padding symbols with their position numbers.
    Position numbers begin at padding_idx+1.
    Padding symbols are ignored, but it is necessary to specify whether padding
    is added on the left side (left_pad=True) or right side (left_pad=False).
    """
    if onnx_trace:
        range_buf = torch._dim_arange(like=tensor, dim=1) + padding_idx + 1
        mask = tensor.ne(padding_idx)
        positions = range_buf.expand_as(tensor)
        if left_pad:
            positions = positions - mask.size(1) + mask.long().sum(dim=1).unsqueeze(1)
        return positions * mask.long() + padding_idx * (1 - mask.long())

    max_pos = padding_idx + 1 + tensor.size(1)
    if not hasattr(make_positions, 'range_buf'):
        make_positions.range_buf = tensor.new()
    make_positions.range_buf = make_positions.range_buf.type_as(tensor)
    if make_positions.range_buf.numel() < max_pos:
        torch.arange(padding_idx + 1, max_pos, out=make_positions.range_buf)
    mask = tensor.ne(padding_idx)
    positions = make_positions.range_buf[:tensor.size(1)].expand_as(tensor)
    if left_pad:
        positions = positions - mask.size(1) + mask.long().sum(dim=1).unsqueeze(1)
    return tensor.clone().masked_scatter_(mask, positions[mask])
```
`make_positions.range_buf` is effectively a global: it is cached as a function attribute on whichever GPU first created it, which causes this error to be thrown when the function is later called with tensors on another device:

RuntimeError: arguments are located on different GPUs at /opt/conda/conda-bld/pytorch_1535491974311/work/aten/src/THC/generated/…/generic/THCTensorMasked.cu:40
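For concreteness, here is a minimal sketch of the failure mode (assuming at least two GPUs and that the function above is importable as `fairseq.utils.make_positions`; the token values are arbitrary):

```python
import torch
from fairseq.utils import make_positions  # assumed import path for the code above

pad = 1

# First call: range_buf is created and cached on cuda:0 as a function attribute.
ids0 = torch.tensor([[pad, 5, 6, 7]], device='cuda:0')
make_positions(ids0, padding_idx=pad, left_pad=False)

# A later call from a replica whose tensors live on cuda:1 still sees the
# cuda:0 buffer, so positions[mask] / masked_scatter_ mix devices and raise
# "arguments are located on different GPUs".
ids1 = torch.tensor([[pad, 5, 6, 7]], device='cuda:1')
make_positions(ids1, padding_idx=pad, left_pad=False)
```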
I successfully fixed this issue by creating the range tensor on every call instead of caching it.
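Roughly, the workaround looks like this (a sketch of that change, not an exact patch; it gives up the cached buffer and builds the range on the input's own device on every call):

```python
import torch

def make_positions(tensor, padding_idx, left_pad):
    """Replace non-padding symbols with their position numbers,
    starting at padding_idx + 1 (same semantics as the original)."""
    max_pos = padding_idx + 1 + tensor.size(1)
    # Allocate the range on tensor.device every call instead of caching it as a
    # function attribute, so replicas on different GPUs never share state.
    positions = torch.arange(
        padding_idx + 1, max_pos, dtype=tensor.dtype, device=tensor.device
    ).expand_as(tensor)
    mask = tensor.ne(padding_idx)
    if left_pad:
        positions = positions - mask.size(1) + mask.long().sum(dim=1).unsqueeze(1)
    return tensor.clone().masked_scatter_(mask, positions[mask])
```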
My question is: is this issue bypassed by the way fairseq does multi-GPU training, so the problem is really with how allennlp does multi-GPU? Or should it also be fixed here?
Issue Analytics
- Created 5 years ago
- Comments: 5 (4 by maintainers)
Top GitHub Comments
The advantage is speed. Due to the global interpreter lock in Python, feeding multiple GPUs from a single Python process often causes Python itself to be the bottleneck. Using separate processes avoids this.
Notably, this is also the same setup that one uses for DistributedDataParallel (i.e., one process per GPU), so once the code works for distributed training it should be trivial to do multiprocessing on a single machine.
In fairseq we launch distinct processes for each GPU and set the default CUDA device on each process: https://github.com/pytorch/fairseq/blob/7e60d45b017f6d08c607f57b9c4f6aa2ded08c97/train.py#L31
So, it should work as expected in fairseq, although I can see why it might not work with multiple GPUs on a single process 😃
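For reference, a minimal sketch of that one-process-per-GPU pattern (the worker function, port, and argument names here are illustrative, not fairseq's actual launcher):

```python
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(device_id, world_size):
    # Bind this process to its own GPU before building the model, so any
    # tensors cached later (like range_buf above) land on the right device.
    torch.cuda.set_device(device_id)
    dist.init_process_group(
        backend='nccl',
        init_method='tcp://localhost:23456',  # illustrative address/port
        rank=device_id,
        world_size=world_size,
    )
    # ... build the model on the default device, then wrap it, e.g. with
    # torch.nn.parallel.DistributedDataParallel(model, device_ids=[device_id])

if __name__ == '__main__':
    n_gpus = torch.cuda.device_count()
    mp.spawn(worker, args=(n_gpus,), nprocs=n_gpus)
```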