
Using Global in utils.make_positions() causes issues in Multi GPU


I am importing fairseq models into the allennlp framework and using them there. I ran into an issue in a multi-GPU setting.

import torch

def make_positions(tensor, padding_idx, left_pad, onnx_trace=False):
    """Replace non-padding symbols with their position numbers.
    Position numbers begin at padding_idx+1.
    Padding symbols are ignored, but it is necessary to specify whether padding
    is added on the left side (left_pad=True) or right side (left_pad=False).
    """
    if onnx_trace:
        range_buf = torch._dim_arange(like=tensor, dim=1) + padding_idx + 1
        mask = tensor.ne(padding_idx)
        positions = range_buf.expand_as(tensor)
        if left_pad:
            positions = positions - mask.size(1) + mask.long().sum(dim=1).unsqueeze(1)
        return positions * mask.long() + padding_idx * (1 - mask.long())

    max_pos = padding_idx + 1 + tensor.size(1)
    # NOTE: range_buf is cached as an attribute on the function object itself,
    # so it behaves like a process-wide global shared by every caller.
    if not hasattr(make_positions, 'range_buf'):
        make_positions.range_buf = tensor.new()
    make_positions.range_buf = make_positions.range_buf.type_as(tensor)
    if make_positions.range_buf.numel() < max_pos:
        torch.arange(padding_idx + 1, max_pos, out=make_positions.range_buf)
    mask = tensor.ne(padding_idx)
    positions = make_positions.range_buf[:tensor.size(1)].expand_as(tensor)
    if left_pad:
        positions = positions - mask.size(1) + mask.long().sum(dim=1).unsqueeze(1)
    return tensor.clone().masked_scatter_(mask, positions[mask])

Because make_positions.range_buf is effectively a global shared by every GPU replica in the same process, this error is thrown:

‘RuntimeError: arguments are located on different GPUs at /opt/conda/conda-bld/pytorch_1535491974311/work/aten/src/THC/generated/…/generic/THCTensorMasked.cu:40’.

I successfully fixed this issue by creating the range tensor every time instead of caching it.
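
For reference, here is a minimal sketch of that fix (my own variant, not the exact patch): the cached buffer is dropped and the positions are built on the input's device on every call.

import torch

def make_positions_uncached(tensor, padding_idx, left_pad):
    """Variant of make_positions that rebuilds the range tensor on each call."""
    mask = tensor.ne(padding_idx)
    # Allocating on tensor.device guarantees that positions and mask are always
    # co-located, no matter which GPU replica executes this call.
    positions = torch.arange(
        padding_idx + 1, padding_idx + 1 + tensor.size(1),
        dtype=tensor.dtype, device=tensor.device,
    ).expand_as(tensor)
    if left_pad:
        positions = positions - mask.size(1) + mask.long().sum(dim=1).unsqueeze(1)
    return tensor.clone().masked_scatter_(mask, positions[mask])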

My question is: does the way fairseq does multi-GPU training bypass this issue, so that the problem is really with how allennlp does multi-GPU? Or should it also be fixed here?

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 5 (4 by maintainers)

Top GitHub Comments

1 reaction
myleott commented, Nov 14, 2018

The advantage is speed. Due to the global interpreter lock in Python, feeding multiple GPUs from a single Python process often makes Python itself the bottleneck. Using separate processes avoids this.

Notably, this is also the same setup that one uses for DistributedDataParallel (i.e., one process per GPU), so once the code works for distributed training it should be trivial to do multiprocessing on a single machine.
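
For concreteness, here is a minimal sketch of that one-process-per-GPU pattern with DistributedDataParallel (placeholder model and rendezvous address, not fairseq's launcher):

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def train_worker(rank, world_size):
    # One process per GPU: the same worker serves single-machine
    # multiprocessing and multi-node distributed training.
    dist.init_process_group("nccl", init_method="tcp://127.0.0.1:23456",
                            rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    model = DDP(torch.nn.Linear(8, 8).cuda(), device_ids=[rank])
    # ... per-rank data loading and the usual training loop go here ...
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(train_worker, args=(world_size,), nprocs=world_size)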

1 reaction
myleott commented, Nov 14, 2018

In fairseq we launch distinct processes for each GPU and set the default CUDA device on each process: https://github.com/pytorch/fairseq/blob/7e60d45b017f6d08c607f57b9c4f6aa2ded08c97/train.py#L31

So, it should work as expected in fairseq, although I can see why it might not work with multiple GPUs on a single process 😃
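
For illustration, a minimal per-process setup in that style (the worker and its arguments are hypothetical, not fairseq's actual train.py; it assumes the make_positions definition quoted above):

import torch

def gpu_worker(device_id, tokens_cpu, padding_idx):
    # Pin this process to one GPU before any CUDA work, mirroring the
    # set-default-device step described above.
    torch.cuda.set_device(device_id)
    tokens = tokens_cpu.cuda()  # lands on cuda:<device_id> after set_device
    # Each process has its own interpreter, hence its own copy of the cached
    # make_positions.range_buf, created on this process's GPU -- so the buffer
    # and the inputs can never end up on different devices.
    return make_positions(tokens, padding_idx, left_pad=False)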
