Using a global in utils.make_positions() causes issues in multi-GPU setups
I am importing fairseq models into the allennlp framework and using them there. I ran into an issue in a multi-GPU setting.
```python
def make_positions(tensor, padding_idx, left_pad, onnx_trace=False):
    """Replace non-padding symbols with their position numbers.
    Position numbers begin at padding_idx+1.
    Padding symbols are ignored, but it is necessary to specify whether padding
    is added on the left side (left_pad=True) or right side (left_pad=False).
    """
    if onnx_trace:
        range_buf = torch._dim_arange(like=tensor, dim=1) + padding_idx + 1
        mask = tensor.ne(padding_idx)
        positions = range_buf.expand_as(tensor)
        if left_pad:
            positions = positions - mask.size(1) + mask.long().sum(dim=1).unsqueeze(1)
        return positions * mask.long() + padding_idx * (1 - mask.long())

    max_pos = padding_idx + 1 + tensor.size(1)
    if not hasattr(make_positions, 'range_buf'):
        make_positions.range_buf = tensor.new()
    make_positions.range_buf = make_positions.range_buf.type_as(tensor)
    if make_positions.range_buf.numel() < max_pos:
        torch.arange(padding_idx + 1, max_pos, out=make_positions.range_buf)
    mask = tensor.ne(padding_idx)
    positions = make_positions.range_buf[:tensor.size(1)].expand_as(tensor)
    if left_pad:
        positions = positions - mask.size(1) + mask.long().sum(dim=1).unsqueeze(1)
    return tensor.clone().masked_scatter_(mask, positions[mask])
```
`make_positions.range_buf` is effectively a global: it is cached as a function attribute on whichever GPU first created it, which causes this error to be thrown when the function is later called with tensors on another device:

RuntimeError: arguments are located on different GPUs at /opt/conda/conda-bld/pytorch_1535491974311/work/aten/src/THC/generated/…/generic/THCTensorMasked.cu:40
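For concreteness, here is a minimal sketch of the failure mode (assuming at least two GPUs and that the function above is importable as `fairseq.utils.make_positions`; the token values are arbitrary):

```python
import torch
from fairseq.utils import make_positions  # assumed import path for the code above

pad = 1

# First call: range_buf is created and cached on cuda:0 as a function attribute.
ids0 = torch.tensor([[pad, 5, 6, 7]], device='cuda:0')
make_positions(ids0, padding_idx=pad, left_pad=False)

# A later call from a replica whose tensors live on cuda:1 still sees the
# cuda:0 buffer, so positions[mask] / masked_scatter_ mix devices and raise
# "arguments are located on different GPUs".
ids1 = torch.tensor([[pad, 5, 6, 7]], device='cuda:1')
make_positions(ids1, padding_idx=pad, left_pad=False)
```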
I successfully fixed this issue by creating the range tensor on every call instead of caching it.
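Roughly, the workaround looks like this (a sketch of that change, not an exact patch; it gives up the cached buffer and builds the range on the input's own device on every call):

```python
import torch

def make_positions(tensor, padding_idx, left_pad):
    """Replace non-padding symbols with their position numbers,
    starting at padding_idx + 1 (same semantics as the original)."""
    max_pos = padding_idx + 1 + tensor.size(1)
    # Allocate the range on tensor.device every call instead of caching it as a
    # function attribute, so replicas on different GPUs never share state.
    positions = torch.arange(
        padding_idx + 1, max_pos, dtype=tensor.dtype, device=tensor.device
    ).expand_as(tensor)
    mask = tensor.ne(padding_idx)
    if left_pad:
        positions = positions - mask.size(1) + mask.long().sum(dim=1).unsqueeze(1)
    return tensor.clone().masked_scatter_(mask, positions[mask])
```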
My question is: is this issue bypassed by the way fairseq does multi-GPU training, so the problem is really with how allennlp does multi-GPU? Or should it also be fixed here?
Issue Analytics
- Created 5 years ago
- Comments: 5 (4 by maintainers)
Top GitHub Comments
The advantage is speed. Due to the global interpreter lock in Python, feeding multiple GPUs from a single Python process often causes Python itself to be the bottleneck. Using separate processes avoids this.
Notably, this is also the same setup that one uses for DistributedDataParallel (i.e., one process per GPU), so once the code works for distributed training it should be trivial to do multiprocessing on a single machine.
In fairseq we launch distinct processes for each GPU and set the default CUDA device on each process: https://github.com/pytorch/fairseq/blob/7e60d45b017f6d08c607f57b9c4f6aa2ded08c97/train.py#L31
So, it should work as expected in fairseq, although I can see why it might not work with multiple GPUs on a single process 😃
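For reference, a minimal sketch of that one-process-per-GPU pattern (the worker function, port, and argument names here are illustrative, not fairseq's actual launcher):

```python
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(device_id, world_size):
    # Bind this process to its own GPU before building the model, so any
    # tensors cached later (like range_buf above) land on the right device.
    torch.cuda.set_device(device_id)
    dist.init_process_group(
        backend='nccl',
        init_method='tcp://localhost:23456',  # illustrative address/port
        rank=device_id,
        world_size=world_size,
    )
    # ... build the model on the default device, then wrap it, e.g. with
    # torch.nn.parallel.DistributedDataParallel(model, device_ids=[device_id])

if __name__ == '__main__':
    n_gpus = torch.cuda.device_count()
    mp.spawn(worker, args=(n_gpus,), nprocs=n_gpus)
```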