Getting an error on a multi-GPU machine
Description
My model works fine when I use gpu:0, but it gives an error when I use gpu:1.
I got this error:
Traceback (most recent call last):
File "zst_client.py", line 53, in <module>
run_inference('Jupiter’s Biggest Moons Started as Tiny Grains of Hail')
File "zst_client.py", line 39, in run_inference
response = triton_client.infer(model_name, model_version=model_version, inputs=[input0, input1], outputs=[output])
File "/usr/local/lib/python3.8/dist-packages/tritonclient/http/__init__.py", line 1102, in infer
_raise_if_error(response)
File "/usr/local/lib/python3.8/dist-packages/tritonclient/http/__init__.py", line 63, in _raise_if_error
raise error
tritonclient.utils.InferenceServerException: PyTorch execute failure: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
File "code/__torch__.py", line 12, in forward
input_ids = torch.to(data, dtype=4, layout=0, device=torch.device("cuda"), pin_memory=False, non_blocking=False, copy=False, memory_format=None)
attention_mask0 = torch.to(attention_mask, dtype=4, layout=0, device=torch.device("cuda"), pin_memory=False, non_blocking=False, copy=False, memory_format=None)
_1 = (_0).forward(input_ids, attention_mask0, )
~~~~~~~~~~~ <--- HERE
return (_1,)
File "code/__torch__/transformers/modeling_xlm_roberta.py", line 11, in forward
attention_mask: Tensor) -> Tensor:
_0 = self.classifier
_1 = (self.roberta).forward(input_ids, attention_mask, )
~~~~~~~~~~~~~~~~~~~~~ <--- HERE
return (_0).forward(_1, )
File "code/__torch__/transformers/modeling_roberta.py", line 21, in forward
_7 = torch.to(extended_attention_mask, 6, False, False, None)
attention_mask0 = torch.mul(torch.rsub(_7, 1., 1), CONSTANTS.c0)
_8 = (_0).forward((_1).forward(input_ids, input, ), attention_mask0, )
~~~~~~~~~~~ <--- HERE
return _8
class RobertaEmbeddings(Module):
File "code/__torch__/transformers/modeling_roberta.py", line 47, in forward
_16 = torch.add(_15, CONSTANTS.c1, alpha=1)
input0 = torch.to(_16, dtype=4, layout=0, device=torch.device("cuda:0"), pin_memory=False, non_blocking=False, copy=False, memory_format=None)
_17 = (_13).forward(input_ids, )
~~~~~~~~~~~~ <--- HERE
_18 = (_12).forward(input0, )
_19 = (_11).forward(input, )
File "code/__torch__/torch/nn/modules/sparse.py", line 8, in forward
def forward(self: __torch__.torch.nn.modules.sparse.Embedding,
input_ids: Tensor) -> Tensor:
inputs_embeds = torch.embedding(self.weight, input_ids, 1, False, False)
~~~~~~~~~~~~~~~ <--- HERE
return inputs_embeds
Traceback of TorchScript, original code (most recent call last):
/home/cmeena/.local/lib/python3.8/site-packages/torch/nn/functional.py(1814): embedding
/home/cmeena/.local/lib/python3.8/site-packages/torch/nn/modules/sparse.py(124): forward
/home/cmeena/.local/lib/python3.8/site-packages/torch/nn/modules/module.py(704): _slow_forward
/home/cmeena/.local/lib/python3.8/site-packages/torch/nn/modules/module.py(720): _call_impl
/home/cmeena/.local/lib/python3.8/site-packages/transformers/modeling_roberta.py(117): forward
/home/cmeena/.local/lib/python3.8/site-packages/torch/nn/modules/module.py(704): _slow_forward
/home/cmeena/.local/lib/python3.8/site-packages/torch/nn/modules/module.py(720): _call_impl
/home/cmeena/.local/lib/python3.8/site-packages/transformers/modeling_roberta.py(674): forward
/home/cmeena/.local/lib/python3.8/site-packages/torch/nn/modules/module.py(704): _slow_forward
/home/cmeena/.local/lib/python3.8/site-packages/torch/nn/modules/module.py(720): _call_impl
/home/cmeena/.local/lib/python3.8/site-packages/transformers/modeling_roberta.py(989): forward
/home/cmeena/.local/lib/python3.8/site-packages/torch/nn/modules/module.py(704): _slow_forward
/home/cmeena/.local/lib/python3.8/site-packages/torch/nn/modules/module.py(720): _call_impl
robertamodelgpu.py(17): forward
/home/cmeena/.local/lib/python3.8/site-packages/torch/nn/modules/module.py(704): _slow_forward
/home/cmeena/.local/lib/python3.8/site-packages/torch/nn/modules/module.py(720): _call_impl
/home/cmeena/.local/lib/python3.8/site-packages/torch/jit/__init__.py(1109): trace_module
/home/cmeena/.local/lib/python3.8/site-packages/torch/jit/__init__.py(953): trace
robertamodelgpu.py(20): <module>
RuntimeError: Input, output and indices must be on the current device
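Note how the serialized code above contains device literals: torch.device("cuda") and torch.device("cuda:0") were baked into the graph by the .cuda() calls made at trace time. When Triton loads the model onto gpu:1, the weights sit on cuda:1 while the traced casts still move tensors to cuda:0, which matches the "Input, output and indices must be on the current device" failure. As a quick check (a sketch, not part of the original report), the hard-coded devices can be seen by inspecting the traced module:

import torch

# Load the traced model onto the CPU and print its TorchScript source;
# look for torch.device(...) literals recorded during tracing.
traced = torch.jit.load("mode-tuplel.pt", map_location="cpu")
print(traced.code)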
Triton Information
What version of Triton are you using? 21.02
Are you using the Triton container or did you build it yourself? I am using the Triton container.
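For reference, the 21.02 container is typically launched with all GPUs exposed, along the lines of the command below (a sketch; the model-repository path is an assumption, not from the original issue):

docker run --gpus=all --rm -p8000:8000 -p8001:8001 -p8002:8002 \
  -v /path/to/model_repository:/models \
  nvcr.io/nvidia/tritonserver:21.02-py3 \
  tritonserver --model-repository=/models

If only one GPU is exposed to the container, gpus: [ 1 ] in the model config would fail for a different reason, so it is worth verifying that both devices are visible inside the container.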
To Reproduce
I used the model from this blog post to create this experiment: https://medium.com/nvidia-ai/how-to-deploy-almost-any-hugging-face-model-on-nvidia-triton-inference-server-with-an-8ee7ec0e6fc4
Use this model:
import torch
from transformers import XLMRobertaForSequenceClassification, XLMRobertaTokenizer

R_tokenizer = XLMRobertaTokenizer.from_pretrained('joeddav/xlm-roberta-large-xnli')

premise = 'Jupiters Biggest Moons Started as Tiny Grains of Hail'
hypothesis = 'This text is about space and cosmos'

# Encode the premise/hypothesis pair and build an attention mask.
input_ids = R_tokenizer.encode(premise, hypothesis, return_tensors='pt',
                               max_length=256, truncation=True, padding='max_length')
mask = input_ids != -1
mask = mask.long()

class PyTorch_to_TorchScript(torch.nn.Module):
    def __init__(self):
        super(PyTorch_to_TorchScript, self).__init__()
        self.model = XLMRobertaForSequenceClassification.from_pretrained(
            'joeddav/xlm-roberta-large-xnli').cuda()

    def forward(self, data, attention_mask=None):
        # These .cuda() calls pin the inputs to the default CUDA device
        # (cuda:0) at trace time.
        return tuple(self.model(data.cuda(), attention_mask.cuda()))

pt_model = PyTorch_to_TorchScript().eval()
traced_script_module = torch.jit.trace(pt_model, (input_ids, mask), strict=False)
traced_script_module.save("mode-tuplel.pt")
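The .cuda() calls in forward are what bake cuda:0 into the traced graph. As a possible workaround (a sketch of my suggestion, not from the original issue), drop the explicit .cuda() calls and trace on CPU so the saved graph carries no device literals; the pytorch_libtorch backend can then place the model on whichever GPU the instance_group specifies. DeviceAgnosticWrapper and the model.pt file name are illustrative:

import torch
from transformers import XLMRobertaForSequenceClassification, XLMRobertaTokenizer

R_tokenizer = XLMRobertaTokenizer.from_pretrained('joeddav/xlm-roberta-large-xnli')
input_ids = R_tokenizer.encode('Jupiters Biggest Moons Started as Tiny Grains of Hail',
                               'This text is about space and cosmos',
                               return_tensors='pt', max_length=256,
                               truncation=True, padding='max_length')
mask = (input_ids != -1).long()

class DeviceAgnosticWrapper(torch.nn.Module):
    # No .cuda() inside forward: tracing on CPU records no device
    # literals, so the model is free to run on any GPU Triton assigns.
    def __init__(self):
        super().__init__()
        self.model = XLMRobertaForSequenceClassification.from_pretrained(
            'joeddav/xlm-roberta-large-xnli')

    def forward(self, data, attention_mask):
        return tuple(self.model(data, attention_mask))

pt_model = DeviceAgnosticWrapper().eval()
traced = torch.jit.trace(pt_model, (input_ids, mask), strict=False)
traced.save("model.pt")  # place under the Triton model repository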
Try these two configs. First, with the model instance on gpu:0:
name: "zst"
platform: "pytorch_libtorch"
input [
  {
    name: "input__0"
    data_type: TYPE_INT32
    dims: [ -1, -1 ]
  },
  {
    name: "input__1"
    data_type: TYPE_INT32
    dims: [ -1, -1 ]
  }
]
output [
  {
    name: "output__0"
    data_type: TYPE_FP32
    dims: [ -1, -1 ]
  }
]
instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]
Then the same config with the instance on gpu:1:
name: "zst"
platform: "pytorch_libtorch"
input [
  {
    name: "input__0"
    data_type: TYPE_INT32
    dims: [ -1, -1 ]
  },
  {
    name: "input__1"
    data_type: TYPE_INT32
    dims: [ -1, -1 ]
  }
]
output [
  {
    name: "output__0"
    data_type: TYPE_FP32
    dims: [ -1, -1 ]
  }
]
instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 1 ]
  }
]
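As a side note, a single instance_group can also list both devices, which makes Triton create one instance per listed GPU (a sketch, not a config from the original issue; it hits the same error as long as the traced model pins tensors to cuda:0):

instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0, 1 ]
  }
]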
Use a machine that has two or more GPUs.
You will find that it works fine for gpu:0 but gives the error above for gpu:1.
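The client script from the traceback (zst_client.py) is not included in the issue; below is a minimal sketch of what it presumably looks like, using the tritonclient HTTP API. The tensor names input__0/input__1/output__0 come from the config above, the tokenization mirrors the repro script, and the server URL and model version are assumptions:

import numpy as np
import tritonclient.http as httpclient
from transformers import XLMRobertaTokenizer

R_tokenizer = XLMRobertaTokenizer.from_pretrained('joeddav/xlm-roberta-large-xnli')
triton_client = httpclient.InferenceServerClient(url='localhost:8000')

def run_inference(premise, model_name='zst', model_version='1'):
    hypothesis = 'This text is about space and cosmos'
    input_ids = R_tokenizer.encode(premise, hypothesis, return_tensors='pt',
                                   max_length=256, truncation=True,
                                   padding='max_length')
    mask = (input_ids != -1).long()

    # Shapes and dtypes must match the config: two [-1,-1] INT32 inputs.
    input0 = httpclient.InferInput('input__0', list(input_ids.shape), 'INT32')
    input0.set_data_from_numpy(input_ids.numpy().astype(np.int32))
    input1 = httpclient.InferInput('input__1', list(mask.shape), 'INT32')
    input1.set_data_from_numpy(mask.numpy().astype(np.int32))
    output = httpclient.InferRequestedOutput('output__0')

    response = triton_client.infer(model_name, model_version=model_version,
                                   inputs=[input0, input1], outputs=[output])
    return response.as_numpy('output__0')

run_inference('Jupiter’s Biggest Moons Started as Tiny Grains of Hail')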
The model is a TorchScript XLM-RoBERTa sequence-classification model served with the pytorch_libtorch backend, with two variable-shape INT32 inputs (input__0, input__1) and one FP32 output (output__0); both model configuration files are included above.
Expected behavior
The model should work on whichever GPU the instance group assigns, on a machine with multiple GPU devices.
Top GitHub Comments
@chandrameenamohan closing due to inactivity.
same error