TorchScript model loads successfully but fails with "CUDA error: CUBLAS_STATUS_NOT_INITIALIZED" when called for inference
Description
I converted a PyTorch model to TorchScript using the following script: https://gist.github.com/keskarnitish/1061cbd101ab186e2d80c7877517e7ee#file-saved_pytorch_model-py.
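For reference, the conversion in the gist amounts to tracing the model; the following is a minimal sketch, assuming bert-base-uncased and a (1, 128) input (the actual model and shapes are in the linked gist):

import torch
from transformers import BertForSequenceClassification

# torchscript=True configures the model for tracing (tuple outputs, tied weights)
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', torchscript=True)
model.eval()
example_input_ids = torch.randint(0, 30000, (1, 128), dtype=torch.long)  # dummy token ids
traced = torch.jit.trace(model, example_input_ids)
traced.save('model.pt')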
I tested the model using
import torch
model = torch.jit.load('model.pt')
example_outputs = model(example_inputs['input_ids'])
and it worked as expected.
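Note that the traced graph hard-codes torch.device("cuda") (visible in the traceback below), so a local check on a GPU machine is closer to what Triton actually runs. A sketch, with the input shape assumed:

import torch

model = torch.jit.load('model.pt')
input_ids = torch.randint(0, 30000, (1, 128), dtype=torch.long)  # dummy ids, shape assumed
with torch.no_grad():
    outputs = model(input_ids)  # the graph itself moves the input to CUDA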
I then deployed tritonserver:20.03-py3 on GKE on a node with a T4 GPU. Running nvidia-smi on the node gave:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   62C    P0    32W /  70W |   3163MiB / 15079MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
The Triton server successfully loaded the model on the node. I checked the API status and it reported that the model is ready.
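The status check was against Triton 20.03's V1 HTTP endpoint; a minimal sketch using requests, with host, port, and model name assumed:

import requests

# Triton 20.03 exposes the V1 status API at /api/status[/<model>]
r = requests.get('http://localhost:8000/api/status/bert')
print(r.status_code)
print(r.text)  # the status text should show the model in a READY state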
But when I ran perf_client, the server logs showed the following:
I0525 05:24:42.733448 1 libtorch_backend.cc:538] Running bert with 1 request payloads
I0525 05:24:42.734669 1 pinned_memory_manager.cc:131] pinned memory allocation: size 256, addr 0x7f8a20000090
I0525 05:24:43.009041 1 libtorch_backend.cc:804] CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
The above operation failed in interpreter.
Traceback (most recent call last):
Serialized File "code/__torch__.py", line 9
_0 = self.model
input_ids = torch.to(data, dtype=4, layout=0, device=torch.device("cuda"), pin_memory=False, non_blocking=False, copy=False, memory_format=None)
return ((_0).forward(input_ids, ),)
~~~~~~~~~~~ <--- HERE
Serialized File "code/__torch__/transformers/modeling_bert.py", line 10, in forward
input_ids: Tensor) -> Tensor:
_0 = self.classifier
_1 = (self.dropout).forward((self.bert).forward(input_ids, ), )
~~~~~~~~~~~~~~~~~~ <--- HERE
return (_0).forward(_1, )
class BertModel(Module):
Serialized File "code/__torch__/transformers/modeling_bert.py", line 35, in forward
_12 = torch.to(extended_attention_mask, 6, False, False, None)
attention_mask0 = torch.mul(torch.rsub(_12, 1., 1), CONSTANTS.c0)
_13 = (_3).forward((_4).forward(input_ids, input, ), attention_mask0, )
~~~~~~~~~~~ <--- HERE
return (_2).forward(_13, )
class BertEmbeddings(Module):
Serialized File "code/__torch__/transformers/modeling_bert.py", line 73, in forward
attention_mask: Tensor) -> Tensor:
_26 = getattr(self.layer, "1")
_27 = (getattr(self.layer, "0")).forward(argument_1, attention_mask, )
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
_28 = getattr(self.layer, "2")
_29 = (_26).forward(_27, attention_mask, )
Serialized File "code/__torch__/transformers/modeling_bert.py", line 107, in forward
_49 = self.output
_50 = self.intermediate
_51 = (self.attention).forward(argument_1, attention_mask, )
~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
_52 = (_49).forward((_50).forward(_51, ), _51, )
return _52
Serialized File "code/__torch__/transformers/modeling_bert.py", line 119, in forward
attention_mask: Tensor) -> Tensor:
_53 = self.output
_54 = (self.self).forward(argument_1, attention_mask, )
~~~~~~~~~~~~~~~~~~ <--- HERE
return (_53).forward(_54, argument_1, )
class BertSelfAttention(Module):
Serialized File "code/__torch__/transformers/modeling_bert.py", line 134, in forward
_56 = self.value
_57 = self.key
_58 = (self.query).forward(argument_1, )
~~~~~~~~~~~~~~~~~~~ <--- HERE
_59 = (_57).forward(argument_1, )
_60 = (_56).forward(argument_1, )
/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py(1612): linear
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/linear.py(87): forward
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py(534): _slow_forward
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py(548): __call__
/usr/local/lib/python3.6/dist-packages/transformers/modeling_bert.py(216): forward
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py(534): _slow_forward
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py(548): __call__
/usr/local/lib/python3.6/dist-packages/transformers/modeling_bert.py(314): forward
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py(534): _slow_forward
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py(548): __call__
/usr/local/lib/python3.6/dist-packages/transformers/modeling_bert.py(368): forward
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py(534): _slow_forward
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py(548): __call__
/usr/local/lib/python3.6/dist-packages/transformers/modeling_bert.py(407): forward
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py(534): _slow_forward
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py(548): __call__
/usr/local/lib/python3.6/dist-packages/transformers/modeling_bert.py(734): forward
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py(534): _slow_forward
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py(548): __call__
/usr/local/lib/python3.6/dist-packages/transformers/modeling_bert.py(1142): forward
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py(534): _slow_forward
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py(548): __call__
<ipython-input-2-afc347149dec>(9): forward
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py(534): _slow_forward
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py(548): __call__
/usr/local/lib/python3.6/dist-packages/torch/jit/__init__.py(1027): trace_module
/usr/local/lib/python3.6/dist-packages/torch/jit/__init__.py(875): trace
<ipython-input-2-afc347149dec>(13): <module>
/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py(2882): run_code
/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py(2822): run_ast_nodes
/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py(2718): run_cell
/usr/local/lib/python3.6/dist-packages/ipykernel/zmqshell.py(537): run_cell
/usr/local/lib/python3.6/dist-packages/ipykernel/ipkernel.py(208): do_execute
/usr/local/lib/python3.6/dist-packages/ipykernel/kernelbase.py(399): execute_request
/usr/local/lib/python3.6/dist-packages/ipykernel/kernelbase.py(233): dispatch_shell
/usr/local/lib/python3.6/dist-packages/ipykernel/kernelbase.py(283): dispatcher
/usr/local/lib/python3.6/dist-packages/tornado/stack_context.py(277): null_wrapper
/usr/local/lib/python3.6/dist-packages/zmq/eventloop/zmqstream.py(438): _run_callback
/usr/local/lib/python3.6/dist-packages/zmq/eventloop/zmqstream.py(486): _handle_recv
/usr/local/lib/python3.6/dist-packages/zmq/eventloop/zmqstream.py(456): _handle_events
/usr/local/lib/python3.6/dist-packages/tornado/stack_context.py(277): null_wrapper
/usr/local/lib/python3.6/dist-packages/tornado/ioloop.py(888): start
/usr/local/lib/python3.6/dist-packages/ipykernel/kernelapp.py(499): start
/usr/local/lib/python3.6/dist-packages/traitlets/config/application.py(664): launch_instance
/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py(16): <module>
/usr/lib/python3.6/runpy.py(85): _run_code
/usr/lib/python3.6/runpy.py(193): _run_module_as_main
Serialized File "code/__torch__/torch/nn/modules/linear.py", line 9, in forward
argument_1: Tensor) -> Tensor:
_0 = self.bias
output = torch.matmul(argument_1, torch.t(self.weight))
~~~~~~~~~~~~ <--- HERE
return torch.add_(output, _0, alpha=1)
I0525 05:24:43.009080 1 pinned_memory_manager.cc:158] pinned memory deallocation: addr 0x7f8a20000090
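The traceback bottoms out in torch.nn.functional.linear, i.e. torch.matmul on CUDA tensors, which is the first call that needs a cuBLAS handle; CUBLAS_STATUS_NOT_INITIALIZED at cublasCreate typically points at a broken or memory-exhausted CUDA context rather than at the model itself. A standalone sketch of the same call path, with sizes assumed:

import torch

# nn.Linear on CUDA goes through F.linear -> torch.matmul -> cuBLAS,
# the same path that failed above; if cuBLAS cannot create a handle,
# this minimal op fails identically.
linear = torch.nn.Linear(768, 768).cuda()        # 768 = assumed BERT hidden size
hidden = torch.randn(1, 128, 768, device='cuda') # batch and sequence length assumed
out = linear(hidden)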
Triton Information
What version of Triton are you using? 20.03
Are you using the Triton container or did you build it yourself? Triton container
To Reproduce
Steps to reproduce the behavior: see the description above.
Expected behavior
The server should not return any error.
Top GitHub Comments
Thanks for the detailed bug report; we will take a look.
@katie-cathy-hunt please verify the host system has its CUDA environment set up correctly. I am closing this ticket for now since we were unable to reproduce the error with the appropriate environment; please re-open if you still see this failure.
@ethem-kinginthenorth please test the same with the upcoming 20.06 release; the V2 APIs were made more robust in that release.
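For anyone hitting the same failure: a quick, hedged way to sanity-check the CUDA environment the maintainers mention, from any Python environment on the node that has PyTorch installed (the Triton container itself ships libtorch, not necessarily Python torch):

import torch

print(torch.cuda.is_available())      # driver and runtime visible?
print(torch.cuda.get_device_name(0))  # should report the Tesla T4
torch.cuda.init()                     # forces context creation; fails early on a broken setup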