TorchScript model loads successfully but fails with "CUDA error: CUBLAS_STATUS_NOT_INITIALIZED" when called for inference
Description
I converted a PyTorch model to TorchScript using the following script: https://gist.github.com/keskarnitish/1061cbd101ab186e2d80c7877517e7ee#file-saved_pytorch_model-py.
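For reference, the conversion in the gist amounts to tracing the model; the following is a minimal sketch, assuming bert-base-uncased and a (1, 128) input (the actual model and shapes are in the linked gist):

import torch
from transformers import BertForSequenceClassification

# torchscript=True configures the model for tracing (tuple outputs, tied weights)
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', torchscript=True)
model.eval()
example_input_ids = torch.randint(0, 30000, (1, 128), dtype=torch.long)  # dummy token ids
traced = torch.jit.trace(model, example_input_ids)
traced.save('model.pt')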
I tested the model using
import torch
model = torch.jit.load('model.pt')
example_outputs = model(example_inputs['input_ids'])
and it worked as expected.
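Note that the traced graph hard-codes torch.device("cuda") (visible in the traceback below), so a local check on a GPU machine is closer to what Triton actually runs. A sketch, with the input shape assumed:

import torch

model = torch.jit.load('model.pt')
input_ids = torch.randint(0, 30000, (1, 128), dtype=torch.long)  # dummy ids, shape assumed
with torch.no_grad():
    outputs = model(input_ids)  # the graph itself moves the input to CUDA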
I then deployed tritonserver:20.03-py3 on GKE on a node with a T4 GPU. Running nvidia-smi on the node gave:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   62C    P0    32W /  70W |   3163MiB / 15079MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
The Triton server successfully loaded the model on the node. I checked the API status and it reported that the model is ready.
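The status check was against Triton 20.03's V1 HTTP endpoint; a minimal sketch using requests, with host, port, and model name assumed:

import requests

# Triton 20.03 exposes the V1 status API at /api/status[/<model>]
r = requests.get('http://localhost:8000/api/status/bert')
print(r.status_code)
print(r.text)  # the status text should show the model in a READY state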
But when I ran perf_client, the server logs showed the following:
I0525 05:24:42.733448 1 libtorch_backend.cc:538] Running bert with 1 request payloads
I0525 05:24:42.734669 1 pinned_memory_manager.cc:131] pinned memory allocation: size 256, addr 0x7f8a20000090
I0525 05:24:43.009041 1 libtorch_backend.cc:804] CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
The above operation failed in interpreter.
Traceback (most recent call last):
Serialized File "code/__torch__.py", line 9
_0 = self.model
input_ids = torch.to(data, dtype=4, layout=0, device=torch.device("cuda"), pin_memory=False, non_blocking=False, copy=False, memory_format=None)
return ((_0).forward(input_ids, ),)
~~~~~~~~~~~ <--- HERE
Serialized File "code/__torch__/transformers/modeling_bert.py", line 10, in forward
input_ids: Tensor) -> Tensor:
_0 = self.classifier
_1 = (self.dropout).forward((self.bert).forward(input_ids, ), )
~~~~~~~~~~~~~~~~~~ <--- HERE
return (_0).forward(_1, )
class BertModel(Module):
Serialized File "code/__torch__/transformers/modeling_bert.py", line 35, in forward
_12 = torch.to(extended_attention_mask, 6, False, False, None)
attention_mask0 = torch.mul(torch.rsub(_12, 1., 1), CONSTANTS.c0)
_13 = (_3).forward((_4).forward(input_ids, input, ), attention_mask0, )
~~~~~~~~~~~ <--- HERE
return (_2).forward(_13, )
class BertEmbeddings(Module):
Serialized File "code/__torch__/transformers/modeling_bert.py", line 73, in forward
attention_mask: Tensor) -> Tensor:
_26 = getattr(self.layer, "1")
_27 = (getattr(self.layer, "0")).forward(argument_1, attention_mask, )
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
_28 = getattr(self.layer, "2")
_29 = (_26).forward(_27, attention_mask, )
Serialized File "code/__torch__/transformers/modeling_bert.py", line 107, in forward
_49 = self.output
_50 = self.intermediate
_51 = (self.attention).forward(argument_1, attention_mask, )
~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
_52 = (_49).forward((_50).forward(_51, ), _51, )
return _52
Serialized File "code/__torch__/transformers/modeling_bert.py", line 119, in forward
attention_mask: Tensor) -> Tensor:
_53 = self.output
_54 = (self.self).forward(argument_1, attention_mask, )
~~~~~~~~~~~~~~~~~~ <--- HERE
return (_53).forward(_54, argument_1, )
class BertSelfAttention(Module):
Serialized File "code/__torch__/transformers/modeling_bert.py", line 134, in forward
_56 = self.value
_57 = self.key
_58 = (self.query).forward(argument_1, )
~~~~~~~~~~~~~~~~~~~ <--- HERE
_59 = (_57).forward(argument_1, )
_60 = (_56).forward(argument_1, )
/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py(1612): linear
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/linear.py(87): forward
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py(534): _slow_forward
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py(548): __call__
/usr/local/lib/python3.6/dist-packages/transformers/modeling_bert.py(216): forward
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py(534): _slow_forward
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py(548): __call__
/usr/local/lib/python3.6/dist-packages/transformers/modeling_bert.py(314): forward
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py(534): _slow_forward
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py(548): __call__
/usr/local/lib/python3.6/dist-packages/transformers/modeling_bert.py(368): forward
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py(534): _slow_forward
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py(548): __call__
/usr/local/lib/python3.6/dist-packages/transformers/modeling_bert.py(407): forward
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py(534): _slow_forward
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py(548): __call__
/usr/local/lib/python3.6/dist-packages/transformers/modeling_bert.py(734): forward
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py(534): _slow_forward
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py(548): __call__
/usr/local/lib/python3.6/dist-packages/transformers/modeling_bert.py(1142): forward
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py(534): _slow_forward
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py(548): __call__
<ipython-input-2-afc347149dec>(9): forward
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py(534): _slow_forward
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py(548): __call__
/usr/local/lib/python3.6/dist-packages/torch/jit/__init__.py(1027): trace_module
/usr/local/lib/python3.6/dist-packages/torch/jit/__init__.py(875): trace
<ipython-input-2-afc347149dec>(13): <module>
/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py(2882): run_code
/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py(2822): run_ast_nodes
/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py(2718): run_cell
/usr/local/lib/python3.6/dist-packages/ipykernel/zmqshell.py(537): run_cell
/usr/local/lib/python3.6/dist-packages/ipykernel/ipkernel.py(208): do_execute
/usr/local/lib/python3.6/dist-packages/ipykernel/kernelbase.py(399): execute_request
/usr/local/lib/python3.6/dist-packages/ipykernel/kernelbase.py(233): dispatch_shell
/usr/local/lib/python3.6/dist-packages/ipykernel/kernelbase.py(283): dispatcher
/usr/local/lib/python3.6/dist-packages/tornado/stack_context.py(277): null_wrapper
/usr/local/lib/python3.6/dist-packages/zmq/eventloop/zmqstream.py(438): _run_callback
/usr/local/lib/python3.6/dist-packages/zmq/eventloop/zmqstream.py(486): _handle_recv
/usr/local/lib/python3.6/dist-packages/zmq/eventloop/zmqstream.py(456): _handle_events
/usr/local/lib/python3.6/dist-packages/tornado/stack_context.py(277): null_wrapper
/usr/local/lib/python3.6/dist-packages/tornado/ioloop.py(888): start
/usr/local/lib/python3.6/dist-packages/ipykernel/kernelapp.py(499): start
/usr/local/lib/python3.6/dist-packages/traitlets/config/application.py(664): launch_instance
/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py(16): <module>
/usr/lib/python3.6/runpy.py(85): _run_code
/usr/lib/python3.6/runpy.py(193): _run_module_as_main
Serialized File "code/__torch__/torch/nn/modules/linear.py", line 9, in forward
argument_1: Tensor) -> Tensor:
_0 = self.bias
output = torch.matmul(argument_1, torch.t(self.weight))
~~~~~~~~~~~~ <--- HERE
return torch.add_(output, _0, alpha=1)
I0525 05:24:43.009080 1 pinned_memory_manager.cc:158] pinned memory deallocation: addr 0x7f8a20000090
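The traceback bottoms out in torch.nn.functional.linear, i.e. torch.matmul on CUDA tensors, which is the first call that needs a cuBLAS handle; CUBLAS_STATUS_NOT_INITIALIZED at cublasCreate typically points at a broken or memory-exhausted CUDA context rather than at the model itself. A standalone sketch of the same call path, with sizes assumed:

import torch

# nn.Linear on CUDA goes through F.linear -> torch.matmul -> cuBLAS,
# the same path that failed above; if cuBLAS cannot create a handle,
# this minimal op fails identically.
linear = torch.nn.Linear(768, 768).cuda()        # 768 = assumed BERT hidden size
hidden = torch.randn(1, 128, 768, device='cuda') # batch and sequence length assumed
out = linear(hidden)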
Triton Information
What version of Triton are you using? 20.03
Are you using the Triton container or did you build it yourself? Triton container
To Reproduce
Steps to reproduce the behavior: see the description above.
Expected behavior
The server should not return any error.
Top GitHub Comments
Thanks for the detailed bug report; we will take a look.
@katie-cathy-hunt please verify the host system has its CUDA environment set up correctly. I am closing this ticket for now since we were unable to reproduce the error with the appropriate environment; please re-open if you still see this failure.
@ethem-kinginthenorth please test the same with the upcoming 20.06 release; the V2 APIs were made more robust in that release.
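For anyone hitting the same failure: a quick, hedged way to sanity-check the CUDA environment the maintainers mention, from any Python environment on the node that has PyTorch installed (the Triton container itself ships libtorch, not necessarily Python torch):

import torch

print(torch.cuda.is_available())      # driver and runtime visible?
print(torch.cuda.get_device_name(0))  # should report the Tesla T4
torch.cuda.init()                     # forces context creation; fails early on a broken setup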