Not able to load the BertForSequenceClassification model from huggingface
Description
Hi, I am not able to load the BertForSequenceClassification model from Hugging Face in Triton Inference Server.
Triton Information
v21.09
Are you using the Triton container or did you build it yourself?
I am using the Triton container nvcr.io/nvidia/tritonserver:21.09-py3.
To Reproduce
I used the following script to convert the BERT model to a traced TorchScript model, which I then wanted to load in Triton Inference Server.
import torch
from transformers import BertForSequenceClassification, BertTokenizer

# Dummy inputs for tracing: token ids and an attention mask derived from them.
input_ids = torch.tensor([[2, 3, 4, 5]]).long()
mask = input_ids != 1
mask = mask.long()

class MyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.model = BertForSequenceClassification.from_pretrained("bert-base-uncased")

    def forward(self, data, attention_mask):
        out2 = self.model(data, attention_mask=attention_mask)
        return out2[0]  # return only the logits so the traced output is a plain Tensor

pt_model = MyModel().eval()
print(pt_model(input_ids, mask))

traced_model = torch.jit.trace(pt_model, (input_ids, mask))
traced_model.save("model.pt")
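As a sanity check before deploying, it can help to reload the traced artifact and note the exact PyTorch version used for tracing, since the server's libtorch must be able to resolve every op serialized into model.pt. A minimal sketch, continuing from the script above:

import torch

# Record the tracing environment; the Triton container bundles its own libtorch.
print(torch.__version__)

# Reloading exercises the serialized graph with the local PyTorch build.
reloaded = torch.jit.load("model.pt")
print(reloaded(input_ids, mask))  # should match the eager output printed above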
The config.pbtxt file is:
name: "bert"
platform: "pytorch_libtorch"
input [
{
name: "input__0"
data_type: TYPE_INT32
dims: [1, 512]
},
input [
{
name: "input__1"
data_type: TYPE_INT32
dims: [1, 512]
}
]
output {
name: "output__0"
data_type: TYPE_FP32
dims: [1, 2]
}
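One detail worth double-checking: the tracing script feeds .long() tensors, i.e. int64, while the config declares TYPE_INT32. If the traced model expects int64 inputs, the input entries would need TYPE_INT64 instead, along these lines:

input [
  {
    name: "input__0"
    data_type: TYPE_INT64
    dims: [1, 512]
  },
  {
    name: "input__1"
    data_type: TYPE_INT64
    dims: [1, 512]
  }
]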
I then ran the following command:
sudo docker run --rm -p8000:8000 -p8001:8001 -p8002:8002 -v /home/rkoy/server/model_repository:/models nvcr.io/nvidia/tritonserver:21.09-py3 tritonserver --model-repository=/models
I got this error:
UNAVAILABLE: Internal: failed to load model 'bert':
Arguments for call are not valid.
The following variants are available:

  aten::gelu(Tensor self, bool approximate) -> (Tensor):
  Argument approximate not provided.

  aten::gelu.out(Tensor self, bool approximate, *, Tensor(a!) out) -> (Tensor(a!)):
  Argument approximate not provided.

The original call is:
/data/rkoy/anaconda3/lib/python3.8/site-packages/torch/nn/functional.py(1313): gelu
/data/rkoy/anaconda3/lib/python3.8/site-packages/transformers/models/bert/modeling_bert.py(425): forward
/data/rkoy/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py(534): _slow_forward
/data/rkoy/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py(548): __call__
/data/rkoy/anaconda3/lib/python3.8/site-packages/transformers/models/bert/modeling_bert.py(523): feed_forward_chunk
/data/rkoy/anaconda3/lib/python3.8/site-packages/transformers/modeling_utils.py(2349): apply_chunking_to_forward
/data/rkoy/anaconda3/lib/python3.8/site-packages/transformers/models/bert/modeling_bert.py(511): forward
/data/rkoy/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py(534): _slow_forward
/data/rkoy/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py(548): __call__
/data/rkoy/anaconda3/lib/python3.8/site-packages/transformers/models/bert/modeling_bert.py(583): forward
/data/rkoy/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py(534): _slow_forward
/data/rkoy/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py(548): __call__
/data/rkoy/anaconda3/lib/python3.8/site-packages/transformers/models/bert/modeling_bert.py(996): forward
/data/rkoy/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py(534): _slow_forward
/data/rkoy/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py(548): __call__
/data/rkoy/anaconda3/lib/python3.8/site-packages/transformers/models/bert/modeling_bert.py(1530): forward
/data/rkoy/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py(534): _slow_forward
/data/rkoy/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py(548): __call__
convert_pytorch_model_to_jit.py(17): forward
/data/rkoy/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py(534): _slow_forward
/data/rkoy/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py(548): __call__
/data/rkoy/anaconda3/lib/python3.8/site-packages/torch/jit/__init__.py(1027): trace_module
/data/rkoy/anaconda3/lib/python3.8/site-packages/torch/jit/__init__.py(873): trace
convert_pytorch_model_to_jit.py(25): <module>

Serialized File "code/__torch__/transformers/models/bert/modeling_bert.py", line 178
  def forward(self: __torch__.transformers.models.bert.modeling_bert.BertIntermediate,
      argument_1: Tensor) -> Tensor:
    input = torch.gelu((self.dense).forward(argument_1, ))
            ~~~~~~~~~~ <--- HERE
    return input
class BertOutput(Module):
Expected behavior
I expected the BERT model to load successfully, without any error. I would like to know what mistake I am making, and a solution for it.
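A likely cause is that the PyTorch build used for tracing and the libtorch inside the Triton container disagree on the aten::gelu schema. One way to rule that out is to re-run the tracing script with a PyTorch build matching the Triton release, e.g. inside the NGC PyTorch container of the same tag. A sketch, where trace_bert.py is a hypothetical name for the conversion script above:

sudo docker run --rm -v $PWD:/workspace nvcr.io/nvidia/pytorch:21.09-py3 \
    python trace_bert.py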
Top GitHub Comments
Closing issue due to lack of activity. Please re-open the issue if you would like to follow up.
Hi. Having the same problem running nvcr.io/nvidia/tritonserver:21.11-py3 and transformers==2.10 (I know it's kinda old). I have managed to "solve" this by removing the if in the transformers code here and just hard-coding gelu = _gelu_python.
This allowed me to run the model, and it even seems to run fine. There are some small differences between the predictions of the PyTorch model and the TritonServer one, but they are negligible in my case. The speed seems almost the same; in my case the Triton server is even sometimes slower.
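A rough sketch of that workaround without editing the installed package: monkey-patch the activation before building the model, so tracing records primitive ops (erf, mul, add) instead of a single aten::gelu call. The names transformers.activations and ACT2FN are assumptions; the module layout varies across transformers versions, and some versions look the activation up elsewhere:

import math

import torch
import transformers.activations as activations

def _gelu_python(x):
    # Pure-Python GELU; traces to erf/mul/add instead of aten::gelu.
    return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))

# Assumption: BERT resolves its hidden activation through ACT2FN["gelu"].
# Apply this patch before calling BertForSequenceClassification.from_pretrained.
activations.ACT2FN["gelu"] = _gelu_python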
But I have only been testing with the dynamic batching option turned off, and I already had some internal code that turns text into torch.Tensor, so in order to use the Triton model I have to convert every Tensor into a numpy array first, plus pay the gRPC delay. In my case, a PyTorch version takes ~1.61 seconds to run 1973 items (batches of size 100), while Triton takes ~1.64 seconds. However, when we split the data into randomly sized (1-100) batches the times differ more: 2.384 seconds for PyTorch vs 2.75 for Triton. The random seed used to split the batches was fixed, and I ran 10 experiments; these are the mean values. The std is quite low, 0.02 seconds in both cases.
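For reference, the tensor-to-numpy conversion mentioned above looks roughly like this with the tritonclient gRPC API; the server address and the INT64 dtypes follow the config discussion earlier and are assumptions:

import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

# torch.Tensor inputs have to be converted to numpy arrays before sending.
ids = np.zeros((1, 512), dtype=np.int64)
mask = np.ones((1, 512), dtype=np.int64)

inputs = [
    grpcclient.InferInput("input__0", list(ids.shape), "INT64"),
    grpcclient.InferInput("input__1", list(mask.shape), "INT64"),
]
inputs[0].set_data_from_numpy(ids)
inputs[1].set_data_from_numpy(mask)

result = client.infer(model_name="bert", inputs=inputs)
logits = result.as_numpy("output__0")  # shape (1, 2)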