Ensemble model using BLS: stub unhealthy
Description
I am trying to run a tokenizer + encoder pipeline for a model in Triton. This is implemented with Triton BLS, adding the encoder inference call inside model.py. However, this is the error message when inferencing the ensemble model through the Triton Python client:
Stub process is unhealthy and it will be restarted.
Triton Information: 22.01
Command used to start the Triton Docker container:
sudo docker run -it --rm --gpus all -p9000:8000 -p9001:8001 -p9002:8002 --shm-size 4g \
-v $PWD/triton_mbart:/models nvcr.io/nvidia/tritonserver:22.01-py3 \
bash -c "pip install transformers torch==1.10.2+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html && \
tritonserver --model-repository=/models"
To Reproduce
The model repo consists of two models (layout sketched after the list):
- encoder (mbart_onnx_encoder)
- tokenizer+encoder (mbart_tokenizer_encoder)
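The model repository (mounted at /models in the command above) follows the standard Triton layout; the version directories and file names below are an approximate sketch:
triton_mbart/
├── mbart_onnx_encoder/
│   ├── config.pbtxt
│   └── 1/
│       └── mbart_encoder.onnx
└── mbart_tokenizer_encoder/
    ├── config.pbtxt
    └── 1/
        ├── model.py
        └── ...  (tokenizer files loaded by AutoTokenizer.from_pretrained)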
Encoder
config.pbtxt:
name: "mbart_onnx_encoder"
platform: "onnxruntime_onnx"
max_batch_size : 0
default_model_filename: "mbart_encoder.onnx"
input [
{
name: "input"
data_type: TYPE_INT64
dims: [ -1, -1 ]
}
]
output [
{
name: "output"
data_type: TYPE_FP32
dims: [ -1, -1, 1024 ]
}
]
instance_group [
{
count: 1
kind: KIND_CPU
}
]
Tokenizer+Encoder
config.pbtxt:
name: "mbart_tokenizer_encoder"
max_batch_size: 0
backend: "python"
input [
{
name: "TEXT"
data_type: TYPE_STRING
dims: [ -1 ]
}
]
output [
{
name: "output"
data_type: TYPE_FP32
dims: [ -1, -1, 1024 ]
}
]
instance_group [
{
count: 1
kind: KIND_GPU
}
]
parameters: {
key: "FORCE_CPU_ONLY_INPUT_TENSORS"
value: {
string_value:"no"
}
}
model.py
import os
from typing import Dict, List

import torch
import numpy as np

try:
    # noinspection PyUnresolvedReferences
    import triton_python_backend_utils as pb_utils
except ImportError:
    pass  # triton_python_backend_utils exists only inside Triton Python backend.

from transformers import AutoTokenizer, PreTrainedTokenizer, TensorType


class TritonPythonModel:
    tokenizer: PreTrainedTokenizer

    def initialize(self, args: Dict[str, str]) -> None:
        """
        Initialize the tokenization process
        :param args: arguments from Triton config file
        """
        # more variables in https://github.com/triton-inference-server/python_backend/blob/main/src/python.cc
        path: str = os.path.join(args["model_repository"], args["model_version"])
        self.tokenizer = AutoTokenizer.from_pretrained(path)
        encoder_model = args["model_name"].replace("_tokenizer_encoder", "_onnx_encoder")

        def encoder_inference(input_ids):
            inputs = input_ids
            inference_request = pb_utils.InferenceRequest(
                model_name=encoder_model, requested_output_names=["output"], inputs=inputs
            )
            inference_response = inference_request.exec()
            return [pb_utils.get_output_tensor_by_name(inference_response, "output")]

        self.encoder_inference = encoder_inference

    def execute(self, requests) -> "List[List[pb_utils.Tensor]]":
        """
        Parse and tokenize each request
        :param requests: 1 or more requests received by Triton server.
        :return: text as input tensors
        """
        responses = []
        # for loop for batch requests (disabled in our case)
        for request in requests:
            # binary data typed back to string
            query = [t.decode("UTF-8") for t in pb_utils.get_input_tensor_by_name(request, "TEXT").as_numpy().tolist()]
            tokens: Dict[str, np.ndarray] = self.tokenizer(text=query, return_tensors=TensorType.NUMPY, return_attention_mask=False)
            # tensorrt uses int32 as input type, ort uses int64
            tokens = {k: v.astype(np.int64) for k, v in tokens.items()}
            # communicate the tokenization results to Triton server
            outputs = list()
            for input_name in self.tokenizer.model_input_names:
                tensor_input = pb_utils.Tensor(input_name, tokens[input_name])
                outputs.append(tensor_input)
                break  # onnx encoder is not expecting attention masks
            tokenizer_response = pb_utils.InferenceResponse(output_tensors=outputs)
            input_ids = [pb_utils.Tensor.from_dlpack("input_ids", pb_utils.get_output_tensor_by_name(tokenizer_response, "input_ids").to_dlpack())]
            encoder_outputs = self.encoder_inference(input_ids)
            inference_response = pb_utils.InferenceResponse(output_tensors=encoder_outputs)
            responses.append(inference_response)
        return responses
Expected behavior
I have successfully run inference on the encoder and the tokenizer separately using the Triton Python client. However, this error occurs when calling the ensemble model:
tritonclient.utils.InferenceServerException: Failed to process the request(s) for model instance 'mbart_tokenizer_encoder_0', message: Stub process is not healthy.
Note: I have already tried increasing the shm-size; it is currently set to 4g.
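For reference, a minimal sketch of the client call (model and tensor names as in the configs above, HTTP on the 9000 port mapping from the docker command; the actual script may differ slightly):

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:9000")

# "TEXT" is declared as TYPE_STRING with dims [ -1 ] in the tokenizer+encoder config
text = np.array(["Hello world".encode("utf-8")], dtype=np.object_)
text_input = httpclient.InferInput("TEXT", text.shape, "BYTES")
text_input.set_data_from_numpy(text)

result = client.infer("mbart_tokenizer_encoder", inputs=[text_input])
print(result.as_numpy("output").shape)  # expected (1, seq_len, 1024)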
My end goal is to use an encoder-decoder model with beam search in Triton. It would be great if you could share any suggestions/resources for this.
Top GitHub Comments
I wonder whether this issue was actually resolved or just closed due to inactivity. I am running into a similar problem.
Thanks for providing further details. You need to check the inference response of the BLS request to make sure that it doesn’t have any error in this line:
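A minimal sketch of such a check, applied to the encoder_inference helper defined in model.py above (a sketch using the pb_utils error API, assuming that raising a TritonModelException is an acceptable way to surface the failure):

def encoder_inference(input_ids):
    inference_request = pb_utils.InferenceRequest(
        model_name=encoder_model, requested_output_names=["output"], inputs=input_ids
    )
    inference_response = inference_request.exec()
    # Check the BLS response before using it; a failed request that is
    # passed along silently can show up later only as an unhealthy stub.
    if inference_response.has_error():
        raise pb_utils.TritonModelException(inference_response.error().message())
    return [pb_utils.get_output_tensor_by_name(inference_response, "output")]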
I received the error below: