
Ensemble model using BLS: stub unhealthy


Description

I am trying to run a tokenizer + encoder pipeline for a model in Triton. It is implemented with Triton BLS by adding the encoder inference call inside the tokenizer's model.py. However, when I run inference on the ensemble model through the Triton Python client, I get the error message "Stub process is unhealthy and it will be restarted".
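
For context, a minimal sketch of how such a client call might look, using the model and tensor names from the configs below (the server URL and the sample text are assumptions):

import numpy as np
import tritonclient.http as httpclient

# Hypothetical client-side call; "TEXT", "output" and the model name come from
# the configs below, the URL and sample sentence are placeholders.
client = httpclient.InferenceServerClient(url="localhost:9000")

# TYPE_STRING inputs are sent as BYTES tensors backed by an object-dtype array.
text = np.array(["Hello world".encode("utf-8")], dtype=object)
infer_input = httpclient.InferInput("TEXT", [1], "BYTES")
infer_input.set_data_from_numpy(text)

result = client.infer(
    model_name="mbart_tokenizer_encoder",
    inputs=[infer_input],
    outputs=[httpclient.InferRequestedOutput("output")],
)
print(result.as_numpy("output").shape)  # expected: (1, seq_len, 1024)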

Triton Information

22.01

Command to run docker triton:

sudo docker run -it --rm --gpus all -p9000:8000 -p9001:8001 -p9002:8002 --shm-size 4g \
-v $PWD/triton_mbart:/models nvcr.io/nvidia/tritonserver:22.01-py3 \
bash -c "pip install transformers torch==1.10.2+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html && \
tritonserver --model-repository=/models"

To Reproduce

The model repository consists of two models (a sketch of the expected directory layout follows this list):

  • encoder (mbart_onnx_encoder)
  • tokenizer+encoder (mbart_tokenizer_encoder)
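
A sketch of the model repository layout implied by the configs and model.py below (the tokenizer filenames are assumptions; model.py loads them from the version directory with AutoTokenizer.from_pretrained):

triton_mbart/
├── mbart_onnx_encoder/
│   ├── config.pbtxt
│   └── 1/
│       └── mbart_encoder.onnx        # matches default_model_filename
└── mbart_tokenizer_encoder/
    ├── config.pbtxt
    └── 1/
        ├── model.py                  # Python backend entry point
        ├── tokenizer_config.json     # assumed HuggingFace tokenizer files
        ├── sentencepiece.bpe.model   # (exact filenames depend on the tokenizer)
        └── ...
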
Encoder

config.pbtxt:

name: "mbart_onnx_encoder"
platform: "onnxruntime_onnx"
max_batch_size : 0
default_model_filename: "mbart_encoder.onnx"

input [
  {
    name: "input"
    data_type: TYPE_INT64
    dims: [ -1, -1 ]
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ -1, -1, 1024 ]
  }
]

instance_group [
    {
      count: 1
      kind: KIND_CPU
    }
]

Tokenizer+Encoder

config.pbtxt:

ame: "mbart_tokenizer_encoder"
max_batch_size: 0
backend: "python"

input [
{
    name: "TEXT"
    data_type: TYPE_STRING
    dims: [ -1 ]
}
]

output [
{
    name: "output"
    data_type: TYPE_FP32
    dims: [ -1, -1, 1024 ]
}
]

instance_group [
    {
      count: 1
      kind: KIND_GPU
    }
]
parameters: {
  key: "FORCE_CPU_ONLY_INPUT_TENSORS"
  value: {
    string_value:"no"
  }
}

model.py

import os
from typing import Dict, List

import torch
import numpy as np


try:
    # noinspection PyUnresolvedReferences
    import triton_python_backend_utils as pb_utils
except ImportError:
    pass  # triton_python_backend_utils exists only inside Triton Python backend.

from transformers import AutoTokenizer, PreTrainedTokenizer, TensorType


class TritonPythonModel:
    tokenizer: PreTrainedTokenizer

    def initialize(self, args: Dict[str, str]) -> None:
        """
        Initialize the tokenization process
        :param args: arguments from Triton config file
        """
        # more variables in https://github.com/triton-inference-server/python_backend/blob/main/src/python.cc
        path: str = os.path.join(args["model_repository"], args["model_version"])
        self.tokenizer = AutoTokenizer.from_pretrained(path)
        encoder_model = args["model_name"].replace("_tokenizer_encoder", "_onnx_encoder")

        def encoder_inference(input_ids):
            inputs = input_ids
            inference_request = pb_utils.InferenceRequest(
                model_name=encoder_model, requested_output_names=["output"], inputs=inputs
            )
            inference_response = inference_request.exec()
            return [pb_utils.get_output_tensor_by_name(inference_response, "output")]

        self.encoder_inference = encoder_inference

    def execute(self, requests) -> "List[List[pb_utils.Tensor]]":
        """
        Parse and tokenize each request
        :param requests: 1 or more requests received by Triton server.
        :return: text as input tensors
        """
        responses = []
        # for loop for batch requests (disabled in our case)
        for request in requests:
            # binary data typed back to string
            query = [t.decode("UTF-8") for t in pb_utils.get_input_tensor_by_name(request, "TEXT").as_numpy().tolist()]
            tokens: Dict[str, np.ndarray] = self.tokenizer(text=query, return_tensors=TensorType.NUMPY, return_attention_mask=False)
            # tensorrt uses int32 as input type, ort uses int64
            tokens = {k: v.astype(np.int64) for k, v in tokens.items()}
            # communicate the tokenization results to Triton server
            outputs = list()
            for input_name in self.tokenizer.model_input_names:
                tensor_input = pb_utils.Tensor(input_name, tokens[input_name])
                outputs.append(tensor_input)
                break # onnx encoder is not expecting attention masks
            tokenizer_response = pb_utils.InferenceResponse(output_tensors=outputs)
            input_ids = [pb_utils.Tensor.from_dlpack("input_ids", pb_utils.get_output_tensor_by_name(tokenizer_response, "input_ids").to_dlpack())]
            encoder_outputs = self.encoder_inference(input_ids)
            inference_response = pb_utils.InferenceResponse(output_tensors=encoder_outputs)

            responses.append(inference_response)

        return responses

Expected behavior

I have successfully run inference on the encoder and the tokenizer separately using the Triton Python client. However, when calling the ensemble model I get:

tritonclient.utils.InferenceServerException: Failed to process the request(s) for model instance 'mbart_tokenizer_encoder_0', message: Stub process is not healthy.

Note: I have already tried increasing --shm-size; it is currently set to 4g.

My end goal is to run an encoder-decoder model with beam search in Triton. It would be great if you could share any suggestions/resources for this.

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 13 (7 by maintainers)

Top GitHub Comments

1 reaction
lbjcom commented, Dec 7, 2022

I wonder whether this issue was resolved or just closed due to inactivity. I have a similar problem:

  • tritonserver:22.03
  • model1: python backend, KIND_CPU
  • model2: python backend, KIND_GPU
  • These models communicate through BLS.

1 reaction
Tabrizian commented, Apr 1, 2022

Thanks for providing further details. You need to check the inference response of the BLS request on this line to make sure it doesn't contain an error:

inference_response = inference_request.exec()

I received the error below:

"unexpected inference input 'input_ids' for model 'mbart_onnx_encoder'"