
Python backend cannot support KIND_GPU in model config

See original GitHub issue

Description

I am using the Python backend, and the code layout is as below:

root@a2719af22867:/server/docs/examples/demo_model_repository# tree
.
`-- pycuda
    |-- 1
    |   |-- model.py
    |   `-- triton_python_backend_utils.py
    `-- config.pbtxt

I am using the PyCUDA package, and model.py is as below:

import pycuda.autoinit
import pycuda.driver as drv
import numpy as np
from timeit import default_timer as timer
from pycuda.compiler import SourceModule

import sys
import json
import triton_python_backend_utils as pb_utils

class TritonPythonModel:
    """Your Python model must use the same class name. Every Python model
    that is created must have "TritonPythonModel" as the class name.
    """

    def initialize(self, args):
        """`initialize` is called only once when the model is being loaded.
        Implementing `initialize` function is optional. This function allows
        the model to initialize any state associated with this model.
        Parameters
        ----------
        args : dict
          Both keys and values are strings. The dictionary keys and values are:
          * model_config: A JSON string containing the model configuration
          * model_instance_kind: A string containing model instance kind
          * model_instance_device_id: A string containing model instance device ID
          * model_repository: Model repository path
          * model_version: Model version
          * model_name: Model name
        """
        mod = SourceModule("""
        __global__ void func(float *a, float *b, float *c, size_t N)
        {
          const int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < N)
          {
            c[i] = a[i] + b[i];
          }
        }
        """)

        self.func = mod.get_function("func")

        # You must parse model_config; Triton passes it in as a JSON string
        self.model_config = model_config = json.loads(args['model_config'])

        # Get OUTPUT0 configuration
        output0_config = pb_utils.get_output_config_by_name(
            model_config, "OUTPUT0")

        # Convert Triton types to numpy types
        self.output0_dtype = pb_utils.triton_string_to_numpy(
            output0_config['data_type'])

    def execute(self, requests):
        """`execute` MUST be implemented in every Python model. `execute`
        function receives a list of pb_utils.InferenceRequest as the only
        argument. This function is called when an inference request is made
        for this model. Depending on the batching configuration (e.g. Dynamic
        Batching) used, `requests` may contain multiple requests. Every
        Python model must create one pb_utils.InferenceResponse for every
        pb_utils.InferenceRequest in `requests`. If there is an error, you can
        set the error argument when creating a pb_utils.InferenceResponse
        Parameters
        ----------
        requests : list
          A list of pb_utils.InferenceRequest
        Returns
        -------
        list
          A list of pb_utils.InferenceResponse. The length of this list must
          be the same as `requests`
        """

        output0_dtype = self.output0_dtype
        responses = []

        # Every Python model must iterate over every one of the requests
        # and create a pb_utils.InferenceResponse for each of them.
        for request in requests:
            # Get INPUT0
            in_0 = pb_utils.get_input_tensor_by_name(request, "INPUT0")
            # Get INPUT1
            in_1 = pb_utils.get_input_tensor_by_name(request, "INPUT1")

            in_0 = in_0.as_numpy()
            in_1 = in_1.as_numpy()
            N = in_0.shape[0]
            # The kernel writes float32, so allocate the output buffer as float32
            out_0 = np.zeros(N, dtype=np.float32)
            # GPU run
            nThreads = 256
            nBlocks = int((N + nThreads - 1) / nThreads)
            start = timer()
            # Scalar kernel arguments must be NumPy scalars; the kernel declares
            # N as size_t, so pass it as np.uintp
            self.func(drv.In(in_0), drv.In(in_1), drv.Out(out_0), np.uintp(N),
                      block=(nThreads, 1, 1), grid=(nBlocks, 1))
            run_time = timer() - start
            print("gpu run time %f seconds " % run_time)

            # Create output tensors. You need pb_utils.Tensor
            # objects to create pb_utils.InferenceResponse.
            out_tensor_0 = pb_utils.Tensor("OUTPUT0",
                                           out_0.astype(output0_dtype))

            # Create InferenceResponse. You can set an error here in case
            # there was a problem with handling this inference request.
            # Below is an example of how you can set errors in inference
            # response:
            #
            # pb_utils.InferenceResponse(
            #    output_tensors=..., TritonError("An error occurred"))
            inference_response = pb_utils.InferenceResponse(
                output_tensors=[out_tensor_0])
            responses.append(inference_response)

        # You should return a list of pb_utils.InferenceResponse. Length
        # of this list must match the length of `requests` list.
        return responses

    def finalize(self):
        """`finalize` is called only once when the model is being unloaded.
        Implementing `finalize` function is OPTIONAL. This function allows
        the model to perform any necessary clean ups before exit.
        """
        print('Cleaning up...')

The config.pbtxt is as below:


name: "pycuda"
backend: "python"

input [
  {
    name: "INPUT0"
    data_type: TYPE_FP32
    dims: [-1]
  },
  {
    name: "INPUT1"
    data_type: TYPE_FP32
    dims: [-1]
  }
]
output [
  {
    name: "OUTPUT0"
    data_type: TYPE_FP32
    dims: [-1]
  }
]

instance_group [ { kind: KIND_GPU } ]
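
For reference, once this model loads, a minimal client call against it might look like the sketch below. This is not part of the original issue; it assumes the tritonclient Python package is installed and the server's default HTTP endpoint is reachable at localhost:8000.

import numpy as np
import tritonclient.http as httpclient

# Connect to the server's HTTP endpoint (assumed to be localhost:8000)
client = httpclient.InferenceServerClient(url="localhost:8000")

# Two float32 vectors matching the TYPE_FP32, dims [-1] inputs in config.pbtxt
a = np.random.rand(1024).astype(np.float32)
b = np.random.rand(1024).astype(np.float32)

inputs = [
    httpclient.InferInput("INPUT0", list(a.shape), "FP32"),
    httpclient.InferInput("INPUT1", list(b.shape), "FP32"),
]
inputs[0].set_data_from_numpy(a)
inputs[1].set_data_from_numpy(b)
outputs = [httpclient.InferRequestedOutput("OUTPUT0")]

# Request inference from the "pycuda" model and read back OUTPUT0
result = client.infer(model_name="pycuda", inputs=inputs, outputs=outputs)
print(result.as_numpy("OUTPUT0")[:5])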

However, when I run tritonserver --model-repository=/server/docs/examples/demo_model_repository, the server fails to start:

root@a2719af22867:/server/docs/examples/demo_model_repository# tritonserver --model-repository=/server/docs/examples/demo_model_repository
I1022 12:16:34.293417 2433 metrics.cc:184] found 1 GPUs supporting NVML metrics
I1022 12:16:34.298768 2433 metrics.cc:193]   GPU 0: GeForce RTX 2080 Ti
I1022 12:16:34.298944 2433 server.cc:120] Initializing Triton Inference Server
I1022 12:16:34.298950 2433 server.cc:121]   id: 'triton'
I1022 12:16:34.298953 2433 server.cc:122]   version: '2.3.0'
I1022 12:16:34.298956 2433 server.cc:128]   extensions:  classification sequence model_repository schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data statistics
I1022 12:16:34.452904 2433 pinned_memory_manager.cc:195] Pinned memory pool is created at '0x7ff7b6000000' with size 268435456
I1022 12:16:34.453240 2433 cuda_memory_manager.cc:98] CUDA memory pool is created on device 0 with size 67108864
I1022 12:16:34.454609 2433 model_repository_manager.cc:714] loading: pycuda:1
Terminated

If I change the kind to KIND_CPU, I get the error:

root@a2719af22867:/server/docs/examples/demo_model_repository# tritonserver --model-repository=/server/docs/examples/demo_model_repository
I1022 12:18:56.717615 2440 metrics.cc:184] found 1 GPUs supporting NVML metrics
I1022 12:18:56.722966 2440 metrics.cc:193]   GPU 0: GeForce RTX 2080 Ti
I1022 12:18:56.723140 2440 server.cc:120] Initializing Triton Inference Server
I1022 12:18:56.723146 2440 server.cc:121]   id: 'triton'
I1022 12:18:56.723149 2440 server.cc:122]   version: '2.3.0'
I1022 12:18:56.723153 2440 server.cc:128]   extensions:  classification sequence model_repository schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data statistics
I1022 12:18:56.892996 2440 pinned_memory_manager.cc:195] Pinned memory pool is created at '0x7fcc3c000000' with size 268435456
I1022 12:18:56.893324 2440 cuda_memory_manager.cc:98] CUDA memory pool is created on device 0 with size 67108864
I1022 12:18:56.894667 2440 model_repository_manager.cc:714] loading: pycuda:1
E1022 12:19:01.976813 2440 model_repository_manager.cc:899] failed to load 'pycuda' version 1: Internal: Exception calling application: error invoking 'nvcc --version': [Errno 2] No such file or directory: 'nvcc': 'nvcc'
I1022 12:19:01.976959 2440 server.cc:213] Waiting for in-flight requests to complete.
I1022 12:19:01.977016 2440 server.cc:228] Timeout 30: Found 0 live models and 0 in-flight non-inference requests
error: creating server: Internal - failed to load all models
root@a2719af22867:/server/docs/examples/demo_model_repository# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_Jul_22_19:09:09_PDT_2020
Cuda compilation tools, release 11.0, V11.0.221
Build cuda_11.0_bu.TC445_37.28845127_0

Triton Information
What version of Triton are you using? r20.09
Are you using the Triton container or did you build it yourself? Triton container.

To Reproduce
Steps to reproduce the behavior: as above.

Describe the models (framework, inputs, outputs), ideally including the model configuration file (if using an ensemble, include its model configuration file as well): as above.

Expected behavior
The Python backend should support an instance group with KIND_GPU.

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments:9 (5 by maintainers)

Top GitHub Comments

3 reactions
GuanLuo commented, Oct 23, 2020

@Tabrizian I agree with @KingsleyLiu-NV that whether KIND_GPU is supported should be determined by the model, not the backend. The model config instructs how the model is deployed, and it is the model's responsibility to follow the model config and return an error if it cannot be satisfied.
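
A minimal sketch of the pattern described in that comment (my own illustration, not code from this issue): the model reads the instance settings Triton passes through args and fails the load if the requested kind cannot be honored.

def initialize(self, args):
    # Assumption: args["model_instance_kind"] is "GPU" or "CPU" and
    # args["model_instance_device_id"] is a numeric string, as listed in the
    # initialize() docstring above.
    kind = args["model_instance_kind"]
    device_id = args["model_instance_device_id"]
    if kind != "GPU":
        # Raising here makes Triton report a load error for this model
        # instead of silently running on the wrong kind of device.
        raise RuntimeError("this model requires KIND_GPU, got KIND_" + kind)
    self.device_id = int(device_id)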

1 reaction
Tabrizian commented, Oct 24, 2020

@KingsleyLiu-NV Thank you for your complete report. I can confirm that there is a bug in the Python backend: shell environment variables are not available in the Python models. This will be fixed soon, and I will update you here when the fix is available. Regarding KIND_CPU/KIND_GPU: currently only KIND_CPU is accepted, but the model can still use the GPU. This will also be fixed so that both values are accepted. KIND_CPU or KIND_GPU does not affect the functionality of the Python backend at all; these values are passed in through the args variable so that your Python model can decide how to handle them.
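
Until those fixes land, one hedged workaround sketch for the nvcc symptom above (my own suggestion, not an official fix) is to make the CUDA toolchain visible inside the model before PyCUDA compiles the kernel. It assumes the CUDA toolkit is installed under /usr/local/cuda in the container.

import os

# Assumption: the CUDA toolkit lives under /usr/local/cuda in this container.
cuda_bin = "/usr/local/cuda/bin"
if cuda_bin not in os.environ.get("PATH", ""):
    os.environ["PATH"] = cuda_bin + os.pathsep + os.environ.get("PATH", "")

# Import PyCUDA only after PATH is patched, since SourceModule shells out
# to nvcc at kernel compile time.
import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule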

Read more comments on GitHub >

Top Results From Across the Web

  • Serving a Torch-TensorRT model with Triton - PyTorch
    Firstly, we setup a connection with the Triton Inference Server. # Setting up client client = httpclient.InferenceServerClient ...
  • Solving AI Inference Challenges with NVIDIA Triton
    Using the Python or C++ backends, you can define a custom script that can call any other model being served by Triton based...
  • Serving Predictions with NVIDIA Triton | Vertex AI
    Additionally, with a Triton Python backend, you can include any ... Run on CPU and GPU backends: Triton supports inference for models deployed...
  • Use Triton Inference Server with Amazon SageMaker
    The Triton Python backend uses shared memory (SHMEM) to connect your code to ... Support for multiple frameworks: Triton can be used to...
