Python backend cannot support KIND_GPU in model config
Description
I am using the Python backend, and the code layout is as below:
root@a2719af22867:/server/docs/examples/demo_model_repository# tree
.
`-- pycuda
    |-- 1
    |   |-- model.py
    |   `-- triton_python_backend_utils.py
    `-- config.pbtxt
I am using the PyCUDA package, and model.py is as below:
import pycuda.autoinit
import pycuda.driver as drv
import numpy as np
from timeit import default_timer as timer
from pycuda.compiler import SourceModule
import sys
import json

import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    """Your Python model must use the same class name. Every Python model
    that is created must have "TritonPythonModel" as the class name.
    """

    def initialize(self, args):
        """`initialize` is called only once when the model is being loaded.
        Implementing `initialize` function is optional. This function allows
        the model to initialize any state associated with this model.

        Parameters
        ----------
        args : dict
          Both keys and values are strings. The dictionary keys and values are:
          * model_config: A JSON string containing the model configuration
          * model_instance_kind: A string containing model instance kind
          * model_instance_device_id: A string containing model instance device ID
          * model_repository: Model repository path
          * model_version: Model version
          * model_name: Model name
        """
        mod = SourceModule("""
        __global__ void func(float *a, float *b, float *c, size_t N)
        {
            const int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < N)
            {
                c[i] = a[i] + b[i];
            }
        }
        """)
        self.func = mod.get_function("func")

        # You must parse model_config. JSON string is not parsed here
        self.model_config = model_config = json.loads(args['model_config'])

        # Get OUTPUT0 configuration
        output0_config = pb_utils.get_output_config_by_name(
            model_config, "OUTPUT0")

        # Convert Triton types to numpy types
        self.output0_dtype = pb_utils.triton_string_to_numpy(
            output0_config['data_type'])

    def execute(self, requests):
        """`execute` MUST be implemented in every Python model. `execute`
        function receives a list of pb_utils.InferenceRequest as the only
        argument. This function is called when an inference request is made
        for this model. Depending on the batching configuration (e.g. Dynamic
        Batching) used, `requests` may contain multiple requests. Every
        Python model must create one pb_utils.InferenceResponse for every
        pb_utils.InferenceRequest in `requests`. If there is an error, you can
        set the error argument when creating a pb_utils.InferenceResponse

        Parameters
        ----------
        requests : list
          A list of pb_utils.InferenceRequest

        Returns
        -------
        list
          A list of pb_utils.InferenceResponse. The length of this list must
          be the same as `requests`
        """
        output0_dtype = self.output0_dtype

        responses = []

        # Every Python backend must iterate over every one of the requests
        # and create a pb_utils.InferenceResponse for each of them.
        for request in requests:
            # Get INPUT0
            in_0 = pb_utils.get_input_tensor_by_name(request, "INPUT0")
            # Get INPUT1
            in_1 = pb_utils.get_input_tensor_by_name(request, "INPUT1")

            in_0 = in_0.as_numpy()
            in_1 = in_1.as_numpy()

            N = in_0.shape[0]
            # The kernel writes float32, so the output buffer must be float32 too.
            out_0 = np.zeros(N, dtype=np.float32)

            # GPU run
            nThreads = 256
            nBlocks = int((N + nThreads - 1) / nThreads)
            start = timer()
            # PyCUDA expects scalar kernel arguments as numpy scalars, so the
            # size_t parameter is passed as np.uint64.
            self.func(drv.In(in_0), drv.In(in_1), drv.Out(out_0), np.uint64(N),
                      block=(nThreads, 1, 1), grid=(nBlocks, 1))
            run_time = timer() - start
            print("gpu run time %f seconds " % run_time)

            # Create output tensors. You need pb_utils.Tensor
            # objects to create pb_utils.InferenceResponse.
            out_tensor_0 = pb_utils.Tensor("OUTPUT0",
                                           out_0.astype(output0_dtype))

            # Create InferenceResponse. You can set an error here in case
            # there was a problem with handling this inference request.
            # Below is an example of how you can set errors in inference
            # response:
            #
            # pb_utils.InferenceResponse(
            #     output_tensors=..., TritonError("An error occurred"))
            inference_response = pb_utils.InferenceResponse(
                output_tensors=[out_tensor_0])
            responses.append(inference_response)

        # You should return a list of pb_utils.InferenceResponse. Length
        # of this list must match the length of `requests` list.
        return responses

    def finalize(self):
        """`finalize` is called only once when the model is being unloaded.
        Implementing `finalize` function is OPTIONAL. This function allows
        the model to perform any necessary clean ups before exit.
        """
        print('Cleaning up...')
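For completeness, once such a model loads, a request against these inputs and outputs could be sent with the Triton client library along these lines. This is only a sketch: it assumes the tritonclient HTTP package and a server listening on localhost:8000 (in releases as old as r20.09 the client module was named tritonhttpclient instead).

import numpy as np
import tritonclient.http as httpclient

# Connect to the local Triton HTTP endpoint (default port 8000).
client = httpclient.InferenceServerClient(url="localhost:8000")

# Two FP32 vectors matching the INPUT0/INPUT1 definitions in config.pbtxt.
a = np.random.rand(1024).astype(np.float32)
b = np.random.rand(1024).astype(np.float32)

inputs = [
    httpclient.InferInput("INPUT0", list(a.shape), "FP32"),
    httpclient.InferInput("INPUT1", list(b.shape), "FP32"),
]
inputs[0].set_data_from_numpy(a)
inputs[1].set_data_from_numpy(b)

# Request OUTPUT0 and read it back as a numpy array.
result = client.infer(model_name="pycuda", inputs=inputs,
                      outputs=[httpclient.InferRequestedOutput("OUTPUT0")])
print(result.as_numpy("OUTPUT0")[:8])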
The config.pbtxt is as below:
name: "pycuda"
backend: "python"
input [
  {
    name: "INPUT0"
    data_type: TYPE_FP32
    dims: [-1]
  }
]
input [
  {
    name: "INPUT1"
    data_type: TYPE_FP32
    dims: [-1]
  }
]
output [
  {
    name: "OUTPUT0"
    data_type: TYPE_FP32
    dims: [-1]
  }
]
instance_group [ { kind: KIND_GPU } ]
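For reference, an instance_group entry can also pin the instance count and a specific device; the count and gpus values below are only illustrative:

instance_group [ { count: 1, kind: KIND_GPU, gpus: [ 0 ] } ]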
However, when I run tritonserver --model-repository=/server/docs/examples/demo_model_repository, the server fails to start:
root@a2719af22867:/server/docs/examples/demo_model_repository# tritonserver --model-repository=/server/docs/examples/demo_model_repository
I1022 12:16:34.293417 2433 metrics.cc:184] found 1 GPUs supporting NVML metrics
I1022 12:16:34.298768 2433 metrics.cc:193] GPU 0: GeForce RTX 2080 Ti
I1022 12:16:34.298944 2433 server.cc:120] Initializing Triton Inference Server
I1022 12:16:34.298950 2433 server.cc:121] id: 'triton'
I1022 12:16:34.298953 2433 server.cc:122] version: '2.3.0'
I1022 12:16:34.298956 2433 server.cc:128] extensions: classification sequence model_repository schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data statistics
I1022 12:16:34.452904 2433 pinned_memory_manager.cc:195] Pinned memory pool is created at '0x7ff7b6000000' with size 268435456
I1022 12:16:34.453240 2433 cuda_memory_manager.cc:98] CUDA memory pool is created on device 0 with size 67108864
I1022 12:16:34.454609 2433 model_repository_manager.cc:714] loading: pycuda:1
Terminated
If I change the kind to KIND_CPU, I get the following error:
root@a2719af22867:/server/docs/examples/demo_model_repository# tritonserver --model-repository=/server/docs/examples/demo_model_repository
I1022 12:18:56.717615 2440 metrics.cc:184] found 1 GPUs supporting NVML metrics
I1022 12:18:56.722966 2440 metrics.cc:193] GPU 0: GeForce RTX 2080 Ti
I1022 12:18:56.723140 2440 server.cc:120] Initializing Triton Inference Server
I1022 12:18:56.723146 2440 server.cc:121] id: 'triton'
I1022 12:18:56.723149 2440 server.cc:122] version: '2.3.0'
I1022 12:18:56.723153 2440 server.cc:128] extensions: classification sequence model_repository schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data statistics
I1022 12:18:56.892996 2440 pinned_memory_manager.cc:195] Pinned memory pool is created at '0x7fcc3c000000' with size 268435456
I1022 12:18:56.893324 2440 cuda_memory_manager.cc:98] CUDA memory pool is created on device 0 with size 67108864
I1022 12:18:56.894667 2440 model_repository_manager.cc:714] loading: pycuda:1
E1022 12:19:01.976813 2440 model_repository_manager.cc:899] failed to load 'pycuda' version 1: Internal: Exception calling application: error invoking 'nvcc --version': [Errno 2] No such file or directory: 'nvcc': 'nvcc'
I1022 12:19:01.976959 2440 server.cc:213] Waiting for in-flight requests to complete.
I1022 12:19:01.977016 2440 server.cc:228] Timeout 30: Found 0 live models and 0 in-flight non-inference requests
error: creating server: Internal - failed to load all models
root@a2719af22867:/server/docs/examples/demo_model_repository# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_Jul_22_19:09:09_PDT_2020
Cuda compilation tools, release 11.0, V11.0.221
Build cuda_11.0_bu.TC445_37.28845127_0
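The "No such file or directory: 'nvcc'" failure suggests the Python backend process does not see the same PATH as the interactive shell. A possible workaround, only a sketch and assuming the CUDA toolkit is installed under /usr/local/cuda inside the container, is to extend PATH at the top of model.py before pycuda compiles the kernel:

import os

# Make nvcc reachable for pycuda's SourceModule compilation in case the
# backend process does not inherit the shell's PATH.
os.environ["PATH"] = "/usr/local/cuda/bin:" + os.environ.get("PATH", "")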
Triton Information
What version of Triton are you using? r20.09
Are you using the Triton container or did you build it yourself? Triton container.
To Reproduce
Steps to reproduce the behavior: as above.
Describe the models (framework, inputs, outputs), ideally include the model configuration file (if using an ensemble include the model configuration file for that as well): as above.
Expected behavior
The Python backend should support an instance group with KIND_GPU.
Top GitHub Comments
@Tabrizian I agree with @KingsleyLiu-NV that whether KIND_GPU is supported should be determined by the model, not the backend. The model config instructs how the model is deployed, and it is the model's responsibility to follow the model config and return an error if it cannot be satisfied.
@KingsleyLiu-NV Thank you for your complete report. I can confirm that there is a bug in the Python backend: the shell environment variables are not available in the Python models. This will be fixed soon, and I will update you here when the fix is available. Regarding KIND_CPU/KIND_GPU: currently only KIND_CPU is accepted, but the model can still use the GPU. This will also be fixed so that both values are accepted. KIND_CPU or KIND_GPU does not affect the functionality of the Python backend at all; these values are passed in through the args variable so that your Python model can decide how to handle them.
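As a sketch of what is described above (it relies only on the args keys listed in the model.py docstring and assumes pycuda is available), initialize could select its CUDA device from the instance settings instead of relying on pycuda.autoinit:

import pycuda.driver as drv


class TritonPythonModel:
    def initialize(self, args):
        # Triton passes the configured instance kind and device id as strings.
        kind = args["model_instance_kind"]               # e.g. "GPU" or "CPU"
        device_id = int(args["model_instance_device_id"])

        if kind == "GPU":
            # Create the CUDA context on the device Triton assigned to this
            # instance rather than on pycuda.autoinit's default device.
            drv.init()
            self.context = drv.Device(device_id).make_context()

    def finalize(self):
        # Release the context created in initialize, if any.
        if getattr(self, "context", None) is not None:
            self.context.pop()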