Failing to get output tensor on GPU device
Description
Hello,
I’m trying to get an output tensor on the GPU device when making an InferenceRequest (BLS) from the Python backend.
Triton Information
What version of Triton are you using? Are you using the Triton container or did you build it yourself?
nvcr.io/nvidia/tritonserver:22.06-py3
To Reproduce
Here is a simple reproduction: I make a BLS request to a simple ONNX model and print whether the output tensor is on GPU or CPU.
model.py:
import triton_python_backend_utils as pb_utils
import json
import asyncio


class TritonPythonModel:
    def initialize(self, args):
        self.model_config = json.loads(args['model_config'])

    async def execute(self, requests):
        responses = []
        for request in requests:
            in_0 = pb_utils.get_input_tensor_by_name(request, "input")
            inference_response_awaits = []
            infer_request = pb_utils.InferenceRequest(
                model_name="onnx",
                requested_output_names=["output"],
                inputs=[in_0])
            inference_response_awaits.append(infer_request.async_exec())
            inference_responses = await asyncio.gather(
                *inference_response_awaits)
            for infer_response in inference_responses:
                if infer_response.has_error():
                    raise pb_utils.TritonModelException(
                        infer_response.error().message())
            pytorch_output0_tensor = pb_utils.get_output_tensor_by_name(
                inference_responses[0], "output")
            # Here we print if the tensor is on CPU or GPU
            print(pytorch_output0_tensor.is_cpu())
            inference_response = pb_utils.InferenceResponse(
                output_tensors=[pytorch_output0_tensor])
            responses.append(inference_response)
        return responses

    def finalize(self):
        print('Cleaning up...')
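As an aside, the print above only tells me where the output landed; when an output tensor does end up on the GPU, as_numpy() cannot be used and the tensor is consumed through DLPack instead. A minimal sketch, assuming torch is installed in the backend's Python environment (consume_output is just an illustrative helper, not part of the repro):

import triton_python_backend_utils as pb_utils
from torch.utils.dlpack import from_dlpack

def consume_output(infer_response):
    # Illustrative helper: read the "output" tensor regardless of placement.
    out = pb_utils.get_output_tensor_by_name(infer_response, "output")
    if out.is_cpu():
        # Host path: as_numpy() is only valid for CPU tensors.
        return out.as_numpy()
    # GPU path: hand the DLPack capsule to PyTorch without copying to host.
    return from_dlpack(out.to_dlpack())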
name: "bls_async2"
backend: "python"
max_batch_size: 0
input [
{
name: "input"
data_type: TYPE_FP32
dims: [ 1, 3 ]
}
]
output [
{
name: "output"
data_type: TYPE_FP32
dims: [ 1, 3 ]
}
]
instance_group [
{
count: 1
kind: KIND_GPU
}
]
parameters: { key: "FORCE_CPU_ONLY_INPUT_TENSORS" value: {string_value:"no"}}
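Not shown above is the config of the composing "onnx" model that the BLS call targets; where its outputs end up also depends on how that model is deployed. Purely for illustration (a hypothetical sketch, not copied from the attached zip), a minimal config pinning it to the GPU could look like:

name: "onnx"
platform: "onnxruntime_onnx"
max_batch_size: 0
input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 1, 3 ]
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ 1, 3 ]
  }
]
instance_group [
  {
    count: 1
    kind: KIND_GPU
  }
]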
I run the server with
tritonserver --model-repository `pwd`/models --model-control-mode=poll --repository-poll-secs 2 --log-verbose 100
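For reference, an inference against the BLS model can be triggered with something along these lines (a hypothetical client sketch; the actual make_request.py in the attached zip may differ):

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
inp = httpclient.InferInput("input", [1, 3], "FP32")
inp.set_data_from_numpy(np.random.rand(1, 3).astype(np.float32))
result = client.infer(model_name="bls_async2", inputs=[inp])
print(result.as_numpy("output"))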
In the log I see it enters here: https://github.com/triton-inference-server/onnxruntime_backend/blob/5568172eab065ae9bf31fe9dc1e2bed9dfc363d9/src/onnxruntime.cc#L1640
The model above prints True, because is_cpu() is true for the output tensor.
Expected behavior
is_cpu() should be False for the output tensor.
I’ve attached a zip containing the example models; you just need to run python3 make_request.py to run an inference -> issue.zip
Top GitHub Comments
@Tabrizian Works like a charm ⭐
Thanks
Thanks a lot! Will try.