BLS script + FORCE_CPU_ONLY_INPUT_TENSORS -> output tensor from ORT is NEVER on GPU memory
Description
In a Python (BLS) script, when I retrieve the output of a GPU model (ONNX Runtime), it is always a CPU tensor (over 64 ONNX Runtime calls in a row). DLPack is used to move tensors to/from ONNX Runtime.
Because of that, I need to move the result back to CUDA memory before resending it to the model (generative language model use case). This CUDA -> CPU host -> CUDA round trip takes time during inference.
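For a rough sense of that cost, here is a standalone timing sketch (outside Triton); the tensor shape is an assumption, loosely based on the [-1, -1, 50257] logits output declared in the ONNX Runtime config below:

import time

import torch

# Assumed logits shape, loosely matching the ONNX model output [-1, -1, 50257].
logits = torch.randn(1, 256, 50257, device="cuda")

torch.cuda.synchronize()
start = time.perf_counter()
round_trip = logits.cpu().cuda()  # CUDA -> host -> CUDA, the copy discussed above
torch.cuda.synchronize()
print(f"round trip: {(time.perf_counter() - start) * 1000:.1f} ms")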
FORCE_CPU_ONLY_INPUT_TENSORS is set to "no" (see the config below).
The flow itself works apart from the memory placement issue: it doesn’t crash and the results are as expected.
According to this example, nothing special is needed to keep tensors on GPU, but the example doesn’t involve calling an external model; it’s just the script calling itself: https://github.com/triton-inference-server/server/blob/fd7ceefeb5096add392b12f829b7d77b2b59b73c/qa/python_models/dlpack_io_identity/model.py
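For reference, here is a condensed sketch of the DLPack identity pattern that example demonstrates (not a verbatim copy of that file; the INPUT0/OUTPUT0 names are placeholders). The output tensor is built straight from the input’s DLPack capsule, so whatever device the input lives on is preserved without a copy:

import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            # Re-wrap the input's DLPack capsule as the output tensor;
            # the device of the input is kept as-is, no explicit copy.
            input_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT0")
            output_tensor = pb_utils.Tensor.from_dlpack("OUTPUT0", input_tensor.to_dlpack())
            responses.append(pb_utils.InferenceResponse(output_tensors=[output_tensor]))
        return responses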
Triton Information
What version of Triton are you using? nvcr.io/nvidia/tritonserver:21.12-py3
Are you using the Triton container or did you build it yourself? Docker image from the NVIDIA repository
To Reproduce
Config of the BLS script (I also tried kind: KIND_GPU, just in case):
name: "transformer_generative_model"
max_batch_size: 0
backend: "python"
input [
{
name: "TEXT"
data_type: TYPE_STRING
dims: [ -1 ]
}
]
output [
{
name: "output"
data_type: TYPE_STRING
dims: [ -1 ]
}
]
instance_group [
{
count: 1
kind: KIND_CPU
}
]
parameters: {
key: "FORCE_CPU_ONLY_INPUT_TENSORS"
value: {
string_value:"no"
}
}
ONNX Runtime config
name: "transformer_onnx_model"
max_batch_size: 0
platform: "onnxruntime_onnx"
default_model_filename: "model.bin"
input [
{
name: "input_ids"
data_type: TYPE_INT32
dims: [-1, -1]
}
]
output {
name: "output"
data_type: TYPE_FP32
dims: [-1, -1, 50257]
}
instance_group [
{
count: 1
kind: KIND_GPU
}
]
The BLS script
This function is called 64 times by the BLS script, with an input tensor that is always in CUDA memory. The print calls show that the output is always on CPU, which is the issue I am trying to fix/understand:
...
def inference_triton(input_ids: torch.Tensor) -> torch.Tensor:
    print(f"input_ids device: {input_ids.device}")  # always cuda
    input_ids = input_ids.type(dtype=torch.int32)
    inputs = [pb_utils.Tensor.from_dlpack("input_ids", torch.to_dlpack(input_ids))]
    inference_request = pb_utils.InferenceRequest(
        model_name='transformer_onnx_model',
        requested_output_names=['output'],
        inputs=inputs)
    inference_response = inference_request.exec()
    if inference_response.has_error():
        raise pb_utils.TritonModelException(inference_response.error().message())
    else:
        output = pb_utils.get_output_tensor_by_name(inference_response, 'output')
        print(f"is cpu {output.is_cpu()}")  # always True :-(
        tensor: torch.Tensor = torch.from_dlpack(output.to_dlpack())
        print(f'output device: {tensor.device}')  # always CPU :-(
        tensor = tensor.cuda()  # host -> CUDA copy, takes time :-(
        return tensor
...
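For context, a hypothetical sketch of the generation loop that drives the 64 calls (not the author’s actual model.py; greedy decoding is an assumption):

import torch

def generate(input_ids: torch.Tensor, new_tokens: int = 64) -> torch.Tensor:
    # Each generated token triggers one call to inference_triton(), and every
    # call currently pays the host -> CUDA copy at the end of that helper.
    for _ in range(new_tokens):
        logits = inference_triton(input_ids)          # [batch, seq_len, 50257]
        next_token = logits[:, -1, :].argmax(dim=-1)  # greedy decoding (assumption)
        input_ids = torch.cat([input_ids, next_token.unsqueeze(-1).to(input_ids.dtype)], dim=-1)
    return input_ids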
Expected behavior
I understand that with the FORCE_CPU_ONLY_INPUT_TENSORS option set to "no" there is no guarantee that the output tensor from the ONNX Runtime model is in CUDA memory, but I would expect that to be the case most of the time, or at least some of the time.
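Put differently, the expectation is that a check like the following (a hypothetical sketch, not code from the issue) would take the no-copy branch at least occasionally, whereas in practice output.is_cpu() is always True:

output = pb_utils.get_output_tensor_by_name(inference_response, 'output')
if not output.is_cpu():
    # Expected at least some of the time: tensor already in CUDA memory, no copy.
    tensor = torch.from_dlpack(output.to_dlpack())
else:
    # Observed on every one of the 64 calls: host tensor, extra copy needed.
    tensor = torch.from_dlpack(output.to_dlpack()).cuda()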
Top GitHub Comments
@pommedeterresautee 22.02 will be released about a month from now.
I assume by TRT engine you mean the TensorRT backend models? I don’t think there is a similar issue with the TRT models.
cc @dzier