BLS script + FORCE_CPU_ONLY_INPUT_TENSORS -> output tensor from ORT is NEVER on GPU memory

Description

In a Python BLS script, when I retrieve the output of a GPU model (ONNX Runtime), it is always a CPU tensor (across 64 consecutive ONNX Runtime calls). DLPack is used to move tensors to and from ONNX Runtime. Because of this, I have to move the result back to CUDA memory before resending it to the model (generative language model use case). This CUDA -> host CPU -> CUDA round trip costs time during inference. FORCE_CPU_ONLY_INPUT_TENSORS is set to "no" (see the config below).

The flow works (aside from the memory issue): it doesn't crash, and the result is as expected.

According to this example, nothing special is needed to keep a tensor on the GPU, but the example doesn't involve calling an external model; the script simply echoes its own input: https://github.com/triton-inference-server/server/blob/fd7ceefeb5096add392b12f829b7d77b2b59b73c/qa/python_models/dlpack_io_identity/model.py
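
For reference, here is a minimal sketch of what such a DLPack identity model's execute() can look like (my paraphrase, assuming the standard Python-backend API and input/output names "INPUT0"/"OUTPUT0"; the linked model.py is the authoritative version):

import triton_python_backend_utils as pb_utils
from torch.utils.dlpack import from_dlpack, to_dlpack


class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            in_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT0")
            # Zero-copy round trip through DLPack: the tensor stays in
            # whatever memory (CPU or GPU) it arrived in.
            torch_tensor = from_dlpack(in_tensor.to_dlpack())
            out_tensor = pb_utils.Tensor.from_dlpack("OUTPUT0", to_dlpack(torch_tensor))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out_tensor]))
        return responses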

Triton Information

What version of Triton are you using? nvcr.io/nvidia/tritonserver:21.12-py3

Are you using the Triton container or did you build it yourself? Docker image from the NVIDIA repo.

To Reproduce

Config of the BLS script (I also tried kind: KIND_GPU, just in case):

name: "transformer_generative_model"
max_batch_size: 0
backend: "python"

input [
    {
        name: "TEXT"
        data_type: TYPE_STRING
        dims: [ -1 ]
    }
]

output [
    {
        name: "output"
        data_type: TYPE_STRING
        dims: [ -1 ]
    }
]

instance_group [
    {
      count: 1
      kind: KIND_CPU
    }
]

parameters: {
  key: "FORCE_CPU_ONLY_INPUT_TENSORS"
  value: {
    string_value:"no"
  }
}

ONNX Runtime config

name: "transformer_onnx_model"
max_batch_size: 0
platform: "onnxruntime_onnx"
default_model_filename: "model.bin"

input [
    {
        name: "input_ids"
        data_type: TYPE_INT32
        dims: [-1, -1]
    }
]

output {
    name: "output"
    data_type: TYPE_FP32
    dims: [-1, -1, 50257]
}

instance_group [
    {
      count: 1
      kind: KIND_GPU
    }
]

The BLS script

This function is called 64 times in a row by the BLS script, with an input tensor that is always in CUDA memory. The print calls show that the output returned by Triton is always on the CPU, which is the issue I am trying to fix/understand:

...
        def inference_triton(input_ids: torch.Tensor) -> torch.Tensor:
            print(f"input_ids device: {input_ids.device}")  # always cuda
            input_ids = input_ids.type(dtype=torch.int32)
            # hand the CUDA tensor to Triton without a copy, via DLPack
            inputs = [pb_utils.Tensor.from_dlpack("input_ids", torch.to_dlpack(input_ids))]
            inference_request = pb_utils.InferenceRequest(
                model_name='transformer_onnx_model',
                requested_output_names=['output'],
                inputs=inputs)
            inference_response = inference_request.exec()
            if inference_response.has_error():
                raise pb_utils.TritonModelException(inference_response.error().message())
            else:
                output = pb_utils.get_output_tensor_by_name(inference_response, 'output')
                print(f"is cpu {output.is_cpu()}")  # always true :-(
                tensor: torch.Tensor = torch.from_dlpack(output.to_dlpack())
                print(f'output device: {tensor.device}')  # always CPU :-(
                tensor = tensor.cuda()  # host-to-device copy, takes time :-(
                return tensor
...
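
Until Triton actually hands the output back in GPU memory, a defensive workaround is to pay the host-to-device copy only when it is really needed. A minimal sketch (output_to_cuda is a hypothetical helper added here for illustration, not part of the original script):

import torch
import triton_python_backend_utils as pb_utils
from torch.utils.dlpack import from_dlpack


def output_to_cuda(inference_response, name: str) -> torch.Tensor:
    """Return the named Triton output as a CUDA tensor, copying only when needed."""
    output = pb_utils.get_output_tensor_by_name(inference_response, name)
    tensor = from_dlpack(output.to_dlpack())
    # Zero-copy if the backend already returned a GPU tensor; otherwise an
    # explicit (and costly) host-to-device transfer.
    return tensor if tensor.is_cuda else tensor.cuda()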

Expected behavior

I understand that even with FORCE_CPU_ONLY_INPUT_TENSORS set to "no" there is no guarantee that the output tensor from the ONNX Runtime model is in CUDA memory, but I would expect it to be the case most of the time, or at least some of the time.

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 8 (6 by maintainers)

Top GitHub Comments

1 reaction
CoderHam commented, Feb 1, 2022

@pommedeterresautee 22.02 will be released in a month from now.

1 reaction
Tabrizian commented, Jan 27, 2022

I assume by TRT engine you mean the TensorRT backend models? I don’t think there is a similar issue with the TRT models.

> Moreover, I understand that the fix is coming in Triton 22.02, may I ask if there is an (even very approximate) date for its release as a Docker image?

cc @dzier
