BLS script + FORCE_CPU_ONLY_INPUT_TENSORS -> output tensor from ORT is NEVER on GPU memory
Description
In a Python (BLS) script, when I retrieve the output of a GPU model (ONNX Runtime), it is always a CPU tensor (over 64 ONNX Runtime calls in a row). DLPack is used to move tensors to/from ONNX Runtime.
Because of that, I need to move the result back to CUDA memory before resending it to the model (generative language model use case). This CUDA -> CPU host -> CUDA round trip takes time during inference.
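For a rough sense of that cost, here is a standalone timing sketch (outside Triton); the tensor shape is an assumption, loosely based on the [-1, -1, 50257] logits output declared in the ONNX Runtime config below:

import time

import torch

# Assumed logits shape, loosely matching the ONNX model output [-1, -1, 50257].
logits = torch.randn(1, 256, 50257, device="cuda")

torch.cuda.synchronize()
start = time.perf_counter()
round_trip = logits.cpu().cuda()  # CUDA -> host -> CUDA, the copy discussed above
torch.cuda.synchronize()
print(f"round trip: {(time.perf_counter() - start) * 1000:.1f} ms")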
FORCE_CPU_ONLY_INPUT_TENSORS is set to "no" (see the config below).
The flow itself works apart from the memory placement issue: it doesn’t crash and the results are as expected.
According to this example, nothing special is needed to keep tensors on GPU, but the example doesn’t involve calling an external model; it’s just the script calling itself: https://github.com/triton-inference-server/server/blob/fd7ceefeb5096add392b12f829b7d77b2b59b73c/qa/python_models/dlpack_io_identity/model.py
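For reference, here is a condensed sketch of the DLPack identity pattern that example demonstrates (not a verbatim copy of that file; the INPUT0/OUTPUT0 names are placeholders). The output tensor is built straight from the input’s DLPack capsule, so whatever device the input lives on is preserved without a copy:

import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            # Re-wrap the input's DLPack capsule as the output tensor;
            # the device of the input is kept as-is, no explicit copy.
            input_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT0")
            output_tensor = pb_utils.Tensor.from_dlpack("OUTPUT0", input_tensor.to_dlpack())
            responses.append(pb_utils.InferenceResponse(output_tensors=[output_tensor]))
        return responses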
Triton Information
What version of Triton are you using? nvcr.io/nvidia/tritonserver:21.12-py3
Are you using the Triton container or did you build it yourself? Docker image from the NVIDIA repository
To Reproduce
Config of the BLS script (I also tried kind: KIND_GPU, just in case):
name: "transformer_generative_model"
max_batch_size: 0
backend: "python"
input [
{
name: "TEXT"
data_type: TYPE_STRING
dims: [ -1 ]
}
]
output [
{
name: "output"
data_type: TYPE_STRING
dims: [ -1 ]
}
]
instance_group [
{
count: 1
kind: KIND_CPU
}
]
parameters: {
key: "FORCE_CPU_ONLY_INPUT_TENSORS"
value: {
string_value:"no"
}
}
ONNX Runtime config
name: "transformer_onnx_model"
max_batch_size: 0
platform: "onnxruntime_onnx"
default_model_filename: "model.bin"
input [
{
name: "input_ids"
data_type: TYPE_INT32
dims: [-1, -1]
}
]
output {
name: "output"
data_type: TYPE_FP32
dims: [-1, -1, 50257]
}
instance_group [
{
count: 1
kind: KIND_GPU
}
]
The BLS script
This function is called 64 times by the BLS script, with an input tensor that is always in CUDA memory. The print calls show that the output is always on CPU, which is the issue I am trying to fix/understand:
...
def inference_triton(input_ids: torch.Tensor) -> torch.Tensor:
    print(f"input_ids device: {input_ids.device}")  # always cuda
    input_ids = input_ids.type(dtype=torch.int32)
    inputs = [pb_utils.Tensor.from_dlpack("input_ids", torch.to_dlpack(input_ids))]
    inference_request = pb_utils.InferenceRequest(
        model_name='transformer_onnx_model',
        requested_output_names=['output'],
        inputs=inputs)
    inference_response = inference_request.exec()
    if inference_response.has_error():
        raise pb_utils.TritonModelException(inference_response.error().message())
    else:
        output = pb_utils.get_output_tensor_by_name(inference_response, 'output')
        print(f"is cpu {output.is_cpu()}")  # always True :-(
        tensor: torch.Tensor = torch.from_dlpack(output.to_dlpack())
        print(f'output device: {tensor.device}')  # always CPU :-(
        tensor = tensor.cuda()  # host -> CUDA copy, takes time :-(
        return tensor
...
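For context, a hypothetical sketch of the generation loop that drives the 64 calls (not the author’s actual model.py; greedy decoding is an assumption):

import torch

def generate(input_ids: torch.Tensor, new_tokens: int = 64) -> torch.Tensor:
    # Each generated token triggers one call to inference_triton(), and every
    # call currently pays the host -> CUDA copy at the end of that helper.
    for _ in range(new_tokens):
        logits = inference_triton(input_ids)          # [batch, seq_len, 50257]
        next_token = logits[:, -1, :].argmax(dim=-1)  # greedy decoding (assumption)
        input_ids = torch.cat([input_ids, next_token.unsqueeze(-1).to(input_ids.dtype)], dim=-1)
    return input_ids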
Expected behavior
I understand that with the FORCE_CPU_ONLY_INPUT_TENSORS option set to "no" there is no guarantee that the output tensor from the ONNX Runtime model is in CUDA memory, but I would expect that to be the case most of the time, or at least some of the time.
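Put differently, the expectation is that a check like the following (a hypothetical sketch, not code from the issue) would take the no-copy branch at least occasionally, whereas in practice output.is_cpu() is always True:

output = pb_utils.get_output_tensor_by_name(inference_response, 'output')
if not output.is_cpu():
    # Expected at least some of the time: tensor already in CUDA memory, no copy.
    tensor = torch.from_dlpack(output.to_dlpack())
else:
    # Observed on every one of the 64 calls: host tensor, extra copy needed.
    tensor = torch.from_dlpack(output.to_dlpack()).cuda()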
Top GitHub Comments
@pommedeterresautee 22.02 will be released about a month from now.
I assume by TRT engine you mean the TensorRT backend models? I don’t think there is a similar issue with the TRT models.
cc @dzier