
Memory leak when using the load/unload API to dynamically load a TensorRT model


Description

I updated my Triton server from version 21.02 to 21.08 (Docker image) to develop a complete service with BLS. I don't have enough GPU memory to load every model at startup, so I used the load/unload client API you provide for this purpose. The same problem occurs on both 21.02 and 21.08.

  1. The unload API does not fully release model memory on any of the backends (PyTorch, ONNX, TensorFlow, TensorRT). First, the PyTorch and ONNX backends do not completely free GPU memory, but they do not leak: the maximum GPU memory usage for a given model is fixed, and unloading the model reduces usage, although a large amount of memory stays occupied. Second, the TensorFlow backend does not release memory at all. (Another issue says this is a TensorFlow bug.)
  2. Using the unload/load API to dynamically load a TensorRT model not only frees GPU memory incompletely but also leaks memory when the same model is reloaded. What I expect is a fixed maximum of GPU memory usage when the same model is reloaded, as with PyTorch and ONNX. Although GPU memory usage drops when the TensorRT model is unloaded, it increases again every time the model is reloaded. (A memory-sampling sketch follows this list.)
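
To make the comparison concrete, here is a minimal sketch (not part of the original report) that samples GPU memory around repeated load/unload cycles. It assumes tritonclient[grpc] is installed, nvidia-smi is on the PATH, GPU index 0 is used, and a model named densenet_trt is present in the repository. A leak shows up as the post-unload reading drifting upward across iterations instead of settling at a fixed base above the baseline.

# Hedged sketch: sample GPU memory before/after each load/unload cycle.
# Assumptions (not from the original report): tritonclient[grpc] installed,
# nvidia-smi on PATH, GPU index 0, model "densenet_trt" in the repository.
import subprocess
import time

import tritonclient.grpc as grpcclient

def gpu_mem_mib(gpu_index=0):
    # Used GPU memory in MiB, as reported by nvidia-smi for one GPU.
    out = subprocess.check_output([
        "nvidia-smi", "--query-gpu=memory.used",
        "--format=csv,noheader,nounits", "-i", str(gpu_index)])
    return int(out.decode().strip())

client = grpcclient.InferenceServerClient(url="localhost:8001")
baseline = gpu_mem_mib()
for i in range(20):
    client.load_model("densenet_trt")
    loaded = gpu_mem_mib()
    client.unload_model("densenet_trt")
    time.sleep(2)  # unloading completes asynchronously; give the server a moment
    unloaded = gpu_mem_mib()
    print("iter {}: loaded={} MiB, after unload={} MiB, growth over baseline={} MiB"
          .format(i + 1, loaded, unloaded, unloaded - baseline))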

Triton Information

Triton server version: 21.02 / 21.08. Triton server image: I used the image offered on the official website, and I also tried building it myself; the problem occurred in both cases.

To Reproduce

Using version 21.08 as an example:

  1. docker run --rm -it --gpus all --name Triton_Server --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -p 8000:8000 -p 8001:8001 -p 8002:8002 -v /home/infor/model-repository:/models nvcr.io/nvidia/tritonserver:21.08-py3 tritonserver --model-control-mode=explicit --model-repository=/models/ --strict-model-config=false --grpc-infer-allocation-pool-size=16 --log-verbose 1
  2. Use this script (see Client Testing Script below) to test GPU memory for each model, and use nvidia-smi to observe GPU memory usage. (The tested TensorRT model was converted from the densenet_onnx model provided in the model-repository examples; a hedged conversion sketch follows this list.)
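
The report does not show how the densenet_onnx model was converted. As one possibility, here is a hedged sketch using the TensorRT Python API (the trtexec CLI is another common route); the paths, workspace size, and the assumption of TensorRT 8.x are mine, not the reporter's.

# Hypothetical sketch: build a TensorRT plan from the densenet_onnx example.
# Paths and workspace size are assumptions; TensorRT 8.x API is assumed.
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def onnx_to_plan(onnx_path, plan_path):
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, TRT_LOGGER)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            for i in range(parser.num_errors):
                print(parser.get_error(i))
            raise RuntimeError("failed to parse ONNX model")
    config = builder.create_builder_config()
    config.max_workspace_size = 1 << 30  # 1 GiB build workspace (assumption)
    plan = builder.build_serialized_network(network, config)
    if plan is None:
        raise RuntimeError("engine build failed")
    with open(plan_path, "wb") as f:
        f.write(plan)

onnx_to_plan("/models/densenet_onnx/1/model.onnx",
             "/models/densenet_trt/1/model.plan")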

Models and Testing Results

  • Initial Triton Inference Server image: (screenshot omitted)
  1. PyTorch: (screenshots omitted)
  2. ONNX: (screenshots omitted)
  3. TensorFlow: (screenshots omitted)
  4. TensorRT:

    • model: densenet_trt (converted from densenet_onnx)
    • result: the first execution did not use unload, the second one did (nvidia-smi screenshot omitted, 2021-09-03 11:36:45)
    • log: tritonserver_log_tensorrt.log

Expected behavior

My final goal is that when a client requests a model unload, the model is fully released from GPU memory. At minimum, reloading a TensorRT model should not cause a memory leak.

Client Testing Script

#!/usr/bin/env python
import argparse
import time
import sys

import tritonclient.grpc as grpcclient
from tritonclient.utils import InferenceServerException

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('-v',
                        '--verbose',
                        action="store_true",
                        required=False,
                        default=False,
                        help='Enable verbose output')
    parser.add_argument('-u',
                        '--url',
                        type=str,
                        required=False,
                        default='localhost:8001',
                        help='Inference server URL. Default is localhost:8001.')
    parser.add_argument(
        '-m',
        '--model_name',
        type=str,
        required=False,
        default='preprocess_inception_ensemble',
        help='Name of model. Default is preprocess_inception_ensemble.')
    parser.add_argument('-d',
                        '--close_unload_model',
                        action="store_true",
                        required=False,
                        default=False,
                        help='Skip unloading the model after each load')
    parser.add_argument('-l',
                        '--loop',
                        type=int,
                        required=False,
                        default=10,
                        help='Iteration number')

    FLAGS = parser.parse_args()

    for i in range(FLAGS.loop):

        print("iteration: {}".format(i+1))
        print("*" * 50)

        try:
            triton_client = grpcclient.InferenceServerClient(
                url=FLAGS.url, verbose=FLAGS.verbose)
        except Exception as e:
            print("\tcontext creation failed: " + str(e))
            sys.exit(1)

        model_name = FLAGS.model_name

        load_start = time.time()
        triton_client.load_model(model_name)
        load_end   = time.time()
        print("\tLoading time: {:.2f}ms".format((load_end - load_start) * 1000))
        if not triton_client.is_model_ready(model_name):
            print('\tFAILED : Load Model')
            sys.exit(1)
        else:
            print("\tModel loading pass")
            # Make sure the model matches our requirements, and get some
            # properties of the model that we need for preprocessing
            try:
                model_metadata = triton_client.get_model_metadata(
                    model_name=FLAGS.model_name, model_version="1")
                model_config = triton_client.get_model_config(
                    model_name=FLAGS.model_name, model_version="1"
                )
                # print("model config: {}".format(model_config))
                # print("model metadata: {}".format(model_metadata))
                print("\tGet config and metadata pass")
            except InferenceServerException as e:
                print("\tfailed to retrieve the metadata or config: " + str(e))
                sys.exit(1)

        if not FLAGS.close_unload_model:
            unload_start = time.time()
            triton_client.unload_model(model_name)
            unload_end   = time.time()
            print("\tUnloading time: {:.2f}ms".format((unload_end - unload_start) * 1000))
            if triton_client.is_model_ready(model_name):
                print('\tFAILED : Unload Model')
                sys.exit(1)
            else:
                print("\tModel unloading pass")

ArcFace Config

name: "arcface_r100_torch"
platform: "pytorch_libtorch"
max_batch_size : 2
input [
  {
    name: "input__0"
    data_type: TYPE_FP32
    format: FORMAT_NCHW
    dims: [ 3 , 112, 112 ]

  }
]
output [
  {
    name: "output__0"
    data_type: TYPE_FP32
    dims: [ 512 ]
    reshape { shape: [ 1, 512 ] }
  }
]


Top GitHub Comments

GuanLuo commented on Nov 12, 2021:

In response to your points:

  1. unload requests the framework to unload the model, but Triton does not have full control over how the frameworks manage the memory they allocate. Your observation aligns with our understanding: for ORT and PyTorch, unload results in reduced memory usage, but there is a base usage reserved by the framework. For TF, it is a known issue that it allocates resources greedily and will not release them even after the model is unloaded.
  2. While investigating another instance of a TRT backend memory leak, we found that the leak is actually due to the autofill feature in the TRT backend (--strict-model-config=false), which is likely the same issue in your case. This should be fixed in 21.10; would you mind trying the latest release to see if it resolves the issue? If so, this issue can be closed.
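
Until that upgrade, one workaround consistent with the autofill diagnosis above would be to supply an explicit config.pbtxt for the TensorRT model so that autofill (--strict-model-config=false) is not exercised. This is only a sketch; the tensor names, dims, and batching setting are placeholders and must match the converted densenet plan:

name: "densenet_trt"
platform: "tensorrt_plan"
max_batch_size: 0   # placeholder: set according to the plan's batching support
input [
  {
    name: "INPUT_TENSOR_NAME"   # placeholder: must match the plan's input binding
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]       # placeholder dims
  }
]
output [
  {
    name: "OUTPUT_TENSOR_NAME"  # placeholder: must match the plan's output binding
    data_type: TYPE_FP32
    dims: [ 1000 ]              # placeholder dims
  }
]
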
handoku commented on Dec 16, 2021:

@GuanLuo OK, thanks for your fast reply.
