Memory leak when using the load/unload API to dynamically load TensorRT models
Description
I updated my Triton server from version 21.02 to 21.08 (Docker image) to develop a complete service with BLS, but I don't have enough GPU memory to load every model at startup, so I use the `load`/`unload` client API that you offer. The same problem occurs on both 21.02 and 21.08: the `unload` API does not completely release model memory on any of the backends (PyTorch, ONNX, TensorFlow, TensorRT).
- The PyTorch and ONNX backends do not fully free GPU memory, but they do not leak. In other words, the maximum GPU memory usage for a given model is fixed: unloading the model reduces GPU memory, but a large amount of memory stays occupied. The TensorFlow backend does not release memory at all (another issue says this is a TensorFlow bug).
- Using the `unload`/`load` API to dynamically load a TensorRT model not only frees GPU memory incompletely but also leaks memory when the same model is reloaded. What I expect is a fixed maximum GPU memory usage when the same model is reloaded, just like PyTorch and ONNX. Although GPU memory usage drops when the TensorRT model is unloaded, the usage increases every time it is reloaded.
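The `load`/`unload` calls above are the explicit model-control methods of the Python gRPC client. A minimal sketch of the call pattern (the model name here is only an example; the full testing script is listed under Client Testing Script below):

```python
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

client.load_model("densenet_trt")    # GPU memory grows while the model is loaded
assert client.is_model_ready("densenet_trt")
# ... run inference as usual ...
client.unload_model("densenet_trt")  # expectation: memory drops back close to the pre-load level
```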
Triton Information
Triton server version: 21.02 / 21.08
Triton server image: I used the image offered on the official website, and I also tried building it myself; the problem occurs either way.
To Reproduce
Using version 21.08 as an example:
docker run --rm -it --gpus all --name Triton_Server --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -p 8000:8000 -p 8001:8001 -p 8002:8002 -v /home/infor/model-repository:/models nvcr.io/nvidia/tritonserver:21.08-py3 tritonserver --model-control-mode=explicit --model-repository=/models/ --strict-model-config=false --grpc-infer-allocation-pool-size=16 --log-verbose 1
- Use this script (see Client Testing Script below) to test GPU memory for each model, and use nvidia-smi to observe GPU memory usage. (P.S. The tested TensorRT model was converted from the densenet_onnx model provided in the model-repository examples.)
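To make the nvidia-smi observations easier to compare across iterations, the used memory can also be sampled programmatically. A minimal sketch using pynvml (the nvidia-ml-py package); the helper name and GPU index 0 are assumptions, not part of the original test:

```python
import pynvml

def gpu_memory_used_bytes(gpu_index=0):
    """Return currently used GPU memory in bytes, as reported by NVML (same source as nvidia-smi)."""
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
        return pynvml.nvmlDeviceGetMemoryInfo(handle).used
    finally:
        pynvml.nvmlShutdown()

# Call this before/after each load_model/unload_model in the test loop.
print("used MiB:", gpu_memory_used_bytes() // (1024 * 1024))
```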
Models and Testing Results
- Initial Triton Inference Server (started with no models loaded):
- PyTorch:
  - model: ArcFace from insightface
  - config: see ArcFace Config below
  - result: the first execution did not use `unload`; the second one used `unload`.
  - log: tritonserver_log_pytorch.log
- ONNX:
  - model: densenet_onnx
  - without `unload`:
  - with `unload`:
  - log: tritonserver_log_onnx.log
- TensorFlow:
  - model: Inception_graphdef
  - without `unload`:
  - with `unload`:
  - log: tritonserver_log_tensorflow.log
- TensorRT:
  - model: densenet_trt (converted from densenet_onnx; see the conversion note after this list)
  - result: the first execution did not use `unload`; the second one used `unload`.
  - log: tritonserver_log_tensorrt.log
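Conversion note: the densenet_trt model above was converted from the densenet_onnx example. The exact conversion command is not part of this report; one common way to produce such an engine (an assumption, not necessarily what was used here) is TensorRT's trtexec, e.g. `trtexec --onnx=model.onnx --saveEngine=model.plan`, with the resulting model.plan placed under densenet_trt/1/ in the model repository.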
Expected behavior
My final goal is that when the client requests to unload a model, the model is fully released from GPU memory. At the very least, there should be no memory leak for the TensorRT model.
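To state the expectation numerically: if the GPU memory in use is recorded after every unload (e.g. with nvidia-smi or the pynvml sketch above), the readings should stay at the first-iteration baseline instead of growing. A small sketch of that check; the function name and the 5% tolerance are arbitrary choices, not from Triton:

```python
def check_no_leak(used_after_unload, tolerance=0.05):
    """used_after_unload: GPU memory (bytes) in use after each unload_model call."""
    baseline = used_after_unload[0]
    for i, used in enumerate(used_after_unload[1:], start=2):
        if used > baseline * (1 + tolerance):
            return False, "iteration {}: {} bytes vs baseline {} bytes".format(i, used, baseline)
    return True, "memory returned to baseline on every iteration"

# Example with made-up numbers; a steadily growing series like this indicates a leak.
print(check_no_leak([1_200_000_000, 1_450_000_000, 1_700_000_000]))
```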
Client Testing Script
#!/usr/bin/env python
import argparse
import time
import sys

import tritonclient.grpc as grpcclient
from tritonclient.utils import InferenceServerException

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('-v',
                        '--verbose',
                        action="store_true",
                        required=False,
                        default=False,
                        help='Enable verbose output')
    parser.add_argument('-u',
                        '--url',
                        type=str,
                        required=False,
                        default='localhost:8001',
                        help='Inference server URL. Default is localhost:8001.')
    parser.add_argument(
        '-m',
        '--model_name',
        type=str,
        required=False,
        default='preprocess_inception_ensemble',
        help='Name of model. Default is preprocess_inception_ensemble.')
    parser.add_argument('-d',
                        '--close_unload_model',
                        action="store_true",
                        required=False,
                        default=False,
                        help='Skip unloading the model after each iteration')
    parser.add_argument('-l',
                        '--loop',
                        type=int,
                        required=False,
                        default=10,
                        help='Iteration number')
    FLAGS = parser.parse_args()

    for i in range(FLAGS.loop):
        print("iteration: {}".format(i + 1))
        print("*" * 50)

        try:
            triton_client = grpcclient.InferenceServerClient(url=FLAGS.url,
                                                             verbose=FLAGS.verbose)
        except Exception as e:
            print("\tcontext creation failed: " + str(e))
            sys.exit(1)

        model_name = FLAGS.model_name

        # Explicitly load the model and time the load request.
        load_start = time.time()
        triton_client.load_model(model_name)
        load_end = time.time()
        print("\tLoading time: {:.2f}ms".format((load_end - load_start) * 1000))

        if not triton_client.is_model_ready(model_name):
            print('\tFAILED : Load Model')
            sys.exit(1)
        else:
            print("\tModel loading pass")

        # Make sure the model matches our requirements, and get some
        # properties of the model that we need for preprocessing
        try:
            model_metadata = triton_client.get_model_metadata(
                model_name=FLAGS.model_name, model_version="1")
            model_config = triton_client.get_model_config(
                model_name=FLAGS.model_name, model_version="1")
            # print("model config: {}".format(model_config))
            # print("model metadata: {}".format(model_metadata))
            print("\tGet config and metadata pass")
        except InferenceServerException as e:
            print("\tfailed to retrieve the metadata or config: " + str(e))
            sys.exit(1)

        # Unload the model (unless -d/--close_unload_model is given) and time the request.
        if not FLAGS.close_unload_model:
            unload_start = time.time()
            triton_client.unload_model(model_name)
            unload_end = time.time()
            print("\tUnloading time: {:.2f}ms".format((unload_end - unload_start) * 1000))

            if triton_client.is_model_ready(model_name):
                print('\tFAILED : Unload Model')
                sys.exit(1)
            else:
                print("\tModel unloading pass")
ArcFace Config
name: "arcface_r100_torch"
platform: "pytorch_libtorch"
max_batch_size : 2
input [
{
name: "input__0"
data_type: TYPE_FP32
format: FORMAT_NCHW
dims: [ 3 , 112, 112 ]
}
]
output [
{
name: "output__0"
data_type: TYPE_FP32
dims: [ 512 ]
reshape { shape: [ 1, 512 ] }
}
]
Top GitHub Comments
In response to your points: for `unload`, Triton requests that the framework unload the model, but it does not have full control over how the frameworks manage the memory they allocate. Your observation aligns with our understanding: for ORT and PyTorch, unloading reduces memory usage, but there is a base usage reserved by the framework. For TF, it is a known issue that it allocates resources greedily and will not release them even after the model is unloaded. Regarding TensorRT, there was a memory leak related to model configuration auto-complete (`--strict-model-config=false`), which is likely the same issue in your case. The issue should be fixed in 21.10; do you mind trying it to see if the latest release fixes the issue? If so, this issue can be closed.

@GuanLuo Ok, thanks for your fast reply