
TensorRT Plan blocking GPU Memory

See original GitHub issue

  • TRTIS: 19.02-py3
  • TensorRT: 5.0.2
  • GPU: GTX 1080 Ti

I have several classification models (TensorFlow) in my serving model directory. When I run TRTIS, everything works as expected: depending on the current request, the server unloads models and loads the one being requested.

However, when I converted all my models to TensorRT plans, all of the models are loaded onto the GPU at once. Each model uses 3 GB of GPU RAM, which allows me to serve only up to 4 models. If I exceed that, I get terminate called after throwing an instance of 'nvinfer1::CudaError':

I0311 14:15:13.631856 1 logging.cc:49] Glob Size is 162081436 bytes.
I0311 14:15:13.652293 1 logging.cc:49] Added linear block of size 1106380800
I0311 14:15:13.652340 1 logging.cc:49] Added linear block of size 553190400
I0311 14:15:13.652345 1 logging.cc:49] Added linear block of size 272844800
I0311 14:15:13.652350 1 logging.cc:49] Added linear block of size 250880000
I0311 14:15:13.652354 1 logging.cc:49] Added linear block of size 6554624
I0311 14:15:13.697682 1 logging.cc:49] Deserialize required 66238 microseconds.
E0311 14:15:13.700653 1 logging.cc:43] runtime.cpp (24) - Cuda Error in allocate: 2
terminate called after throwing an instance of 'nvinfer1::CudaError'
  what():  std::exception

Shouldn’t it behave the same way as in the TensorFlow model case? Or should I generate smaller plan files with a lower max batch size to fit all models on a single GPU?

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

3 reactions
deadeyegoodwin commented, Mar 13, 2019

The inference server does not load and unload models dynamically. On startup the inference server loads all models from the model repository. Some frameworks (like TRT) allocate all their required GPU memory immediately. Other frameworks (like TF) allocate a little memory on startup but allocate most of their memory dynamically as needed.

Because their memory is allocated upfront, TRT models should not run out of memory once they are loaded. TF models, however, could load but then run out of memory while performing inference. You should be able to see this by using many instances of a TF model (with instance_groups) and then sending enough requests to the server so that all those instances are busy (perhaps with perf_client).
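As a minimal sketch of that experiment, a hypothetical config.pbtxt for a TF model with several GPU instances might look like the following (the model name, platform, tensor names, and shapes are made up for illustration, not taken from the issue):

  name: "my_tf_classifier"            # hypothetical model name
  platform: "tensorflow_graphdef"     # or "tensorflow_savedmodel"
  max_batch_size: 8
  input [
    {
      name: "input"                   # assumed input tensor name
      data_type: TYPE_FP32
      dims: [ 224, 224, 3 ]
    }
  ]
  output [
    {
      name: "probabilities"           # assumed output tensor name
      data_type: TYPE_FP32
      dims: [ 1000 ]
    }
  ]
  instance_group [
    {
      count: 4                        # run four instances of this model on the GPU
      kind: KIND_GPU
    }
  ]

Driving all four instances at once with perf_client should then make TF's dynamic allocations visible, in contrast with the fixed upfront allocation of a TRT plan.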

You can reduce the size of a TRT model by reducing the maximum batch size.
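A rough sketch of what that means at plan-build time, using the TensorRT 5.x Python API shipped alongside TRTIS 19.02 (the UFF path, tensor names, and shapes below are placeholders; adjust them to your models):

  import tensorrt as trt

  TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

  def build_plan(uff_path, max_batch_size=4, workspace_bytes=1 << 30):
      # Parse the network from a UFF file (tensor names/shapes are placeholders).
      builder = trt.Builder(TRT_LOGGER)
      network = builder.create_network()
      parser = trt.UffParser()
      parser.register_input("input", (3, 224, 224))
      parser.register_output("probabilities")
      parser.parse(uff_path, network)

      # The activation ("linear block") memory reported in the server log scales
      # with max_batch_size, so lowering it shrinks the per-model GPU footprint.
      builder.max_batch_size = max_batch_size
      builder.max_workspace_size = workspace_bytes

      engine = builder.build_cuda_engine(network)
      with open("model.plan", "wb") as f:
          f.write(engine.serialize())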

Note that the inference server will load/unload models if you modify the model repository while the server is running. See https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-guide/docs/model_repository.html#modifying-the-model-repository
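For example, with the server configured to poll the repository for changes, moving a model's directory out of the repository unloads it and copying it back reloads it (paths are hypothetical):

  # The running server unloads the model when its directory disappears...
  mv /models/resnet50_trt_plan /tmp/parked/
  # ...and loads it again when the directory reappears.
  cp -r /tmp/parked/resnet50_trt_plan /models/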

In the future the inference server will have another model repository management mode where models are not loaded on startup or on model repository changes. Instead there will be a “model control” API that can be used to load and unload models on demand.
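For context, this is roughly the form that mode later took in Triton Inference Server; the flag and endpoints below come from later Triton releases, not from TRTIS 19.02:

  # Start without loading anything from the repository...
  tritonserver --model-repository=/models --model-control-mode=explicit
  # ...then load and unload models on demand over the HTTP API.
  curl -X POST localhost:8000/v2/repository/models/resnet50_trt_plan/load
  curl -X POST localhost:8000/v2/repository/models/resnet50_trt_plan/unload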

0 reactions
deadeyegoodwin commented, Apr 29, 2020

The API itself is OK and stable; we could probably remove the experimental label. There can be difficulties depending on the framework backend. Currently I think only TF is problematic, because it allocates GPU memory in chunks and never returns them to the system. It is smart about reusing memory within its own allocation, but, for example, if you unload all TF models, the TF framework will still hold onto whatever memory it has allocated. That memory may be unused by TF at the moment, and if you load a TF model it will just use that already-allocated memory instead of requesting more, but it will not give it back to the system so that another framework could use it.

Read more comments on GitHub >

Top Results From Across the Web

  • TensorRT Plan blocking GPU Memory · Issue #145 - GitHub
    On startup the inference server loads all models from the model repository. Some frameworks (like TRT) allocate all their required GPU memory ...
  • TensorRT memory management - NVIDIA Developer Forums
    Everytime a new engine loading to the memory will lock a specific part of memory. However, the image processing functions also require GPU...
  • Run multiple deep learning models on GPU with Amazon ...
    Use trtexec to create a TensorRT engine plan from the model.onnx file. You can optionally reduce the precision of floating-point computations, ...
  • Blocked Algorithms for Neural Networks - Harvard DASH
    GPU programming model (blocked program, ... The number of SMs (resp. memory channels) available ... [12] NVIDIA, “The tensorrt library.
  • Speeding Up Deep Learning Inference Using TensorRT
    TensorRT allows you to increase GPU memory footprint during the engine building phase with the setMaxWorkspaceSize function. Increasing the ...
