
TensorRT Plan blocking GPU Memory

See original GitHub issue

  • TRTIS: 19.02-py3
  • TensorRT: 5.0.2
  • GPU: GTX 1080 Ti

I have several classification models (TensorFlow) in my serving model directory. When I run TRTIS, everything works as expected: depending on the current request, the server unloads models and loads the one being requested.

However, when I converted all my models to TensorRT plans, all of the models are loaded onto the GPU at once. Each model uses 3 GB of GPU RAM, which allows me to serve only up to 4 models. If I exceed that, I get terminate called after throwing an instance of 'nvinfer1::CudaError':

I0311 14:15:13.631856 1 logging.cc:49] Glob Size is 162081436 bytes.
I0311 14:15:13.652293 1 logging.cc:49] Added linear block of size 1106380800
I0311 14:15:13.652340 1 logging.cc:49] Added linear block of size 553190400
I0311 14:15:13.652345 1 logging.cc:49] Added linear block of size 272844800
I0311 14:15:13.652350 1 logging.cc:49] Added linear block of size 250880000
I0311 14:15:13.652354 1 logging.cc:49] Added linear block of size 6554624
I0311 14:15:13.697682 1 logging.cc:49] Deserialize required 66238 microseconds.
E0311 14:15:13.700653 1 logging.cc:43] runtime.cpp (24) - Cuda Error in allocate: 2
terminate called after throwing an instance of 'nvinfer1::CudaError'
  what():  std::exception

Shouldn’t it behave the same way as in the TensorFlow model case? Or should I generate smaller plan files with a lower max batch size to fit all models on a single GPU?

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

3 reactions
deadeyegoodwin commented, Mar 13, 2019

The inference server does not load and unload models dynamically. On startup the inference server loads all models from the model repository. Some frameworks (like TRT) allocate all their required GPU memory immediately. Other frameworks (like TF) allocate a little memory on startup but allocate most of their memory dynamically as needed.

Because their memory is allocated upfront, TRT models should not run out of memory once they are loaded. TF models, however, could load but then run out of memory while performing inference. You should be able to see this by using many instances of a TF model (with instance_groups) and then sending enough requests to the server so that all those instances are busy (perhaps with perf_client).
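As a minimal sketch of that experiment, a hypothetical config.pbtxt for a TF model with several GPU instances might look like the following (the model name, platform, tensor names, and shapes are made up for illustration, not taken from the issue):

  name: "my_tf_classifier"            # hypothetical model name
  platform: "tensorflow_graphdef"     # or "tensorflow_savedmodel"
  max_batch_size: 8
  input [
    {
      name: "input"                   # assumed input tensor name
      data_type: TYPE_FP32
      dims: [ 224, 224, 3 ]
    }
  ]
  output [
    {
      name: "probabilities"           # assumed output tensor name
      data_type: TYPE_FP32
      dims: [ 1000 ]
    }
  ]
  instance_group [
    {
      count: 4                        # run four instances of this model on the GPU
      kind: KIND_GPU
    }
  ]

Driving all four instances at once with perf_client should then make TF's dynamic allocations visible, in contrast with the fixed upfront allocation of a TRT plan.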

You can reduce the size of a TRT model by reducing the maximum batch size.
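A rough sketch of what that means at plan-build time, using the TensorRT 5.x Python API shipped alongside TRTIS 19.02 (the UFF path, tensor names, and shapes below are placeholders; adjust them to your models):

  import tensorrt as trt

  TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

  def build_plan(uff_path, max_batch_size=4, workspace_bytes=1 << 30):
      # Parse the network from a UFF file (tensor names/shapes are placeholders).
      builder = trt.Builder(TRT_LOGGER)
      network = builder.create_network()
      parser = trt.UffParser()
      parser.register_input("input", (3, 224, 224))
      parser.register_output("probabilities")
      parser.parse(uff_path, network)

      # The activation ("linear block") memory reported in the server log scales
      # with max_batch_size, so lowering it shrinks the per-model GPU footprint.
      builder.max_batch_size = max_batch_size
      builder.max_workspace_size = workspace_bytes

      engine = builder.build_cuda_engine(network)
      with open("model.plan", "wb") as f:
          f.write(engine.serialize())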

Note that the inference server will load/unload models if you modify the model repository while the server is running. See https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-guide/docs/model_repository.html#modifying-the-model-repository
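For example, with the server configured to poll the repository for changes, moving a model's directory out of the repository unloads it and copying it back reloads it (paths are hypothetical):

  # The running server unloads the model when its directory disappears...
  mv /models/resnet50_trt_plan /tmp/parked/
  # ...and loads it again when the directory reappears.
  cp -r /tmp/parked/resnet50_trt_plan /models/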

In the future the inference server will have another model repository management mode where models are not loaded on startup or on model repository changes. Instead there will be a “model control” API that can be used to load and unload models on demand.
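For context, this is roughly the form that mode later took in Triton Inference Server; the flag and endpoints below come from later Triton releases, not from TRTIS 19.02:

  # Start without loading anything from the repository...
  tritonserver --model-repository=/models --model-control-mode=explicit
  # ...then load and unload models on demand over the HTTP API.
  curl -X POST localhost:8000/v2/repository/models/resnet50_trt_plan/load
  curl -X POST localhost:8000/v2/repository/models/resnet50_trt_plan/unload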

0 reactions
deadeyegoodwin commented, Apr 29, 2020

The API itself is OK and stable; we could probably remove the experimental label. There can be difficulties depending on the framework backend. Currently I think only TF is problematic, because it allocates GPU memory in chunks and never returns them to the system. It is smart about reusing memory within its own allocation, but, for example, if you unload all TF models, the TF framework will still hold onto whatever memory it has allocated. That memory may be unused by TF at the moment, and if you load a TF model it will just use that already-allocated memory instead of requesting more, but it will not give it back to the system so that another framework could use it.

Read more comments on GitHub >

Top Results From Across the Web

  • TensorRT Plan blocking GPU Memory · Issue #145 - GitHub
    On startup the inference server loads all models from the model repository. Some frameworks (like TRT) allocate all their required GPU memory ...
  • TensorRT memory management - NVIDIA Developer Forums
    Everytime a new engine loading to the memory will lock a specific part of memory. However, the image processing functions also require GPU...
  • Run multiple deep learning models on GPU with Amazon ...
    Use trtexec to create a TensorRT engine plan from the model.onnx file. You can optionally reduce the precision of floating-point computations, ...
  • Blocked Algorithms for Neural Networks - Harvard DASH
    GPU programming model (blocked program, ... The number of SMs (resp. memory channels) available ... [12] NVIDIA, “The tensorrt library.
  • Speeding Up Deep Learning Inference Using TensorRT
    TensorRT allows you to increase GPU memory footprint during the engine building phase with the setMaxWorkspaceSize function. Increasing the ...
