nv_gpu_memory_used_bytes metric does not decrease on model unload
See original GitHub issue
From local testing with 20.01, the metric nv_gpu_memory_used_bytes exposed on /metrics does not decrease on model unload. Assuming this is expected, would there be some way to expose the actual memory used by the loaded models?
I ask because orchestrators (for example, when running inside Kubernetes) might wish to determine the memory available for new models on the server.
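Until per-model memory is exposed, an orchestrator can only read the aggregate gauge. Below is a minimal sketch of parsing it out of the Prometheus text format served on /metrics; the label names in the sample are illustrative, not taken from a real server response.

```python
def parse_gpu_memory_used(metrics_text):
    """Parse nv_gpu_memory_used_bytes samples from Prometheus
    text-format output, returning {series_name: bytes_used}."""
    usage = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        # Skip HELP/TYPE comments and unrelated metrics.
        if not line.startswith("nv_gpu_memory_used_bytes"):
            continue
        # Prometheus text format: "<name>{labels} <value>".
        series, _, value = line.rpartition(" ")
        usage[series] = float(value)
    return usage

# Example against a captured /metrics snippet (labels are illustrative):
sample = (
    '# HELP nv_gpu_memory_used_bytes GPU used memory, in bytes\n'
    '# TYPE nv_gpu_memory_used_bytes gauge\n'
    'nv_gpu_memory_used_bytes{gpu_uuid="GPU-abc"} 4.2e+09\n'
)
print(parse_gpu_memory_used(sample))
```

Note that, per the issue above, this gauge reflects total GPU memory in use (including memory a backend holds after unload), not the memory attributable to currently loaded models.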
Issue Analytics
- Created: 4 years ago
- Comments: 6 (3 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Any deployment plan should be aware of the limitations of the different backends. TensorFlow (TF) is not a particularly good framework for inference, for a couple of reasons (this memory-usage policy being one of them). If you want a long-running “dynamic model repository” TRTIS instance, where you use the model-control APIs to load and unload models, then you need to account for TF's behavior. TRTIS provides a command-line option to limit TF to a fraction of the GPU memory, but this does not cause TF to release its memory on unload.
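As a sketch of the option mentioned above: the flag name shown here matches my recollection of the r20.01 trtserver CLI and should be verified against `trtserver --help` for your release.

```shell
# Launch TRTIS with TensorFlow capped to 40% of GPU memory.
# This limits how much TF may allocate, but does not make TF
# release that memory when a model is unloaded.
trtserver --model-repository=/models --tf-gpu-memory-fraction=0.4
```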
For a system like Kubernetes, an alternative is to use a “static model repository” TRTIS instance: TRTIS is started with a fixed set of models, and that set never changes. If a change is needed, use Kubernetes to roll out the new configuration via a rolling update or similar.
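The rolling-update approach could look like the following; the Deployment name `trtis` and the image tag are hypothetical placeholders for your own cluster.

```shell
# Point the Deployment at a new server image (or an image baked
# with the updated model repository), then let Kubernetes replace
# pods one at a time so capacity stays available during the swap.
kubectl set image deployment/trtis trtis=nvcr.io/nvidia/tensorrtserver:20.01-py3
kubectl rollout status deployment/trtis
```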
Each approach has advantages and disadvantages.
Thanks for your feedback. Very useful.