CPU memory slowly increases when reusing an InferContext object across many requests
Description
I noticed that after a few hours of sending 4 × 500k requests to a Triton server (deployed with 4 TensorRT models), CPU memory usage increased by about 2% of 32 GB. After letting it run for a day, the memory increased further. However, if I create a new InferContext object for every request, memory usage does not go up after sending the same number of requests.
I used the HTTP protocol and synchronous API calls.
Triton Information
Server: nvcr.io/nvidia/tritonserver:20.03-py3
Client: nvcr.io/nvidia/tritonserver:20.03-py3-clientsdk
Are you using the Triton container or did you build it yourself? Triton container
To Reproduce
Here is the class I used for the Triton client to send requests to the server. Note that I used 4 TritonClient objects, one per model.
import sys
import logging

import tensorrtserver.api as triton


def get_model_info(url, protocol, model_name, verbose=False):
    ctx = triton.ServerStatusContext(url, protocol, model_name, verbose)
    server_status = ctx.get_server_status()
    if model_name not in server_status.model_status:
        raise Exception("unable to get status for {}".format(model_name))
    status = server_status.model_status[model_name]
    config = status.config
    return config.input, config.output


class TritonClient:
    def __init__(self, url, protocol, model_name, model_version, verbose=False):
        self.url = url
        self.protocol = triton.ProtocolType.from_str(protocol)
        self.model_name = model_name
        self.model_version = model_version
        input_nodes, output_nodes = get_model_info(self.url, self.protocol, self.model_name)
        self.input_name = input_nodes[0].name
        self.output_names = [output.name for output in output_nodes]
        # Moving this line into do_inference (i.e. creating a fresh
        # InferContext per request) resolves the memory growth.
        self.trt_ctx = triton.InferContext(
            self.url, self.protocol, self.model_name, self.model_version, verbose=verbose
        )
        self.output_dict = {
            name: triton.InferContext.ResultFormat.RAW for name in self.output_names
        }

    def do_inference(self, x: list, keep_name=False):
        batch_size = len(x)
        try:
            output = self.trt_ctx.run({self.input_name: x}, self.output_dict, batch_size)
        except triton.InferenceServerException as e:
            logging.error(e)
            sys.exit(1)
        if not keep_name:
            return [output[name] for name in self.output_names]
        return output
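As the comment in `__init__` notes, constructing the InferContext inside `do_inference` avoids the growth. Below is a sketch of that per-request variant, written as a hypothetical standalone helper that reuses the fields of a TritonClient instance; the SDK import is deferred into the function body so the sketch can be defined even where the client SDK is not installed.

```python
def do_inference_fresh_ctx(client, x, keep_name=False):
    # Deferred import: lets this sketch be defined without the
    # tensorrtserver client SDK installed.
    import tensorrtserver.api as triton

    # Build a fresh InferContext for this single request instead of
    # reusing the long-lived client.trt_ctx -- the pattern the report
    # says keeps CPU memory flat.
    ctx = triton.InferContext(client.url, client.protocol,
                              client.model_name, client.model_version)
    output = ctx.run({client.input_name: x}, client.output_dict, len(x))
    if not keep_name:
        return [output[name] for name in client.output_names]
    return output
```

The trade-off is the per-request cost of context setup and connection establishment, which is why reusing a single context is the natural first choice.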
Expected behavior
CPU memory should not increase when reusing the same InferContext object across requests.
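One way to check this expectation without external tooling is to track the process's peak resident set size across request batches. This is a stdlib-only sketch (the `resource` module is Unix-only, and on Linux `ru_maxrss` is reported in kilobytes); the dummy workload is a stand-in for real `do_inference` calls.

```python
import resource

def rss_mb():
    # Peak resident set size of this process; on Linux, ru_maxrss is
    # in kilobytes, so divide by 1024 to get megabytes.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

def measure_growth(run_batch, n_batches=10):
    # Record peak RSS, run the workload n_batches times, record again.
    start = rss_mb()
    for _ in range(n_batches):
        run_batch()
    return start, rss_mb()

# Dummy workload standing in for client.do_inference(batch); against a
# real deployment, pass a closure that sends one batch to Triton.
start, end = measure_growth(lambda: [b"x" * 1024 for _ in range(1000)])
print("RSS before: %.1f MB, after: %.1f MB" % (start, end))
```

If `end` keeps climbing run after run with a long-lived InferContext but stays flat with per-request contexts, that reproduces the behavior described above.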
Issue Analytics
- Created: 3 years ago
- Comments: 15 (5 by maintainers)
Top GitHub Comments
The growth is mostly due to the underlying frameworks growing. We are moving Triton to an architecture where it will be easier to remove unwanted frameworks from the container. You can actually do this now by using a multistage build and pulling over only the parts you want, but it can be tricky if you are not familiar with Docker.
20.03 is V1 only, so you could use the 20.06-v1 client with it. Once V2 matures a little more we will take it out of beta and will then have some backwards-compatibility guarantees for V2, but for now you should use V2 clients and server from the same release.
Please try with 20.07 and re-open if you still see the issue.