CPU memory slowly increases when reusing an InferContext object across many requests
Description
I noticed that after a few hours of sending 4 × 500k requests to a Triton server (deployed with 4 TensorRT models), CPU memory usage increased by about 2% of 32 GB. After letting it run for a day, the memory increased further. However, if I create a new InferContext object for every request, memory usage does not go up after sending the same number of requests.
I used the HTTP protocol and synchronous API calls.
Triton Information
Server: nvcr.io/nvidia/tritonserver:20.03-py3
Client: nvcr.io/nvidia/tritonserver:20.03-py3-clientsdk
Are you using the Triton container or did you build it yourself? Triton container
To Reproduce
Here is the class I used for the Triton client to send requests to the server. Note that I used 4 TritonClient objects, one per model.
import sys
import logging

import tensorrtserver.api as triton


def get_model_info(url, protocol, model_name, verbose=False):
    ctx = triton.ServerStatusContext(url, protocol, model_name, verbose)
    server_status = ctx.get_server_status()
    if model_name not in server_status.model_status:
        raise Exception("unable to get status for {}".format(model_name))
    status = server_status.model_status[model_name]
    config = status.config
    return config.input, config.output


class TritonClient:
    def __init__(self, url, protocol, model_name, model_version, verbose=False):
        self.url = url
        self.protocol = triton.ProtocolType.from_str(protocol)
        self.model_name = model_name
        self.model_version = model_version
        input_nodes, output_nodes = get_model_info(self.url, self.protocol, self.model_name)
        self.input_name = input_nodes[0].name
        self.output_names = [output.name for output in output_nodes]
        # Moving this line into do_inference (i.e. creating a fresh
        # InferContext per request) resolves the memory growth.
        self.trt_ctx = triton.InferContext(
            self.url, self.protocol, self.model_name, self.model_version, verbose=verbose
        )
        self.output_dict = {
            name: triton.InferContext.ResultFormat.RAW for name in self.output_names
        }

    def do_inference(self, x: list, keep_name=False):
        batch_size = len(x)
        try:
            output = self.trt_ctx.run({self.input_name: x}, self.output_dict, batch_size)
        except triton.InferenceServerException as e:
            logging.error(e)
            sys.exit(1)
        if not keep_name:
            return [output[name] for name in self.output_names]
        return output
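As the comment in `__init__` notes, constructing the InferContext inside `do_inference` avoids the growth. Below is a sketch of that per-request variant, written as a hypothetical standalone helper that reuses the fields of a TritonClient instance; the SDK import is deferred into the function body so the sketch can be defined even where the client SDK is not installed.

```python
def do_inference_fresh_ctx(client, x, keep_name=False):
    # Deferred import: lets this sketch be defined without the
    # tensorrtserver client SDK installed.
    import tensorrtserver.api as triton

    # Build a fresh InferContext for this single request instead of
    # reusing the long-lived client.trt_ctx -- the pattern the report
    # says keeps CPU memory flat.
    ctx = triton.InferContext(client.url, client.protocol,
                              client.model_name, client.model_version)
    output = ctx.run({client.input_name: x}, client.output_dict, len(x))
    if not keep_name:
        return [output[name] for name in client.output_names]
    return output
```

The trade-off is the per-request cost of context setup and connection establishment, which is why reusing a single context is the natural first choice.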
Expected behavior
CPU memory should not increase when reusing the same InferContext object across requests.
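One way to check this expectation without external tooling is to track the process's peak resident set size across request batches. This is a stdlib-only sketch (the `resource` module is Unix-only, and on Linux `ru_maxrss` is reported in kilobytes); the dummy workload is a stand-in for real `do_inference` calls.

```python
import resource

def rss_mb():
    # Peak resident set size of this process; on Linux, ru_maxrss is
    # in kilobytes, so divide by 1024 to get megabytes.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

def measure_growth(run_batch, n_batches=10):
    # Record peak RSS, run the workload n_batches times, record again.
    start = rss_mb()
    for _ in range(n_batches):
        run_batch()
    return start, rss_mb()

# Dummy workload standing in for client.do_inference(batch); against a
# real deployment, pass a closure that sends one batch to Triton.
start, end = measure_growth(lambda: [b"x" * 1024 for _ in range(1000)])
print("RSS before: %.1f MB, after: %.1f MB" % (start, end))
```

If `end` keeps climbing run after run with a long-lived InferContext but stays flat with per-request contexts, that reproduces the behavior described above.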
Issue Analytics
- Created: 3 years ago
- Comments: 15 (5 by maintainers)
Top GitHub Comments
The growth is mostly due to the underlying frameworks growing. We are moving Triton to an architecture where it will be easier to remove unwanted frameworks from the container. You can actually do this now by using a multistage build and pulling over only the parts you want, but it can be tricky if you are not familiar with Docker.
20.03 is V1 only, so you could use the 20.06-v1 client with it. Once V2 matures a little more we will take it out of beta and will then have some backwards-compatibility guarantees for V2, but for now you should use V2 clients and server from the same release.
Please try with 20.07 and re-open if you still see the issue.