Submitting raw data via IPC
Hi, thanks for open-sourcing this project!
I experimented with the TensorRT inference server, and I found that with my target model (a TensorRT execution plan that has FP16 inputs and outputs), maxing out my system's two GPUs requires sending about 1.2 GBytes per second through the network stack. Scaled linearly, eight GPUs would need roughly 4.8 GBytes per second (about 38 Gbit/s), which is more than a single 10 GbE link can carry. In my view, this means that scaling this architecture to a server with eight (or even more) GPUs requires either (multiple) IB interconnects, or a preprocessor co-located with the inference server, which receives compressed images and sends raw data to the TRT server.
Once we assume that a preprocessor is located on the same physical node as the TRT inference server (and hope that the CPUs do not become the new bottleneck), it would be much preferable to submit raw data via IPC (e.g. through /dev/shm) to the inference server, and thus avoid the overhead introduced by gRPC.
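For illustration, here is a minimal producer-side sketch of that idea, using POSIX shared memory (which Linux backs with /dev/shm). The region name and tensor shape are made up for the example and are not part of TRTIS:

```cpp
// Hypothetical preprocessor: write a raw FP16 tensor into a named POSIX
// shared-memory region, so that only {name, offset, size} rather than
// the data itself would need to travel over gRPC. Link with -lrt on
// older glibc.
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstdint>
#include <cstring>
#include <vector>

int main() {
  const char* region_name = "/trt_input_0";  // shows up as /dev/shm/trt_input_0
  const size_t byte_size = 3 * 224 * 224 * sizeof(uint16_t);  // example FP16 image

  // Create the region and size it.
  int fd = shm_open(region_name, O_CREAT | O_RDWR, 0666);
  if (fd == -1) return 1;
  if (ftruncate(fd, byte_size) == -1) return 1;

  // Map it and copy the preprocessed tensor in.
  void* base = mmap(nullptr, byte_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  if (base == MAP_FAILED) return 1;
  std::vector<uint16_t> fp16_data(byte_size / sizeof(uint16_t));  // stand-in for real output
  std::memcpy(base, fp16_data.data(), byte_size);

  munmap(base, byte_size);
  close(fd);
  return 0;
}
```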
Here are my questions:
- Are the above assessment and the conclusions I draw from it reasonable?
- Do you have “submission of raw data via IPC mechanisms” on your roadmap? E.g. a feature where one submits a reference to the blob of preprocessed data in shared memory to the server via gRPC, and the server then loads this blob and uses it as input. If so, when do you plan on releasing it?
- If I were to implement a version of this myself, do you agree that a first quick-and-dirty approach would be to a) change the gRPC service proto, and then b) change GRPCInferRequestProvider::GetNextInputContent in tensorrt-inference-server/src/core/infer.cc accordingly? Did I overlook a place where changes are necessary? (A rough sketch of what b) might look like follows this list.)
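Purely to make step b) concrete, here is a rough consumer-side sketch. It assumes the request proto were extended with hypothetical shm_name/offset/byte_size fields; none of these exist in the actual service definition, and the real GetNextInputContent has a different signature:

```cpp
// Hypothetical server-side helper: instead of copying tensor bytes out
// of the gRPC message, map the named region and return a pointer into
// it. 'shm_name', 'offset', and 'byte_size' would come from imagined
// new request fields; a real version would also have to keep 'base'
// around so the mapping can be munmap'd after the inference completes.
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>

const void* MapSharedMemoryInput(
    const char* shm_name, size_t offset, size_t byte_size) {
  int fd = shm_open(shm_name, O_RDONLY, 0);
  if (fd == -1) return nullptr;

  // Map from the start of the region; mmap's own offset argument must
  // be page-aligned, so applying 'offset' in user space is simpler.
  void* base = mmap(nullptr, offset + byte_size, PROT_READ, MAP_SHARED, fd, 0);
  close(fd);  // the mapping stays valid after the fd is closed
  if (base == MAP_FAILED) return nullptr;
  return static_cast<const char*>(base) + offset;
}
```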
Again, thanks for making this tool available.
Top GitHub Comments
We have just started work on implementing a shared-memory API (option C). Changes will start to come into master and we expect to have an initial minimal implementation in about 3 weeks. The API will allow input and output tensors to be passed to/from TRTIS via shared memory instead of over the network. It will be the responsibility of an outside “agent” to create and manage the lifetime of the shared-memory regions. TRTIS will provide APIs that allow that “agent” to register/unregister these shared-memory regions with TRTIS, after which they can be used in inference requests.
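To make the “agent” responsibility concrete, here is a small RAII sketch of my own (not TRTIS code) for owning such a region; the actual TRTIS register/unregister calls are left as comments, since the real client API is shown in the example linked below:

```cpp
// Illustrative RAII owner for a POSIX shared-memory region, standing in
// for the "agent" that creates a region, registers it with TRTIS, and
// tears everything down again. The TRTIS calls themselves are omitted.
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>
#include <stdexcept>
#include <string>

class ShmRegion {
 public:
  ShmRegion(std::string name, size_t byte_size)
      : name_(std::move(name)), byte_size_(byte_size) {
    int fd = shm_open(name_.c_str(), O_CREAT | O_RDWR, 0666);
    if (fd == -1) throw std::runtime_error("shm_open failed");
    if (ftruncate(fd, byte_size_) == -1) {
      close(fd);
      throw std::runtime_error("ftruncate failed");
    }
    base_ = mmap(nullptr, byte_size_, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    if (base_ == MAP_FAILED) throw std::runtime_error("mmap failed");
    // Here the agent would register {name_, 0, byte_size_} with TRTIS.
  }

  ~ShmRegion() {
    // Here the agent would first unregister the region from TRTIS.
    munmap(base_, byte_size_);
    shm_unlink(name_.c_str());  // the agent owns the region's lifetime
  }

  void* data() const { return base_; }
  size_t size() const { return byte_size_; }

 private:
  std::string name_;
  size_t byte_size_;
  void* base_;
};
```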
The master branch now has the initial implementation of shared-memory support for input tensors, along with some minimal testing.
Currently only the C++ client API supports shared memory (Python support is TBD, but you can always use gRPC to generate client code for many languages). The C++ API changes are here: https://github.com/NVIDIA/tensorrt-inference-server/commit/6d33c8ca8cf5ec7eece925bb997d7f81df6caabe#diff-906ebe14e6f98b22609d12ac8433acc0
An example application is: https://github.com/NVIDIA/tensorrt-inference-server/blob/master/src/clients/c%2B%2B/simple_shm_client.cc. The L0_simple_shared_memory_example test performs some minimal testing using that example application.