Triton server periodically stops responding to `ServerLive`, `ServerReady` and `RepositoryIndex` requests
Description
We have an app that periodically (every 15 seconds) queries triton-server for liveness (ServerLive request), readiness (ServerReady request) and model info (RepositoryIndex request). For the past few weeks, every hour (and now every 15 minutes), triton-server stops responding to these requests from the app for about 2-7 minutes.
These are the logs we are getting from our app. The app keeps sending ServerLive requests at the same rate (every 15 seconds) regardless of the response, but it skips sending ServerReady and RepositoryIndex requests if ServerLive fails (a minimal sketch of this polling loop is included below):
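For reference, here is a minimal sketch of that polling loop using Triton's Python tritonclient.grpc package (our real app uses its own gRPC client; the server address below is hypothetical, only the 15-second interval and the skip-on-failure behavior match what is described above):

```python
import time

import tritonclient.grpc as grpcclient
from tritonclient.utils import InferenceServerException

# Hypothetical address: the gRPC port of the triton-server pod.
client = grpcclient.InferenceServerClient(url="triton-server:8001")

while True:
    try:
        live = client.is_server_live()  # ServerLive request
    except InferenceServerException as exc:
        print(f"ServerLive failed: {exc}")
        live = False

    if live:
        # ServerReady and RepositoryIndex are only sent when ServerLive succeeds.
        try:
            ready = client.is_server_ready()             # ServerReady request
            index = client.get_model_repository_index()  # RepositoryIndex request
            print(f"ready={ready}, models={len(index.models)}")
        except InferenceServerException as exc:
            print(f"ServerReady/RepositoryIndex failed: {exc}")

    time.sleep(15)
```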
Increasing the context timeout value in the client app from 10 seconds to an hour does not help. We tried sending RepositoryIndex requests during the bug, but they also time out. On the triton-server side, these are the logs we get when the bug happens (excluding some specific lines/noise that you can see in the filter). Notice the time jump between the two selected lines; that is exactly when the timeout errors start appearing in the client app. In particular, we can see that the RepositoryIndex call with process id 2485 started at 18:11:27.792 and only finished at 18:17:23.783:
During that time jump / bug, ServerLive and ServerReady requests also stop working. However, also during the bug, the HTTP endpoint /v2/health/ready keeps working and reports that triton-server is indeed ready (a sketch comparing the two paths follows below).
The problem always happens to all instances of our client app at the same time, regardless of when each client app instance was started.
We are running triton-server as a single pod in a GKE environment (version 1.21), injected with a Linkerd sidecar proxy. But I think we can rule Linkerd out, since the problem also hits a local instance of our client app (running on a laptop) at the same time as the other instances of the client app (running as pods in the cluster). The local instance is connected to the same triton-server pod through its container port using kubectl port-forward.
The only lead we have is that network traffic sometimes increases dramatically at the same time the bug happens (not always though). We noticed no unusual spikes in other resource usage (CPU, memory, disk I/O) during the bug:
Our models are stored on GCS, but we did not activate repository polling, and to my understanding it should be disabled by default. We start the triton-server process in the container with the command:
tritonserver --model-store=gs://test-repo/tfserving --strict-model-config=true --min-supported-compute-capability=3.7 --log-verbose=1 --backend-config=tensorflow,version=2
Triton Information
Using container tritonserver:22.01-py3
To Reproduce
Send ServerLive, ServerReady and RepositoryIndex requests to triton-server every 15 seconds.
Expected behavior
Always get a response for all requests.
Top GitHub Comments
This most likely means the handler thread (shared between non-infer API calls) is getting stuck. Let's see if upgrading gRPC resolves the issue.
The gRPC upgrade will be included in 22.08, which will be released in late August; you may build from source to test it out before the release.