"failed to connect to all addresses" occurs by chance
Description When loading the same Triton model that we developed (Python backend), the load sometimes succeeds and sometimes fails with "failed to connect to all addresses".
Here is an example in which we load the same model twice: the first attempt fails and the second succeeds.
DEVELOPED_MODEL is our model name.
triton-srv | I0723 03:41:11.552163 1 model_repository_manager.cc:1065] loading: DEVELOPED_MODEL:1
triton-srv | I0723 03:41:11.662098 1 python.cc:604] TRITONBACKEND_ModelInstanceInitialize: DEVELOPED_MODEL(CPU device 0)
triton-srv | E0723 03:46:11.726061 1 model_repository_manager.cc:1242] failed to load 'DEVELOPED_MODEL' version 1: Internal: failed to connect to all addresses
triton-srv | I0723 03:49:14.709051 1 model_repository_manager.cc:1065] loading: DEVELOPED_MODEL:1
triton-srv | I0723 03:49:14.853075 1 python.cc:604] TRITONBACKEND_ModelInstanceInitialize: DEVELOPED_MODEL(CPU device 0)
triton-srv | 2021-07-23 03:50:14,859 - DEVELOPED_MODEL- INFO - Logger Set Up!
triton-srv | 2021-07-23 03:50:15,706 - DEVELOPED_MODEL - INFO - Model Initialization Complete!
triton-srv | I0723 03:50:15.707758 1 model_repository_manager.cc:1239] successfully loaded 'DEVELOPED_MODEL' version 1
We also set up Python logging in the module; the logger output appears in these lines:
triton-srv | 2021-07-23 03:50:14,859 - DEVELOPED_MODEL- INFO - Logger Set Up!
triton-srv | 2021-07-23 03:50:15,706 - DEVELOPED_MODEL - INFO - Model Initialization Complete!
But the first attempt shows no Python logger output at all, so we guess that something goes wrong or is delayed inside the Triton server.
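To put numbers on the suspected delay, the timestamps in the log above can be compared. The following is a minimal sketch, not part of the original report: the helper name `load_durations` is made up for illustration, and it assumes all lines fall on the same day so only the time of day is parsed.

```python
import re
from datetime import datetime

# Log lines copied from the report above (only the start/end lines of each attempt).
LOG = """\
triton-srv | I0723 03:41:11.552163 1 model_repository_manager.cc:1065] loading: DEVELOPED_MODEL:1
triton-srv | E0723 03:46:11.726061 1 model_repository_manager.cc:1242] failed to load 'DEVELOPED_MODEL' version 1: Internal: failed to connect to all addresses
triton-srv | I0723 03:49:14.709051 1 model_repository_manager.cc:1065] loading: DEVELOPED_MODEL:1
triton-srv | I0723 03:50:15.707758 1 model_repository_manager.cc:1239] successfully loaded 'DEVELOPED_MODEL' version 1
"""

# Severity letter, MMDD, time of day, thread id, file:line], message.
LINE = re.compile(r"[IWEF]\d{4} (\d{2}:\d{2}:\d{2}\.\d{6}) \d+ \S+\] (.*)")


def load_durations(log_text):
    """Return (outcome, seconds) for each load attempt found in the log."""
    results = []
    started = None
    for line in log_text.splitlines():
        match = LINE.search(line)
        if not match:
            continue  # e.g. lines written by the Python logger
        # Assumption: every line is from the same day, so the date is ignored.
        stamp = datetime.strptime(match.group(1), "%H:%M:%S.%f")
        message = match.group(2)
        if message.startswith("loading:"):
            started = stamp
        elif started is not None and message.startswith(
            ("failed to load", "successfully loaded")
        ):
            outcome = "failed" if message.startswith("failed") else "loaded"
            results.append((outcome, (stamp - started).total_seconds()))
            started = None
    return results


for outcome, seconds in load_durations(LOG):
    print(f"{outcome}: {seconds:.1f} s")
```

On the log above, the failed attempt ran for about 300 s before the error was reported, while the successful attempt took about 61 s.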
Our first fix, following this issue, was to set a larger timeout value (we use 60000 ms). It helps and makes this kind of error less likely, but the load still fails by chance.
Triton Information 21.03 - Triton Container
To Reproduce it occurs by chance …
Expected behavior Could anyone explain possible causes of this problem? I hope there is another solution besides raising the timeout to 120000 ms …
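Since the second load attempt in the log succeeded, one possible stopgap is to retry the load from the caller's side. This is only a sketch under stated assumptions: `load_with_retry` and `load_fn` are hypothetical names, and `load_fn` stands for whatever call triggers the load in your setup (for example, a wrapper around the `load_model` method of the `tritonclient` package, which applies only if the server runs in explicit model control mode; the report does not say which mode was used). It does not address the underlying cause.

```python
import time


def load_with_retry(load_fn, attempts=3, delay_s=5.0, sleep=time.sleep):
    """Call load_fn until it succeeds; return the attempt number that worked.

    load_fn is a hypothetical zero-argument callable that raises on failure.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            load_fn()
            return attempt
        except Exception as exc:  # broad on purpose: the error type is unknown here
            last_error = exc
            if attempt < attempts:
                sleep(delay_s)  # wait before the next attempt
    raise RuntimeError(f"model load failed after {attempts} attempts") from last_error


# Usage with a stand-in that fails once and then succeeds, as in the log above.
calls = {"count": 0}


def flaky_load():
    calls["count"] += 1
    if calls["count"] == 1:
        raise ConnectionError("failed to connect to all addresses")


print(load_with_retry(flaky_load, delay_s=0.0))
```

The `sleep` parameter is injectable so the helper can be exercised without real waiting; in practice the delay and attempt count would need tuning to how long the model takes to initialize.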
Issue Analytics
- Created 2 years ago
- Comments:5 (4 by maintainers)
Top GitHub Comments
Can you try with the latest version of Triton image?
@Tabrizian Just an update: we have since upgraded Triton to 21.07, and the error has not occurred again so far (though 21.07 seems to use more compute resources).