
Triton hangs on tensorflow1 backend cpu-only build


Description

Triton hangs during the initialization of the TensorFlow 1 runtime.

Triton Information

What version of Triton are you using? 2.20

Are you using the Triton container or did you build it yourself? A Triton container built with the build.py convenience script.

To Reproduce

All of this was tested on aarch64 and x86 machines without CUDA. The Docker image was built with:

./build.py --cmake-dir=$(pwd)/build --build-dir=/tmp/citritonbuild --enable-logging --enable-stats --enable-tracing --enable-metrics --endpoint=http --endpoint=grpc --backend=tensorflow1 --extra-backend-cmake-arg=tensorflow1:TRITON_TENSORFLOW_INSTALL_EXTRA_DEPS=ON
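To sanity-check that the resulting image actually contains the TensorFlow 1 backend (a quick inspection, not part of the original report), you can list the backend directory inside the image:

docker run --rm --entrypoint="" tritonserver ls /opt/tritonserver/backends/tensorflow1

This should list libtriton_tensorflow1.so, the shared library Triton loads for the backend.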

The Docker run command was:

docker run --rm -it --entrypoint="" -v $(pwd)/triton_model_repo:/models tritonserver bash

After doing this, the server output looks something like the following, and then the server hangs:

root@b0c1ff746084:/opt/tritonserver# tritonserver --model-repository /models
2022-03-30 16:36:43.807897: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-03-30 16:36:43.808035: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2022-03-30 16:36:43.808230: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-03-30 16:36:43.808262: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
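To see where startup is stuck, one option (a diagnostic sketch, not from the original report; it assumes gdb is installed in the container) is to attach a debugger from a second shell and dump every thread's backtrace:

gdb -p $(pgrep tritonserver)
(gdb) thread apply all bt

A hang during backend initialization would typically show one thread blocked inside the TensorFlow library's load or initialization call.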

Describe the models (framework, inputs, outputs), ideally including the model configuration file.

The server hangs before it even looks at the model repository: you can pass an invalid directory as the model repository and the behavior is the same.
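For reference, the model later used to verify the fix is inceptionv3. A minimal config.pbtxt for a TensorFlow 1 GraphDef model pinned to CPU might look like the sketch below; the tensor names and shapes are illustrative assumptions, not taken from the issue:

name: "inceptionv3"
platform: "tensorflow_graphdef"
max_batch_size: 8
input [
  {
    name: "input"  # hypothetical input tensor name
    data_type: TYPE_FP32
    dims: [ 299, 299, 3 ]
  }
]
output [
  {
    name: "predictions"  # hypothetical output tensor name
    data_type: TYPE_FP32
    dims: [ 1001 ]
  }
]
instance_group [
  {
    count: 2  # matches the two CPU instances in the log further down
    kind: KIND_CPU
  }
]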

Expected behavior

The TensorFlow 1 model should load properly and the server should not hang on a CPU-only build of Triton.

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 13 (13 by maintainers)

Top GitHub Comments

1 reaction
jishminor commented, Mar 30, 2022

Yes, running from the r22.03 branch as opposed to the release tag. I can also confirm that I saw the same behavior when running on main yesterday, before the 22.03 release.
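For anyone reproducing this: build.py can be pinned to a branch rather than a release tag through its --repo-tag options (a sketch based on the documented build.py flags; the component names are worth verifying against your Triton version):

./build.py --repo-tag=common:r22.03 --repo-tag=core:r22.03 --repo-tag=backend:r22.03 --repo-tag=thirdparty:r22.03 [remaining flags as in the build command above]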

0 reactions
jishminor commented, Mar 31, 2022

@CoderHam I can confirm this PR fixed the hang. Here is the log output for reference:

root@2172fe71744f:/opt/tritonserver# tritonserver --model-repository /models --model-control-mode explicit --load-model inceptionv3
2022-03-31 17:51:04.902447: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
I0331 17:51:05.000586 9 tensorflow.cc:2181] TRITONBACKEND_Initialize: tensorflow
I0331 17:51:05.000687 9 tensorflow.cc:2191] Triton TRITONBACKEND API version: 1.9
I0331 17:51:05.000724 9 tensorflow.cc:2197] 'tensorflow' TRITONBACKEND API version: 1.9
I0331 17:51:05.000755 9 tensorflow.cc:2221] backend configuration:
{}
I0331 17:51:05.003629 9 model_repository_manager.cc:1028] loading: inceptionv3:1
I0331 17:51:05.104028 9 tensorflow.cc:2281] TRITONBACKEND_ModelInitialize: inceptionv3 (version 1)
I0331 17:51:05.106562 9 tensorflow.cc:2330] TRITONBACKEND_ModelInstanceInitialize: inceptionv3_0_0 (CPU device 0)
2022-03-31 17:51:05.172819: W tensorflow/core/platform/profile_utils/cpu_utils.cc:98] Failed to find bogomips in /proc/cpuinfo; cannot determine CPU frequency
2022-03-31 17:51:05.188565: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0xffff30587900 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2022-03-31 17:51:05.188620: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2022-03-31 17:51:05.192564: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/targets/sbsa-linux/lib:/usr/local/cuda/lib64/stubs:
2022-03-31 17:51:05.192609: E tensorflow/stream_executor/cuda/cuda_driver.cc:282] failed call to cuInit: UNKNOWN ERROR (303)
2022-03-31 17:51:05.192671: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (2172fe71744f): /proc/driver/nvidia/version does not exist
W0331 17:51:05.348464 9 pinned_memory_manager.cc:133] failed to allocate pinned system memory: no pinned memory pool, falling back to non-pinned system memory
I0331 17:51:07.344419 9 tensorflow.cc:2330] TRITONBACKEND_ModelInstanceInitialize: inceptionv3_0_1 (CPU device 0)
I0331 17:51:09.246006 9 model_repository_manager.cc:1183] successfully loaded 'inceptionv3' version 1
I0331 17:51:09.246240 9 server.cc:522]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0331 17:51:09.246368 9 server.cc:549]
+------------+-----------------------------------------------------------------+--------+
| Backend    | Path                                                            | Config |
+------------+-----------------------------------------------------------------+--------+
| tensorflow | /opt/tritonserver/backends/tensorflow1/libtriton_tensorflow1.so | {}     |
+------------+-----------------------------------------------------------------+--------+

I0331 17:51:09.246570 9 server.cc:592]
+-------------+---------+--------+
| Model       | Version | Status |
+-------------+---------+--------+
| inceptionv3 | 1       | READY  |
+-------------+---------+--------+

W0331 17:51:09.246615 9 metrics.cc:325] Neither cache metrics nor gpu metrics are enabled. Not polling for them.
I0331 17:51:09.246915 9 tritonserver.cc:2123]
+----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                                                                                          |
+----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                                                                                                         |
| server_version                   | 2.21.0dev                                                                                                                                                                      |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data st |
|                                  | atistics trace                                                                                                                                                                 |
| model_repository_path[0]         | /models                                                                                                                                                                        |
| model_control_mode               | MODE_EXPLICIT                                                                                                                                                                  |
| startup_models_0                 | inceptionv3                                                                                                                                                                    |
| strict_model_config              | 1                                                                                                                                                                              |
| rate_limit                       | OFF                                                                                                                                                                            |
| pinned_memory_pool_byte_size     | 268435456                                                                                                                                                                      |
| response_cache_byte_size         | 0                                                                                                                                                                              |
| min_supported_compute_capability | 0.0                                                                                                                                                                            |
| strict_readiness                 | 1                                                                                                                                                                              |
| exit_timeout                     | 30                                                                                                                                                                             |
+----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0331 17:51:09.251606 9 grpc_server.cc:4542] Started GRPCInferenceService at 0.0.0.0:8001
I0331 17:51:09.252570 9 http_server.cc:3239] Started HTTPService at 0.0.0.0:8000
I0331 17:51:09.294148 9 http_server.cc:178] Started Metrics Service at 0.0.0.0:8002
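With the fix in place the server comes up cleanly. A quick way to confirm it is serving (a sketch using Triton's standard KServe v2 HTTP endpoints, run from a second shell) is:

curl -v localhost:8000/v2/health/ready
curl -v localhost:8000/v2/models/inceptionv3/ready

Both should return HTTP 200 once the model shows READY in the status table above.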