Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging third-party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

error: creating server: INTERNAL - failed to load all models

See original GitHub issue

Description

I just ran a simple demo. The model was downloaded via tensorflow.keras.applications.resnet50 and saved with model.save('./resnet50', save_format='tf').
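
For reference, a minimal sketch of that export step (assuming TensorFlow 2.x; weights="imagenet" is my assumption, not stated in the issue):

import tensorflow as tf

# Download ResNet50 via Keras applications and export it as a
# TensorFlow SavedModel, as described above.
model = tf.keras.applications.resnet50.ResNet50(weights="imagenet")
model.save('./resnet50', save_format='tf')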

structure:

models
├── resnet50
│  ├── 1
│  │  └── model.savedmodel
│  │     ├── assets
│  │     ├── saved_model.pb
│  │     └── variables
│  │        ├── variables.data-00000-of-00001
│  │        └── variables.index
│  ├── config.pbtxt
│  └── resnet50_label.txt
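
Note that model.save() writes the SavedModel to ./resnet50 itself, while the server expects it under a numeric version directory. A sketch of the shell steps that would produce the layout above (exact paths are assumptions inferred from the tree):

mkdir -p models/resnet50/1
mv ./resnet50 models/resnet50/1/model.savedmodel
# config.pbtxt and the label file sit next to the version directory
cp config.pbtxt resnet50_label.txt models/resnet50/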

config.pbtxt:

name: "resnet50"
platform: "tensorflow_savedmodel"
max_batch_size: 128
input [
{
    name: "input_1"
    data_type: TYPE_FP32
    format: FORMAT_NHWC
    dims: [224, 224, 3]
}
]
output [
{
    name: "predictions"
    data_type: TYPE_FP32
    dims: [ 1000 ]
    label_filename: "resnet50_label.txt"
}
]
instance_group [
{
    count: 1
    kind: KIND_GPU
    gpus: [0]
}
]
dynamic_batching {
    preferred_batch_size: [32, 64]
    max_queue_delay_microseconds: 10
}

Triton Information

nvcr.io/nvidia/tensorrtserver:19.10-py3

To Reproduce

nvidia-docker run --rm --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -p8000:8000 -p8001:8001 -p8002:8002 -v /home/me/models:/models nvcr.io/nvidia/tensorrtserver:19.10-py3 trtserver --model-repository=/models

I’m sure the GPU driver is compatible with this Docker image. I can run another TensorFlow model successfully.
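
Since every top-level entry of the mounted repository is scanned as a model directory, it can help to list the repository, including hidden entries, before starting the server (a sketch; the path is the one mounted above):

ls -a /home/me/models
# entries without a config.pbtxt (e.g. .git/, .vscode/, a stray models/)
# will produce "failed to open text file for read" errors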

output:

===============================
== TensorRT Inference Server ==
===============================

NVIDIA Release 19.10 (build 8266503)

Copyright (c) 2018-2019, NVIDIA CORPORATION.  All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION.  All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying
project or file.

I0415 02:47:12.663739 1 metrics.cc:160] found 3 GPUs supporting NVML metrics
I0415 02:47:12.669485 1 metrics.cc:169]   GPU 0: Tesla V100-PCIE-16GB
I0415 02:47:12.675331 1 metrics.cc:169]   GPU 1: Tesla V100-PCIE-16GB
I0415 02:47:12.681248 1 metrics.cc:169]   GPU 2: Tesla V100-PCIE-16GB
I0415 02:47:12.681421 1 server.cc:110] Initializing TensorRT Inference Server
E0415 02:47:12.795065 1 model_repository_manager.cc:1453] failed to open text file for read /models/.git/config.pbtxt: No such file or directory
E0415 02:47:12.795112 1 model_repository_manager.cc:1453] failed to open text file for read /models/.vscode/config.pbtxt: No such file or directory
E0415 02:47:12.795224 1 model_repository_manager.cc:1453] failed to open text file for read /models/models/config.pbtxt: No such file or directory
I0415 02:47:12.799405 1 server_status.cc:83] New status tracking for model 'resnet50'
I0415 02:47:12.799463 1 model_repository_manager.cc:663] loading: resnet50:1
I0415 02:47:12.802320 1 base_backend.cc:166] Creating instance resnet50_0_0_gpu0 on GPU 0 (7.0) using model.savedmodel
2020-04-15 02:47:12.916423: I tensorflow/cc/saved_model/reader.cc:31] Reading SavedModel from: /models/resnet50/1/model.savedmodel
2020-04-15 02:47:12.979608: I tensorflow/cc/saved_model/reader.cc:54] Reading meta graph with tags { serve }
2020-04-15 02:47:13.096501: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2100000000 Hz
2020-04-15 02:47:13.100001: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7fa836c7bd40 executing computations on platform Host. Devices:
2020-04-15 02:47:13.100032: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>
2020-04-15 02:47:13.100141: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2020-04-15 02:47:13.348829: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: Tesla V100-PCIE-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
pciBusID: 0000:3b:00.0
2020-04-15 02:47:13.349927: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 1 with properties:
name: Tesla V100-PCIE-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
pciBusID: 0000:af:00.0
2020-04-15 02:47:13.352179: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 2 with properties:
name: Tesla V100-PCIE-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
pciBusID: 0000:d8:00.0
2020-04-15 02:47:13.352188: I tensorflow/stream_executor/platform/default/dlopen_checker_stub.cc:25] GPU libraries are statically linked, skip dlopen check.
2020-04-15 02:47:13.358074: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0, 1, 2
2020-04-15 02:47:21.339782: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-04-15 02:47:21.339820: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0 1 2
2020-04-15 02:47:21.339827: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N Y Y
2020-04-15 02:47:21.339831: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 1:   Y N Y
2020-04-15 02:47:21.339835: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 2:   Y Y N
2020-04-15 02:47:21.345171: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14485 MB memory) -> physical GPU (device: 0, name: Tesla V100-PCIE-16GB, pci bus id: 0000:3b:00.0, compute capability: 7.0)
2020-04-15 02:47:21.347898: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 14485 MB memory) -> physical GPU (device: 1, name: Tesla V100-PCIE-16GB, pci bus id: 0000:af:00.0, compute capability: 7.0)
2020-04-15 02:47:21.350249: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 14485 MB memory) -> physical GPU (device: 2, name: Tesla V100-PCIE-16GB, pci bus id: 0000:d8:00.0, compute capability: 7.0)
2020-04-15 02:47:21.354405: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7fa3f2795160 executing computations on platform CUDA. Devices:
2020-04-15 02:47:21.354421: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Tesla V100-PCIE-16GB, Compute Capability 7.0
2020-04-15 02:47:21.354427: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (1): Tesla V100-PCIE-16GB, Compute Capability 7.0
2020-04-15 02:47:21.354432: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (2): Tesla V100-PCIE-16GB, Compute Capability 7.0
2020-04-15 02:47:21.595657: I tensorflow/cc/saved_model/loader.cc:204] Restoring SavedModel bundle.
2020-04-15 02:47:22.342995: I tensorflow/cc/saved_model/loader.cc:153] Running initialization op on SavedModel bundle at path: /models/resnet50/1/model.savedmodel
2020-04-15 02:47:22.600629: I tensorflow/cc/saved_model/loader.cc:332] SavedModel load for tags { serve }; Status: success. Took 9684222 microseconds.
I0415 02:47:22.600987 1 model_repository_manager.cc:807] successfully loaded 'resnet50' version 1
I0415 02:47:22.801777 1 model_repository_manager.cc:793] successfully unloaded 'resnet50' version 1
E0415 02:47:22.801841 1 main.cc:1099] error: creating server: INTERNAL - failed to load all models


Expected behavior

Sorry, I cannot find any useful error messages.

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 16

Top GitHub Comments

2 reactions
kemingy commented, Apr 16, 2020

That’s correct, all models in the directories will be checked when trtserver starts. There is an option, --exit-on-error=false, that may help.

I agree that all the models in the directories should be checked. But maybe it should skip directories whose names start with a dot? For example, .git/ and .vscode/.
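
For anyone landing here, a sketch of the suggested workaround, i.e. the original command from this issue with the flag from the quoted reply appended (the server then starts even if some directories fail to load as models):

nvidia-docker run --rm --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -p8000:8000 -p8001:8001 -p8002:8002 -v /home/me/models:/models nvcr.io/nvidia/tensorrtserver:19.10-py3 trtserver --model-repository=/models --exit-on-error=false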

1 reaction
kemingy commented, Apr 15, 2020

Sorry, using your zip file, I cannot reproduce your results on the 20.02, 19.09, and 19.10 trtserver images. Everything seems OK and says “successfully loaded ‘resnet50’ version 1”.

I moved it to a clean directory and it works. Is it because the old folder contains several non-model directories? This doesn’t make sense.

E0415 02:47:12.795065 1 model_repository_manager.cc:1453] failed to open text file for read /models/.git/config.pbtxt: No such file or directory
E0415 02:47:12.795112 1 model_repository_manager.cc:1453] failed to open text file for read /models/.vscode/config.pbtxt: No such file or directory
E0415 02:47:12.795224 1 model_repository_manager.cc:1453] failed to open text file for read /models/models/config.pbtxt: No such file or directory

Then I tried git init. After that, it failed to load the models.
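
That sequence can be reproduced directly (a sketch; the clean directory name is hypothetical):

cd /home/me/clean_models   # a repository that previously loaded fine
git init                   # creates a hidden .git/ directory
# restarting trtserver against this repository now fails, because .git/
# is scanned as a model directory and has no config.pbtxt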

Read more comments on GitHub >

Top Results From Across the Web

error: creating server: INTERNAL - failed to load all models ...
I just run a simple demo. The model is downloaded by tensorflow.keras.applications.resnet50 and saved with model.save('./resnet50', save_format ...
Read more >
Triton server died before reaching ready state. Terminating ...
Hi, I want to set up the Jarvis server with jarvis_init.sh, but is facing a ... error: creating server: Internal - failed to...
Read more >
Triton Inference Server: The Basics and a Quick Tutorial
Learn about the NVIDIA Triton Inference Server, its key features, models and model repositories, client libraries, and get started with a quick tutorial....
Read more >
Security Xray Scan Knife Detection - Seeed Wiki
Knife Detection: An Object Detection Model deployed on Triton Inference Sever based on ... if error: creating server: Internal - failed to load...
Read more >
triton-inference-server reports “Internal - failed to load all models” on startup
Read more >
