error: creating server: INTERNAL - failed to load all models
Description
I just ran a simple demo. The model is downloaded via `tensorflow.keras.applications.resnet50` and saved with `model.save('./resnet50', save_format='tf')`.
structure:

```
models
├── resnet50
│   ├── 1
│   │   └── model.savedmodel
│   │       ├── assets
│   │       ├── saved_model.pb
│   │       └── variables
│   │           ├── variables.data-00000-of-00001
│   │           └── variables.index
│   ├── config.pbtxt
│   └── resnet50_label.txt
```
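The repository skeleton above (model name, numeric version directory, `model.savedmodel` inside it) can be assembled with a few commands. The files here are empty stand-ins; in the real flow `model.savedmodel` is the directory produced by `model.save(...)`:

```shell
# Build the Triton model-repository skeleton. Stand-in files only: the real
# model.savedmodel directory comes from Keras' model.save('./resnet50',
# save_format='tf') and is copied/renamed into the version directory.
mkdir -p models/resnet50/1/model.savedmodel/variables
touch models/resnet50/1/model.savedmodel/saved_model.pb
touch models/resnet50/config.pbtxt
touch models/resnet50/resnet50_label.txt

# Show the resulting layout.
find models -mindepth 1 | sort
```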
config.pbtxt:

```
name: "resnet50"
platform: "tensorflow_savedmodel"
max_batch_size: 128
input [
  {
    name: "input_1"
    data_type: TYPE_FP32
    format: FORMAT_NHWC
    dims: [ 224, 224, 3 ]
  }
]
output [
  {
    name: "predictions"
    data_type: TYPE_FP32
    dims: [ 1000 ]
    label_filename: "resnet50_label.txt"
  }
]
instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]
dynamic_batching {
  preferred_batch_size: [ 32, 64 ]
  max_queue_delay_microseconds: 10
}
```
Triton Information
`nvcr.io/nvidia/tensorrtserver:19.10-py3`
To Reproduce

```
nvidia-docker run --rm --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 \
    -p8000:8000 -p8001:8001 -p8002:8002 \
    -v /home/me/models:/models \
    nvcr.io/nvidia/tensorrtserver:19.10-py3 trtserver --model-repository=/models
```
I'm sure the GPU driver is compatible with this Docker image, and I can run another TensorFlow model successfully.
output:
===============================
== TensorRT Inference Server ==
===============================
NVIDIA Release 19.10 (build 8266503)
Copyright (c) 2018-2019, NVIDIA CORPORATION. All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION. All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying
project or file.
I0415 02:47:12.663739 1 metrics.cc:160] found 3 GPUs supporting NVML metrics
I0415 02:47:12.669485 1 metrics.cc:169] GPU 0: Tesla V100-PCIE-16GB
I0415 02:47:12.675331 1 metrics.cc:169] GPU 1: Tesla V100-PCIE-16GB
I0415 02:47:12.681248 1 metrics.cc:169] GPU 2: Tesla V100-PCIE-16GB
I0415 02:47:12.681421 1 server.cc:110] Initializing TensorRT Inference Server
E0415 02:47:12.795065 1 model_repository_manager.cc:1453] failed to open text file for read /models/.git/config.pbtxt: No such file or directory
E0415 02:47:12.795112 1 model_repository_manager.cc:1453] failed to open text file for read /models/.vscode/config.pbtxt: No such file or directory
E0415 02:47:12.795224 1 model_repository_manager.cc:1453] failed to open text file for read /models/models/config.pbtxt: No such file or directory
I0415 02:47:12.799405 1 server_status.cc:83] New status tracking for model 'resnet50'
I0415 02:47:12.799463 1 model_repository_manager.cc:663] loading: resnet50:1
I0415 02:47:12.802320 1 base_backend.cc:166] Creating instance resnet50_0_0_gpu0 on GPU 0 (7.0) using model.savedmodel
2020-04-15 02:47:12.916423: I tensorflow/cc/saved_model/reader.cc:31] Reading SavedModel from: /models/resnet50/1/model.savedmodel
2020-04-15 02:47:12.979608: I tensorflow/cc/saved_model/reader.cc:54] Reading meta graph with tags { serve }
2020-04-15 02:47:13.096501: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2100000000 Hz
2020-04-15 02:47:13.100001: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7fa836c7bd40 executing computations on platform Host. Devices:
2020-04-15 02:47:13.100032: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): <undefined>, <undefined>
2020-04-15 02:47:13.100141: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2020-04-15 02:47:13.348829: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: Tesla V100-PCIE-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
pciBusID: 0000:3b:00.0
2020-04-15 02:47:13.349927: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 1 with properties:
name: Tesla V100-PCIE-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
pciBusID: 0000:af:00.0
2020-04-15 02:47:13.352179: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 2 with properties:
name: Tesla V100-PCIE-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
pciBusID: 0000:d8:00.0
2020-04-15 02:47:13.352188: I tensorflow/stream_executor/platform/default/dlopen_checker_stub.cc:25] GPU libraries are statically linked, skip dlopen check.
2020-04-15 02:47:13.358074: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0, 1, 2
2020-04-15 02:47:21.339782: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-04-15 02:47:21.339820: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187] 0 1 2
2020-04-15 02:47:21.339827: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0: N Y Y
2020-04-15 02:47:21.339831: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 1: Y N Y
2020-04-15 02:47:21.339835: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 2: Y Y N
2020-04-15 02:47:21.345171: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14485 MB memory) -> physical GPU (device: 0, name: Tesla V100-PCIE-16GB, pci bus id: 0000:3b:00.0, compute capability: 7.0)
2020-04-15 02:47:21.347898: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 14485 MB memory) -> physical GPU (device: 1, name: Tesla V100-PCIE-16GB, pci bus id: 0000:af:00.0, compute capability: 7.0)
2020-04-15 02:47:21.350249: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 14485 MB memory) -> physical GPU (device: 2, name: Tesla V100-PCIE-16GB, pci bus id: 0000:d8:00.0, compute capability: 7.0)
2020-04-15 02:47:21.354405: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7fa3f2795160 executing computations on platform CUDA. Devices:
2020-04-15 02:47:21.354421: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): Tesla V100-PCIE-16GB, Compute Capability 7.0
2020-04-15 02:47:21.354427: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (1): Tesla V100-PCIE-16GB, Compute Capability 7.0
2020-04-15 02:47:21.354432: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (2): Tesla V100-PCIE-16GB, Compute Capability 7.0
2020-04-15 02:47:21.595657: I tensorflow/cc/saved_model/loader.cc:204] Restoring SavedModel bundle.
2020-04-15 02:47:22.342995: I tensorflow/cc/saved_model/loader.cc:153] Running initialization op on SavedModel bundle at path: /models/resnet50/1/model.savedmodel
2020-04-15 02:47:22.600629: I tensorflow/cc/saved_model/loader.cc:332] SavedModel load for tags { serve }; Status: success. Took 9684222 microseconds.
I0415 02:47:22.600987 1 model_repository_manager.cc:807] successfully loaded 'resnet50' version 1
I0415 02:47:22.801777 1 model_repository_manager.cc:793] successfully unloaded 'resnet50' version 1
E0415 02:47:22.801841 1 main.cc:1099] error: creating server: INTERNAL - failed to load all models
Expected behavior
Sorry, I cannot find any useful error messages.
Issue Analytics
- Created: 3 years ago
- Comments: 16
Top GitHub Comments
I agree all the models in the dirs should be checked, but maybe it should skip directories whose names start with a dot? For example, `.git/` and `.vscode/`.
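Until the server itself skips hidden entries, a host-side workaround along these lines is possible (directory names here are hypothetical): copy only the non-hidden top-level directories into the repository that gets mounted, relying on the fact that shell globs do not match dot-entries by default.

```shell
# Mock repository containing stray dot-directories (hypothetical layout).
mkdir -p repo/.git repo/.vscode repo/resnet50/1

# Copy only non-hidden top-level directories into a clean repository.
# The glob repo/*/ does not match dot-entries, so .git/ and .vscode/
# are left behind.
mkdir -p clean_repo
for d in repo/*/; do
    cp -r "$d" clean_repo/
done

ls -A clean_repo    # only resnet50 remains
```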
I moved it to a clean directory and it works. Is it because the old folder contains several non-model directories? This doesn't make sense. Then I tried `git init`; after that, it failed to load the models again.
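This observation is easy to reproduce on any machine with git installed (directory names hypothetical): the repository root gains a hidden `.git/` entry the moment `git init` runs, and the 19.10 server scans every top-level directory, hidden or not, for a `config.pbtxt`.

```shell
# A fresh repository root with a single model directory (hypothetical name).
mkdir -p demo_repo/resnet50/1
ls -A demo_repo          # only resnet50: the server would load this fine

# `git init` drops a .git/ directory into the repository root ...
git init -q demo_repo

# ... which the server then scans as a model, failing on the missing
# demo_repo/.git/config.pbtxt.
ls -A demo_repo          # now .git sits alongside resnet50
```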