Python backend fails with PyTorch > 1.6.0

Description

The support matrix mentions that Triton works with PyTorch 1.8.0 (which is not released yet), but the Python backend seems to fail to load Python models when any PyTorch version > 1.6.0 is installed. The error is:

...

I0121 11:59:09.866666 411 python.cc:696] TRITONBACKEND_ModelInstanceInitialize: torch-simple_0 (CPU device 0)
Traceback (most recent call last):
  File "/opt/tritonserver/backends/python/startup.py", line 360, in <module>
    python_host = PythonHost(module_path=FLAGS.model_path)
  File "/opt/tritonserver/backends/python/startup.py", line 161, in __init__
    spec.loader.exec_module(module)
  File "<frozen importlib._bootstrap_external>", line 678, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/workspace/models/torch-simple/1/model.py", line 7, in <module>
    import torch
  File "/usr/local/lib/python3.6/dist-packages/torch/__init__.py", line 190, in <module>
    from torch._C import *
ImportError: /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so: undefined symbol: _ZTIN5torch11distributed3rpc8RpcAgentE

...

I0121 11:59:14.873892 411 server.cc:277] No server context available. Exiting immediately.
error: creating server: Invalid argument - load failed for model 'torch-simple': version 1: Internal: failed to connect to all addresses;
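
The undefined symbol in the ImportError above is a mangled C++ name (Itanium C++ ABI). As a minimal sketch, the helper below decodes only this one shape of symbol (`_ZTIN...E`, i.e. typeinfo for a nested name); `demangle_typeinfo` is a hypothetical name introduced for illustration, and a real tool such as `c++filt` handles the general case.

```python
import re


def demangle_typeinfo(symbol: str) -> str:
    """Decode a `_ZTIN<len><name>...E` symbol (typeinfo for a nested name).

    Illustrative only: any other mangled form is rejected.
    """
    prefix = "_ZTIN"
    if not (symbol.startswith(prefix) and symbol.endswith("E")):
        raise ValueError("this sketch only handles _ZTIN...E symbols")

    body = symbol[len(prefix):-1]
    parts = []
    pos = 0
    while pos < len(body):
        # Each component is a decimal length followed by that many characters.
        match = re.match(r"\d+", body[pos:])
        if match is None:
            raise ValueError("expected a length prefix at offset %d" % pos)
        length = int(match.group())
        start = pos + match.end()
        name = body[start:start + length]
        if len(name) != length:
            raise ValueError("symbol is truncated")
        parts.append(name)
        pos = start + length
    return "typeinfo for " + "::".join(parts)


print(demangle_typeinfo("_ZTIN5torch11distributed3rpc8RpcAgentE"))
```

The decoded name shows that the missing symbol is the type information for the `RpcAgent` class in the `torch::distributed::rpc` namespace, which `libtorch_python.so` expects another loaded library to provide.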

Triton Information

What version of Triton are you using? 2.5.0, inside the 20.11 container.

Are you using the Triton container or did you build it yourself? The Triton container, nvcr.io/nvidia/tritonserver:20.11-py

To Reproduce

Running on a machine with a T4 GPU, with CUDA 11.1 (driver 455.45.01)

Model (configuration, followed by model.py)

name: "torch-simple"
backend: "python"
max_batch_size: 10
input [
  {
    name: "input"
    data_type: TYPE_STRING
    dims: [ 1 ]
  }
]
output [
  {
    name: "output_img"
    data_type: TYPE_FP32
    dims: [ 3, 512, 512 ]
  }
]
instance_group [
  {
    count: 1
    kind: KIND_CPU
  }
]
dynamic_batching { }
version_policy: { all { } }

import json

from typing import List

from triton_python_backend_utils import Tensor, InferenceResponse, InferenceRequest

import torch # -> these imports are what makes the loading fail.
import torchvision

class TritonPythonModel(object):
    def __init__(self):
        pass

    def initialize(self, args):
        model_config = json.loads(args['model_config'])

    def execute(self, inference_requests: List[InferenceRequest]) -> List[InferenceResponse]:

        responses = []

        return responses

Run the Triton container and install PyTorch in the default Python interpreter:

docker run -it --rm --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=0 -v $(pwd):/workspace nvcr.io/nvidia/tritonserver:20.11-py

pip3 install torch==1.7.1+cpu torchvision==0.8.2+cpu -f https://download.pytorch.org/whl/torch_stable.html

tritonserver --model-repository=/workspace/models  --log-verbose 1 --model-control-mode=explicit --load-model=torch-simple

The full log:

I0121 12:12:44.834513 575 metrics.cc:219] Collecting metrics for GPU 0: Tesla T4
I0121 12:12:45.157636 575 pinned_memory_manager.cc:199] Pinned memory pool is created at '0x7f912c000000' with size 268435456
I0121 12:12:45.158103 575 cuda_memory_manager.cc:99] CUDA memory pool is created on device 0 with size 67108864
I0121 12:12:45.159018 575 backend_factory.h:44] Create TritonBackendFactory
I0121 12:12:45.159052 575 plan_backend_factory.cc:48] Create PlanBackendFactory
I0121 12:12:45.159066 575 plan_backend_factory.cc:55] Registering TensorRT Plugins
I0121 12:12:45.159111 575 logging.cc:52] Registered plugin creator - ::BatchTilePlugin_TRT version 1
I0121 12:12:45.159126 575 logging.cc:52] Registered plugin creator - ::BatchedNMS_TRT version 1
I0121 12:12:45.159135 575 logging.cc:52] Registered plugin creator - ::BatchedNMSDynamic_TRT version 1
I0121 12:12:45.159147 575 logging.cc:52] Registered plugin creator - ::CoordConvAC version 1
I0121 12:12:45.159161 575 logging.cc:52] Registered plugin creator - ::CropAndResize version 1
I0121 12:12:45.159180 575 logging.cc:52] Registered plugin creator - ::DetectionLayer_TRT version 1
I0121 12:12:45.159196 575 logging.cc:52] Registered plugin creator - ::FlattenConcat_TRT version 1
I0121 12:12:45.159233 575 logging.cc:52] Registered plugin creator - ::GenerateDetection_TRT version 1
I0121 12:12:45.159247 575 logging.cc:52] Registered plugin creator - ::GridAnchor_TRT version 1
I0121 12:12:45.159263 575 logging.cc:52] Registered plugin creator - ::GridAnchorRect_TRT version 1
I0121 12:12:45.159275 575 logging.cc:52] Registered plugin creator - ::InstanceNormalization_TRT version 1
I0121 12:12:45.159291 575 logging.cc:52] Registered plugin creator - ::LReLU_TRT version 1
I0121 12:12:45.159303 575 logging.cc:52] Registered plugin creator - ::MultilevelCropAndResize_TRT version 1
I0121 12:12:45.159317 575 logging.cc:52] Registered plugin creator - ::MultilevelProposeROI_TRT version 1
I0121 12:12:45.159326 575 logging.cc:52] Registered plugin creator - ::NMS_TRT version 1
I0121 12:12:45.159334 575 logging.cc:52] Registered plugin creator - ::Normalize_TRT version 1
I0121 12:12:45.159343 575 logging.cc:52] Registered plugin creator - ::PriorBox_TRT version 1
I0121 12:12:45.159354 575 logging.cc:52] Registered plugin creator - ::ProposalLayer_TRT version 1
I0121 12:12:45.159368 575 logging.cc:52] Registered plugin creator - ::Proposal version 1
I0121 12:12:45.159379 575 logging.cc:52] Registered plugin creator - ::PyramidROIAlign_TRT version 1
I0121 12:12:45.159393 575 logging.cc:52] Registered plugin creator - ::Region_TRT version 1
I0121 12:12:45.159408 575 logging.cc:52] Registered plugin creator - ::Reorg_TRT version 1
I0121 12:12:45.159420 575 logging.cc:52] Registered plugin creator - ::ResizeNearest_TRT version 1
I0121 12:12:45.159429 575 logging.cc:52] Registered plugin creator - ::RPROI_TRT version 1
I0121 12:12:45.159441 575 logging.cc:52] Registered plugin creator - ::SpecialSlice_TRT version 1
I0121 12:12:45.159458 575 logging.cc:52] Registered plugin creator - ::Split version 1
I0121 12:12:45.159475 575 libtorch_backend_factory.cc:53] Create LibTorchBackendFactory
I0121 12:12:45.159493 575 custom_backend_factory.cc:46] Create CustomBackendFactory
I0121 12:12:45.159509 575 ensemble_backend_factory.cc:47] Create EnsembleBackendFactory
I0121 12:12:45.160867 575 model_repository_manager.cc:578] AsyncLoad() 'torch-simple'
I0121 12:12:45.160917 575 model_repository_manager.cc:753] TriggerNextAction() 'torch-simple' version 1: 1
I0121 12:12:45.160931 575 model_repository_manager.cc:791] Load() 'torch-simple' version 1
I0121 12:12:45.160938 575 model_repository_manager.cc:810] loading: torch-simple:1
I0121 12:12:45.161139 575 model_repository_manager.cc:863] CreateInferenceBackend() 'torch-simple' version 1
I0121 12:12:45.162517 575 python.cc:567] 'python' TRITONBACKEND API version: 1.0
I0121 12:12:45.162541 575 python.cc:587] backend configuration:
{}
I0121 12:12:45.162761 575 python.cc:647] TRITONBACKEND_ModelInitialize: torch-simple (version 1)
I0121 12:12:45.163674 575 model_config_utils.cc:1557] ModelConfig 64-bit fields:
I0121 12:12:45.163694 575 model_config_utils.cc:1559]   ModelConfig::dynamic_batching::default_queue_policy::default_timeout_microseconds
I0121 12:12:45.163700 575 model_config_utils.cc:1559]   ModelConfig::dynamic_batching::max_queue_delay_microseconds
I0121 12:12:45.163710 575 model_config_utils.cc:1559]   ModelConfig::dynamic_batching::priority_queue_policy::value::default_timeout_microseconds
I0121 12:12:45.163723 575 model_config_utils.cc:1559]   ModelConfig::ensemble_scheduling::step::model_version
I0121 12:12:45.163730 575 model_config_utils.cc:1559]   ModelConfig::input::dims
I0121 12:12:45.163735 575 model_config_utils.cc:1559]   ModelConfig::input::reshape::shape
I0121 12:12:45.163740 575 model_config_utils.cc:1559]   ModelConfig::model_warmup::inputs::value::dims
I0121 12:12:45.163746 575 model_config_utils.cc:1559]   ModelConfig::optimization::cuda::graph_spec::graph_lower_bound::input::value::dim
I0121 12:12:45.163752 575 model_config_utils.cc:1559]   ModelConfig::optimization::cuda::graph_spec::input::value::dim
I0121 12:12:45.163762 575 model_config_utils.cc:1559]   ModelConfig::output::dims
I0121 12:12:45.163768 575 model_config_utils.cc:1559]   ModelConfig::output::reshape::shape
I0121 12:12:45.163775 575 model_config_utils.cc:1559]   ModelConfig::sequence_batching::direct::max_queue_delay_microseconds
I0121 12:12:45.163781 575 model_config_utils.cc:1559]   ModelConfig::sequence_batching::max_sequence_idle_microseconds
I0121 12:12:45.163793 575 model_config_utils.cc:1559]   ModelConfig::sequence_batching::oldest::max_queue_delay_microseconds
I0121 12:12:45.163798 575 model_config_utils.cc:1559]   ModelConfig::version_policy::specific::versions
I0121 12:12:45.163931 575 python.cc:696] TRITONBACKEND_ModelInstanceInitialize: torch-simple_0 (CPU device 0)
Traceback (most recent call last):
  File "/opt/tritonserver/backends/python/startup.py", line 360, in <module>
    python_host = PythonHost(module_path=FLAGS.model_path)
  File "/opt/tritonserver/backends/python/startup.py", line 161, in __init__
    spec.loader.exec_module(module)
  File "<frozen importlib._bootstrap_external>", line 678, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/workspace/models/torch-simple/1/model.py", line 7, in <module>
    import torch
  File "/usr/local/lib/python3.6/dist-packages/torch/__init__.py", line 190, in <module>
    from torch._C import *
ImportError: /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so: undefined symbol: _ZTIN5torch11distributed3rpc8RpcAgentE
I0121 12:12:50.169332 575 python.cc:1017] TRITONBACKEND_ModelInstanceFinalize: delete instance state
I0121 12:12:50.170002 575 python.cc:418] GRPC shutdown complete
I0121 12:12:50.170043 575 python.cc:672] TRITONBACKEND_ModelFinalize: delete model state
I0121 12:12:50.170064 575 triton_backend_manager.cc:130] unloading backend 'python'
I0121 12:12:50.170078 575 python.cc:627] TRITONBACKEND_Finalize: Start
I0121 12:12:50.170084 575 python.cc:632] TRITONBACKEND_Finalize: End
E0121 12:12:50.170498 575 model_repository_manager.cc:986] failed to load 'torch-simple' version 1: Internal: failed to connect to all addresses
I0121 12:12:50.170522 575 model_repository_manager.cc:753] TriggerNextAction() 'torch-simple' version 1: 0
I0121 12:12:50.170531 575 model_repository_manager.cc:768] no next action, trigger OnComplete()
I0121 12:12:50.170642 575 model_repository_manager.cc:475] VersionStates() 'torch-simple'
I0121 12:12:50.170926 575 tritonserver.cc:1620] 
+----------------------------------+------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                |
+----------------------------------+------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                               |
| server_version                   | 2.5.0                                                                                                |
| server_extensions                | classification sequence model_repository schedule_policy model_configuration system_shared_memory cu |
|                                  | da_shared_memory binary_tensor_data statistics                                                       |
| model_repository_path[0]         | /workspace/models                                                                                    |
| model_control_mode               | MODE_EXPLICIT                                                                                        |
| startup_models_0                 | torch-simple                                                                                         |
| strict_model_config              | 1                                                                                                    |
| pinned_memory_pool_byte_size     | 268435456                                                                                            |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                                             |
| min_supported_compute_capability | 6.0                                                                                                  |
| strict_readiness                 | 1                                                                                                    |
| exit_timeout                     | 30                                                                                                   |
+----------------------------------+------------------------------------------------------------------------------------------------------+

I0121 12:12:50.170988 575 server.cc:277] No server context available. Exiting immediately.
error: creating server: Invalid argument - load failed for model 'torch-simple': version 1: Internal: failed to connect to all addresses;

Building PyTorch from source at commit https://github.com/pytorch/pytorch/commit/17f8c32 (the commit mentioned in the support matrix) fails with the same error.

Removing the import torch and import torchvision statements makes the model load successfully.

Installing PyTorch 1.6.0, as in these tests, also makes the model load successfully.
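
To narrow down whether an import fails only when the model is loaded by Triton or also in a plain interpreter, one option is to try the import in a fresh Python process. This is a minimal sketch under that assumption; `can_import` is a hypothetical helper, not part of Triton or PyTorch.

```python
import subprocess
import sys


def can_import(module_name, python=sys.executable):
    """Try `import <module_name>` in a separate interpreter.

    Returns (succeeded, stderr_text) so the caller can inspect the traceback.
    """
    result = subprocess.run(
        [python, "-c", "import {}".format(module_name)],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode output as text (works on Python 3.6)
    )
    return result.returncode == 0, result.stderr.strip()
```

Calling `can_import("torch")` and `can_import("torchvision")` inside the container, in the same shell that starts `tritonserver`, shows whether the ImportError is reproducible outside the Python backend's startup script.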

Are there any best practices for installing PyTorch (or TensorFlow) so that they run with the Python backend? This is useful, e.g., for pre-processing images before sending them to the inference model.

Expected behavior

Successful model loading with PyTorch > 1.6.0
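
For scripting the comparison against 1.6.0, a version string such as `1.7.1+cpu` can be reduced to its numeric release part before comparing. The sketch below is an assumption-laden illustration: `release_tuple` and `newer_than` are hypothetical helpers, and suffixes such as `+cpu` are simply ignored.

```python
import re


def release_tuple(version):
    """Turn '1.7.1+cpu' into (1, 7, 1); a missing patch number counts as 0."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError("unrecognized version string: %r" % version)
    return tuple(int(group or 0) for group in match.groups())


def newer_than(version, baseline="1.6.0"):
    """True if `version` has a higher release number than `baseline`."""
    return release_tuple(version) > release_tuple(baseline)


# The version installed in the reproduction steps versus the working one.
print(newer_than("1.7.1+cpu"), newer_than("1.6.0"))
```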

Top GitHub Comments

Tabrizian commented, Jan 22, 2021

@phuotran I believe this issue is resolved in the 20.12 release. Feel free to reopen this issue if you still see this in 20.12.

/cc @CoderHam

yihui-he commented, Jun 11, 2021

@ukemamaster Hi, glad to hear that you solved the problem. Could you share more details? I tried installing PyTorch while building a Docker image from a Dockerfile, and still got exactly the same error.
