Status Message: CUDNN error executing cudnnFindConvolutionForwardAlgorithmEx
Description
I am working on a Triton C-API application combined with ROS1 to run inference with a custom YOLOv5 model on ROS1 image topics. I have a working implementation of the same model in gRPC mode, so the model and the config are correct. When I send the normalized image to Triton, it gives me the following error. I tried googling this error but cannot work out what exactly the problem is. Some insights would help a lot.
I0609 12:54:24.923918 21492 libtorch.cc:1381] TRITONBACKEND_Initialize: pytorch
I0609 12:54:24.923949 21492 libtorch.cc:1391] Triton TRITONBACKEND API version: 1.9
I0609 12:54:24.923953 21492 libtorch.cc:1397] 'pytorch' TRITONBACKEND API version: 1.9
2022-06-09 12:54:29.035423: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
I0609 12:54:29.076962 21492 tensorflow.cc:2181] TRITONBACKEND_Initialize: tensorflow
I0609 12:54:29.076984 21492 tensorflow.cc:2191] Triton TRITONBACKEND API version: 1.9
I0609 12:54:29.076989 21492 tensorflow.cc:2197] 'tensorflow' TRITONBACKEND API version: 1.9
I0609 12:54:29.076992 21492 tensorflow.cc:2221] backend configuration:
{}
I0609 12:54:29.186855 21492 onnxruntime.cc:2400] TRITONBACKEND_Initialize: onnxruntime
I0609 12:54:29.186876 21492 onnxruntime.cc:2410] Triton TRITONBACKEND API version: 1.9
I0609 12:54:29.186880 21492 onnxruntime.cc:2416] 'onnxruntime' TRITONBACKEND API version: 1.9
I0609 12:54:29.186884 21492 onnxruntime.cc:2446] backend configuration:
{}
I0609 12:54:29.236464 21492 openvino.cc:1207] TRITONBACKEND_Initialize: openvino
I0609 12:54:29.236483 21492 openvino.cc:1217] Triton TRITONBACKEND API version: 1.9
I0609 12:54:29.236488 21492 openvino.cc:1223] 'openvino' TRITONBACKEND API version: 1.9
I0609 12:54:30.318676 21492 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7fbd36000000' with size 268435456
I0609 12:54:30.319079 21492 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I0609 12:54:30.319094 21492 cuda_memory_manager.cc:105] CUDA memory pool is created on device 1 with size 67108864
W0609 12:54:31.182282 21492 server.cc:206] failed to enable peer access for some device pairs
I0609 12:54:31.184446 21492 model_repository_manager.cc:1077] loading: YOLOv5nCOCO:1
I0609 12:54:31.284776 21492 model_repository_manager.cc:1077] loading: YOLOv5nCROP:1
I0609 12:54:31.284788 21492 onnxruntime.cc:2481] TRITONBACKEND_ModelInitialize: YOLOv5nCOCO (version 1)
I0609 12:54:31.285700 21492 onnxruntime.cc:2524] TRITONBACKEND_ModelInstanceInitialize: YOLOv5nCOCO (GPU device 0)
I0609 12:54:31.386580 21492 model_repository_manager.cc:1077] loading: FCOS_detectron:1
I0609 12:54:32.209536 21492 onnxruntime.cc:2481] TRITONBACKEND_ModelInitialize: YOLOv5nCROP (version 1)
I0609 12:54:32.209978 21492 libtorch.cc:1430] TRITONBACKEND_ModelInitialize: FCOS_detectron (version 1)
I0609 12:54:32.210246 21492 libtorch.cc:293] Optimized execution is enabled for model instance 'FCOS_detectron'
I0609 12:54:32.210254 21492 libtorch.cc:311] Inference Mode is disabled for model instance 'FCOS_detectron'
I0609 12:54:32.210258 21492 libtorch.cc:406] NvFuser is not specified for model instance 'FCOS_detectron'
I0609 12:54:32.210272 21492 onnxruntime.cc:2524] TRITONBACKEND_ModelInstanceInitialize: YOLOv5nCOCO (GPU device 1)
I0609 12:54:33.016676 21492 onnxruntime.cc:2524] TRITONBACKEND_ModelInstanceInitialize: YOLOv5nCROP (GPU device 0)
I0609 12:54:33.017054 21492 model_repository_manager.cc:1231] successfully loaded 'YOLOv5nCOCO' version 1
I0609 12:54:33.093639 21492 libtorch.cc:1474] TRITONBACKEND_ModelInstanceInitialize: FCOS_detectron (GPU device 0)
I0609 12:54:33.450545 21492 onnxruntime.cc:2524] TRITONBACKEND_ModelInstanceInitialize: YOLOv5nCROP (GPU device 1)
I0609 12:54:33.503772 21492 libtorch.cc:1474] TRITONBACKEND_ModelInstanceInitialize: FCOS_detectron (GPU device 1)
I0609 12:54:33.504063 21492 model_repository_manager.cc:1231] successfully loaded 'YOLOv5nCROP' version 1
I0609 12:54:33.851533 21492 model_repository_manager.cc:1231] successfully loaded 'FCOS_detectron' version 1
I0609 12:54:33.851605 21492 server.cc:549]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+
I0609 12:54:33.851659 21492 server.cc:576]
+-------------+-------------------------------------------------------------------------+--------+
| Backend | Path | Config |
+-------------+-------------------------------------------------------------------------+--------+
| pytorch | /opt/tritonserver/backends/pytorch/libtriton_pytorch.so | {} |
| tensorflow | /opt/tritonserver/backends/tensorflow1/libtriton_tensorflow1.so | {} |
| onnxruntime | /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so | {} |
| openvino | /opt/tritonserver/backends/openvino_2021_4/libtriton_openvino_2021_4.so | {} |
+-------------+-------------------------------------------------------------------------+--------+
I0609 12:54:33.851703 21492 server.cc:619]
+----------------+---------+--------+
| Model | Version | Status |
+----------------+---------+--------+
| FCOS_detectron | 1 | READY |
| YOLOv5nCOCO | 1 | READY |
| YOLOv5nCROP | 1 | READY |
+----------------+---------+--------+
I0609 12:54:33.892306 21492 metrics.cc:650] Collecting metrics for GPU 0: NVIDIA GeForce RTX 3070
I0609 12:54:33.892335 21492 metrics.cc:650] Collecting metrics for GPU 1: NVIDIA GeForce RTX 3070
I0609 12:54:33.892721 21492 tritonserver.cc:2123]
+----------------------------------+--------------------------------------------------------------------------+
| Option | Value |
+----------------------------------+--------------------------------------------------------------------------+
| server_id | triton |
| server_version | 2.21.0 |
| server_extensions | classification sequence model_repository model_repository(unload_depende |
| | nts) schedule_policy model_configuration system_shared_memory cuda_share |
| | d_memory binary_tensor_data statistics trace |
| model_repository_path[0] | /opt/model_repo/ |
| model_control_mode | MODE_POLL |
| strict_model_config | 1 |
| rate_limit | OFF |
| pinned_memory_pool_byte_size | 268435456 |
| cuda_memory_pool_byte_size{0} | 67108864 |
| cuda_memory_pool_byte_size{1} | 67108864 |
| response_cache_byte_size | 0 |
| min_supported_compute_capability | 7.5 |
| strict_readiness | 1 |
| exit_timeout | 30 |
+----------------------------------+--------------------------------------------------------------------------+
Server Health: live 1, ready 1
Server Metadata:
{"name":"triton","version":"2.21.0","extensions":["classification","sequence","model_repository","model_repository(unload_dependents)","schedule_policy","model_configuration","system_shared_memory","cuda_shared_memory","binary_tensor_data","statistics","trace"]}
2022-06-09 12:54:35.981968507 [E:onnxruntime:log, cuda_call.cc:118 CudaCall] CUDNN failure 4: CUDNN_STATUS_INTERNAL_ERROR ; GPU=0 ; hostname=agrigaia-ws3-u ; expr=cudnnFindConvolutionForwardAlgorithmEx( s_.handle, s_.x_tensor, s_.x_data, s_.w_desc, s_.w_data, s_.conv_desc, s_.y_tensor, s_.y_data, 1, &algo_count, &perf, algo_search_workspace.get(), max_ws_size);
2022-06-09 12:54:35.981998167 [E:onnxruntime:, sequential_executor.cc:346 Execute] Non-zero status code returned while running Conv node. Name:'Conv_0' Status Message: CUDNN error executing cudnnFindConvolutionForwardAlgorithmEx( s_.handle, s_.x_tensor, s_.x_data, s_.w_desc, s_.w_data, s_.conv_desc, s_.y_tensor, s_.y_data, 1, &algo_count, &perf, algo_search_workspace.get(), max_ws_size)
2022-06-09 12:54:35.982022932 [E:onnxruntime:log, cuda_call.cc:118 CudaCall] CUDA failure 700: an illegal memory access was encountered ; GPU=0 ; hostname=agrigaia-ws3-u ; expr=cudaEventRecord(current_deferred_release_event, static_cast<cudaStream_t>(GetComputeStream()));
error: response status: Internal - onnx runtime error 1: Non-zero status code returned while running Conv node. Name:'Conv_0' Status Message: CUDNN error executing cudnnFindConvolutionForwardAlgorithmEx( s_.handle, s_.x_tensor, s_.x_data, s_.w_desc, s_.w_data, s_.conv_desc, s_.y_tensor, s_.y_data, 1, &algo_count, &perf, algo_search_workspace.get(), max_ws_size)
Triton information
I am working inside a Docker dev container built on the Triton 22.04 base image, with ROS1 Noetic and OpenCV 4.2.0 installed on top of it.
- Base image: nvcr.io/nvidia/tritonserver:22.04-py3
- Ubuntu 20.04
- ROS Noetic
- OpenCV 4.2.0
Model config file:
name: "YOLOv5nCOCO"
platform: "onnxruntime_onnx"
max_batch_size: 0
input [
  {
    name: "images"
    data_type: TYPE_FP32
    format: FORMAT_NCHW
    dims: [ 3, 512, 512 ]
    reshape { shape: [ 1, 3, 512, 512 ] }
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ 1, 16128, 85 ]
  }
]
Steps to reproduce
I have written my main.cpp based on your simple.cc example. Unfortunately, I cannot share the full project. Here are my preprocessing steps, from receiving the ROS image topic to sending the request to Triton:
cv_ptr = cv_bridge::toCvCopy(msg, "rgb8");
// Store the OpenCV-compatible image in the current_frame variable
cv::Mat current_frame = cv_ptr->image;
// Preprocessing of the image
// Normalize the image to [0, 1] and convert it to 32-bit float
current_frame.convertTo(current_frame, CV_32F, 1.0/255.0, 0);
// Resize the image to the model input size (the interpolation flag is the sixth argument)
cv::resize(current_frame, current_frame, cv::Size(512, 512), 0, 0, cv::INTER_LINEAR);
// NCHW channels first
// cv::transpose(current_frame, current_frame);
// Convert Mat to Array/Vector in OpenCV https://stackoverflow.com/a/26685567
std::vector<float> input_data;
if (current_frame.isContinuous()) {
  // Mat::data is a uchar*, so read the CV_32F pixels through a float pointer
  const float* first = current_frame.ptr<float>(0);
  input_data.assign(first, first + current_frame.total() * current_frame.channels());
} else {
  for (int i = 0; i < current_frame.rows; ++i) {
    input_data.insert(input_data.end(), current_frame.ptr<float>(i),
        current_frame.ptr<float>(i) + current_frame.cols * current_frame.channels());
  }
}
auto input = "images";
auto output = "output";
size_t input_size = input_data.size() * sizeof(float);
const TRITONSERVER_DataType input_datatype = TRITONSERVER_TYPE_FP32;
std::vector<int64_t> input_shape({current_frame.channels(), current_frame.rows, current_frame.cols});
const void* input_base = &input_data[0];
// Push data into Triton format
FAIL_IF_ERR(
    TRITONSERVER_InferenceRequestAddInput(
        irequest, input, input_datatype, &input_shape[0], input_shape.size()),
    "setting input meta-data for the request");
FAIL_IF_ERR(
    TRITONSERVER_InferenceRequestAppendInputData(
        irequest, input, input_base, input_size, requested_memory_type,
        0 /* memory_type_id */),
    "assigning INPUT data");
FAIL_IF_ERR(
    TRITONSERVER_InferenceRequestAddRequestedOutput(irequest, output),
    "requesting output for the request");
// Triton connection
auto p = new std::promise<TRITONSERVER_InferenceResponse*>();
std::future<TRITONSERVER_InferenceResponse*> completed = p->get_future();
FAIL_IF_ERR(
    TRITONSERVER_InferenceRequestSetResponseCallback(
        irequest, allocator, nullptr /* response_allocator_userp */,
        InferResponseComplete, reinterpret_cast<void*>(p)),
    "setting response callback");
FAIL_IF_ERR(
    TRITONSERVER_ServerInferAsync(server->get(), irequest, nullptr /* trace */),
    "running inference");
TRITONSERVER_InferenceResponse* completed_response = completed.get();
FAIL_IF_ERR(TRITONSERVER_InferenceResponseError(completed_response), "response status");
//<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< Here is the error
I tried running the same Triton server from its executable, with my model repository, over a gRPC call to check whether the ONNX export is correct. Everything works smoothly via gRPC. Some Google results suggested that this might be an issue specific to the RTX 20 series, but I also reproduced the same error on an RTX 3070. I can narrow it down to something being wrong with my input image data, but the error message does not really specify the problem. Some insights on where exactly to look would help a lot.
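One thing that may be worth checking in the input data, independently of the cuDNN error: the snippet above copies the pixels out of the cv::Mat in OpenCV's interleaved order (height, width, channel), while the request declares the shape as {channels, rows, cols} and the model config declares FORMAT_NCHW. cv::transpose only swaps rows and columns and leaves the channels interleaved, so it would not produce a channels-first buffer. The following is a minimal sketch of the repacking, assuming the model really expects planar RGB; hwc_to_chw is a hypothetical helper name, not part of OpenCV or Triton.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: repack an interleaved HWC float buffer (the order in
// which cv::Mat stores multi-channel pixels) into planar CHW order.
std::vector<float> hwc_to_chw(const std::vector<float>& hwc,
                              std::size_t height, std::size_t width,
                              std::size_t channels) {
  // The buffer must hold exactly one value per (row, column, channel).
  assert(hwc.size() == height * width * channels);

  std::vector<float> chw(hwc.size());
  for (std::size_t y = 0; y < height; ++y) {
    for (std::size_t x = 0; x < width; ++x) {
      for (std::size_t c = 0; c < channels; ++c) {
        // Source: pixel (y, x) with its channels stored next to each other.
        // Destination: plane c, then row y, then column x.
        chw[(c * height + y) * width + x] =
            hwc[(y * width + x) * channels + c];
      }
    }
  }
  return chw;
}
```

With the variables from the snippet above, this would be called as hwc_to_chw(input_data, current_frame.rows, current_frame.cols, current_frame.channels()) before the buffer is handed to TRITONSERVER_InferenceRequestAppendInputData. A wrong layout would normally show up as poor detections rather than as a CUDA failure, so this is a separate check and not an explanation of the error itself.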
Issue Analytics
- Created a year ago
- Comments: 7 (5 by maintainers)
Top GitHub Comments
Not sure what the cause of this issue is. Filing a bug with the team to understand the failure.
@niqbal996 If requested_memory_type is set to GPU for the inputs, you would need to copy the input tensor to GPU to make the error go away. To do so, you can refer to this part of simple.cc to see how you need to modify the script.