Crash when using multi_gpu_model and n_sample is not a multiple of batch_size
See original GitHub issue.
I had this error when trying to fit a multi_gpu_model that fits just fine on a single GPU:
F tensorflow/stream_executor/cuda/cuda_dnn.cc:522] Check failed: cudnnSetTensorNdDescriptor(handle_.get(), elem_type, nd, dims.data(), strides.data()) == CUDNN_STATUS_SUCCESS (3 vs. 0)batch_descriptor: {count: 0 feature_map_count: 16 spatial: 128 128 128 value_min: 0.000000 value_max: 0.000000 layout: BatchDepthYX}
After some investigation (because I was trying to file a decent bug report), it turns out this happens when I try to fit the model using a batch_size that is not a divisor of the number of samples in my dataset.
I originally asked for help on SO here, where you can read more details. Maybe this cannot be changed, but my suggestion would be to have a more intelligible error message for a regular human like me. ;o)
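For anyone who lands here: a minimal sketch of one possible workaround (not from the original report; it assumes the training data are plain NumPy arrays or lists named x_train and y_train) is to drop the trailing samples so the dataset length is an exact multiple of batch_size, so that no GPU replica ever receives an empty sub-batch:

```python
def trim_to_batch_multiple(x, y, batch_size):
    """Drop the trailing samples so that len(x) is an exact multiple of batch_size."""
    n_usable = (len(x) // batch_size) * batch_size
    return x[:n_usable], y[:n_usable]

# With 1000 samples and batch_size=32 this keeps 992 samples (31 full batches)
# instead of letting the 8-sample remainder be split across the GPUs.
#
# Hypothetical usage (x_train, y_train and model are placeholders):
# from keras.utils import multi_gpu_model
# parallel_model = multi_gpu_model(model, gpus=4)
# parallel_model.compile(loss='mse', optimizer='adam')
# x_train, y_train = trim_to_batch_multiple(x_train, y_train, batch_size=32)
# parallel_model.fit(x_train, y_train, batch_size=32, epochs=10)
```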
I think it’s worth noting, since this issue pops up first when googling this error: it also happens in the general case when one passes too small a spatial input, presumably to pooling layers.
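A quick way to see when that happens (my own sketch, not something from Keras itself): with 'valid' padding, every pooling or strided convolution layer shrinks a spatial dimension to floor((size - window) / stride) + 1, so a small input can collapse to zero after only a few layers and then surface at run time as a batch_descriptor with a 0 in it:

```python
def valid_output_size(size, window, stride):
    """Spatial size after a 'valid'-padded pooling or convolution layer."""
    return (size - window) // stride + 1

# Hypothetical example: a 5-pixel-wide input going through four
# MaxPooling layers with pool_size=2 and stride=2.
size = 5
for layer in range(1, 5):
    size = valid_output_size(size, window=2, stride=2)
    print("after pooling layer", layer, "->", size)
# Prints 2, 1, 0, 0 -- the dimension collapses to zero at the third layer,
# which is the kind of shape that shows up as "spatial: 0 ..." in the cuDNN error.
```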
I am also having this problem, if anyone can help. When my code reaches train_on_batch(X, Y) I get this error:
2019-03-17 19:16:58.883468: F tensorflow/stream_executor/cuda/cuda_dnn.cc:542] Check failed: cudnnSetTensorNdDescriptor(handle_.get(), elem_type, nd, dims.data(), strides.data()) == CUDNN_STATUS_SUCCESS (3 vs. 0)batch_descriptor: {count: 2 feature_map_count: 76 spatial: 0 120 value_min: 0.000000 value_max: 0.000000 layout: BatchDepthYX}
Then my code crashes.
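Since the batch_descriptor above shows spatial: 0 for one dimension, this looks like the too-small-input case mentioned earlier rather than the batch_size issue. One way to check (a sketch, assuming the standalone Keras 2.x API from that era; the stand-in model below is just a placeholder for your own) is to print every layer's output shape before calling train_on_batch:

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D

# Tiny stand-in model (replace with your own); the loop below is the useful part.
model = Sequential([
    Conv2D(8, 3, input_shape=(16, 120, 1)),
    MaxPooling2D(2),
    MaxPooling2D(2),
    MaxPooling2D(2),
])

# Walk the layers and print their output shapes; a 0 or unexpectedly tiny
# spatial dimension points at the layer that collapses it.
for layer in model.layers:
    print(layer.name, layer.output_shape)

# model.summary() shows the same information in tabular form.
```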