Crash when using multi_gpu_model and n_sample is not a multiple of batch_size
See original GitHub issue.
I had this error when trying to fit a multi_gpu_model that fits just fine on a single GPU:
F tensorflow/stream_executor/cuda/cuda_dnn.cc:522] Check failed: cudnnSetTensorNdDescriptor(handle_.get(), elem_type, nd, dims.data(), strides.data()) == CUDNN_STATUS_SUCCESS (3 vs. 0)batch_descriptor: {count: 0 feature_map_count: 16 spatial: 128 128 128 value_min: 0.000000 value_max: 0.000000 layout: BatchDepthYX}
After some investigation (because I was trying to file a decent bug report), it turns out this happens when I try to fit the model using a batch_size that is not a divisor of the number of samples in my dataset.
I originally asked for help on SO here, where you can read more details. Maybe this cannot be changed, but my suggestion would be to have a more intelligible error message for a regular human like me. ;o)
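For anyone who lands here: a minimal sketch of one possible workaround (not from the original report; it assumes the training data are plain NumPy arrays or lists named x_train and y_train) is to drop the trailing samples so the dataset length is an exact multiple of batch_size, so that no GPU replica ever receives an empty sub-batch:

```python
def trim_to_batch_multiple(x, y, batch_size):
    """Drop the trailing samples so that len(x) is an exact multiple of batch_size."""
    n_usable = (len(x) // batch_size) * batch_size
    return x[:n_usable], y[:n_usable]

# With 1000 samples and batch_size=32 this keeps 992 samples (31 full batches)
# instead of letting the 8-sample remainder be split across the GPUs.
#
# Hypothetical usage (x_train, y_train and model are placeholders):
# from keras.utils import multi_gpu_model
# parallel_model = multi_gpu_model(model, gpus=4)
# parallel_model.compile(loss='mse', optimizer='adam')
# x_train, y_train = trim_to_batch_multiple(x_train, y_train, batch_size=32)
# parallel_model.fit(x_train, y_train, batch_size=32, epochs=10)
```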
I think it’s worth noting, since this issue pops up first when googling this error: it also happens in the general case when one passes too small a spatial input, presumably to pooling layers.
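A quick way to see when that happens (my own sketch, not something from Keras itself): with 'valid' padding, every pooling or strided convolution layer shrinks a spatial dimension to floor((size - window) / stride) + 1, so a small input can collapse to zero after only a few layers and then surface at run time as a batch_descriptor with a 0 in it:

```python
def valid_output_size(size, window, stride):
    """Spatial size after a 'valid'-padded pooling or convolution layer."""
    return (size - window) // stride + 1

# Hypothetical example: a 5-pixel-wide input going through four
# MaxPooling layers with pool_size=2 and stride=2.
size = 5
for layer in range(1, 5):
    size = valid_output_size(size, window=2, stride=2)
    print("after pooling layer", layer, "->", size)
# Prints 2, 1, 0, 0 -- the dimension collapses to zero at the third layer,
# which is the kind of shape that shows up as "spatial: 0 ..." in the cuDNN error.
```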
I am also having this problem, if anyone can help. When my code reaches train_on_batch(X, Y) I get this error:
2019-03-17 19:16:58.883468: F tensorflow/stream_executor/cuda/cuda_dnn.cc:542] Check failed: cudnnSetTensorNdDescriptor(handle_.get(), elem_type, nd, dims.data(), strides.data()) == CUDNN_STATUS_SUCCESS (3 vs. 0)batch_descriptor: {count: 2 feature_map_count: 76 spatial: 0 120 value_min: 0.000000 value_max: 0.000000 layout: BatchDepthYX}
Then my code crashes.
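Since the batch_descriptor above shows spatial: 0 for one dimension, this looks like the too-small-input case mentioned earlier rather than the batch_size issue. One way to check (a sketch, assuming the standalone Keras 2.x API from that era; the stand-in model below is just a placeholder for your own) is to print every layer's output shape before calling train_on_batch:

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D

# Tiny stand-in model (replace with your own); the loop below is the useful part.
model = Sequential([
    Conv2D(8, 3, input_shape=(16, 120, 1)),
    MaxPooling2D(2),
    MaxPooling2D(2),
    MaxPooling2D(2),
])

# Walk the layers and print their output shapes; a 0 or unexpectedly tiny
# spatial dimension points at the layer that collapses it.
for layer in model.layers:
    print(layer.name, layer.output_shape)

# model.summary() shows the same information in tabular form.
```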