Segmentation fault during training (Volta, others)
Training on TED as extracted from `python ted.py ...`, on an AWS p3.2xlarge instance with CUDA 9.0, cuDNN 7.0.3, Ubuntu 16.04, and Python 3.5.4, results in `Segmentation fault (core dumped)` at some point during the first epoch (usually around 70-80% of the way through the batches), seemingly regardless of batch size (tried 32, 26, 12, and 4; also tried a p3.8xlarge with batch size 20). Worth mentioning: I did not install MAGMA as per the PyTorch conda installation instructions:
```
# Add LAPACK support for the GPU
conda install -c soumith magma-cuda80 # or magma-cuda75 if CUDA 7.5
```
as it seems that the versions mentioned there are incompatible with CUDA 9.0.
Edit: last output from dmesg:
```
[14531.790543] python[2191]: segfault at 100324c2400 ip 00007f165177a04a sp 00007f15c1c28c98 error 4 in libcuda.so.384.90[7f16515b2000+b1f000]
```
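(For reference, the CUDA and cuDNN versions a PyTorch build is actually using can be confirmed from Python; this snippet is illustrative and not part of the original report.)

```python
# Illustrative sanity check (not from the original report): confirm which
# CUDA and cuDNN versions the installed PyTorch build is actually using.
import torch

print("PyTorch:", torch.__version__)
print("CUDA runtime:", torch.version.cuda)        # expected "9.0" here
print("cuDNN:", torch.backends.cudnn.version())   # e.g. 7003 for cuDNN 7.0.3
print("GPU:", torch.cuda.get_device_name(0))      # Tesla V100 on a p3.2xlarge
```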
Top GitHub Comments
For the first epoch, the batches are sampled in increasing sequence-length order, so you progressively need more and more memory. What happens is that before the failing cuDNN call there is very little memory left on the GPU (about 300 KB in my tests); you can confirm this by printing the free device memory right before the cudnnRNNForwardTraining call. CUDA requires some free memory for its own internal operation (stack space for kernels, device memory for events, etc.), and when so little memory is left, something is not handled correctly. (FWIW, the same thing can happen on the CPU when you are pushing the memory limits: an allocation can be reported as successful, and when you try to use it, you hang or segfault.) In my output for the code modified as above, run on a single GPU with a batch size of 64, you can ignore the "reserve size" lines; I was only checking that the allocations for reserve and workspace look OK. So, there are a few things to blame for this outcome.
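The check described above was added in PyTorch's C++ cuDNN backend. A rough Python-side equivalent, assuming a recent PyTorch that exposes `torch.cuda.mem_get_info()` (a wrapper around cudaMemGetInfo), would look like this:

```python
# Rough Python-side analogue of the check described above; the original check
# was added in the C++ backend right before the cudnnRNNForwardTraining call.
# torch.cuda.mem_get_info() wraps cudaMemGetInfo and requires a reasonably
# recent PyTorch build.
import torch

def log_free_gpu_memory(tag=""):
    free_bytes, total_bytes = torch.cuda.mem_get_info()
    print(f"{tag} free: {free_bytes / 2**20:.1f} MiB / {total_bytes / 2**20:.1f} MiB")

# Illustrative usage inside a training loop (lstm, batch, hidden are placeholders):
# log_free_gpu_memory(f"before batch {i}")
# output, hidden = lstm(batch, hidden)
```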
Emptying the memory allocator's cache was recently exposed in upstream PyTorch; doing that, or collecting garbage before each iteration, may let you get a bit further. So may using a smaller batch size or a smaller LSTM hidden state: in short, all the standard things you would do to try to fit your problem in memory.
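A minimal sketch of those workarounds, assuming a PyTorch build that exposes `torch.cuda.empty_cache()` (the model, batches, optimizer, and criterion names below are placeholders, not taken from the reporter's `ted.py`):

```python
# Minimal sketch of the suggested workarounds, assuming a PyTorch build that
# exposes torch.cuda.empty_cache(); model/batches/optimizer/criterion are
# placeholders, not names from the original training script.
import gc
import torch

def train_one_epoch(model, batches, optimizer, criterion):
    model.train()
    for inputs, targets in batches:
        # Drop dead Python references and return cached allocator blocks
        # before each iteration, so the subsequent cuDNN call sees as much
        # free device memory as possible.
        gc.collect()
        torch.cuda.empty_cache()

        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
```

Note that emptying the cache on every iteration trades throughput for headroom, since the caching allocator exists precisely to avoid repeated cudaMalloc/cudaFree calls; treat it as a diagnostic or stopgap rather than a fix.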
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.