Segmentation fault during training (Volta, others)
Training on TED as extracted from `python ted.py ...`, on an AWS p3.2xlarge instance with CUDA 9.0, cuDNN 7.0.3, Ubuntu 16.04, and Python 3.5.4, results in `Segmentation fault (core dumped)` at some point during the first epoch (usually around 70-80% of the way through the batches), seemingly regardless of batch size (tried 32, 26, 12, and 4; also tried a p3.8xlarge with batch size 20). Worth mentioning: I did not install MAGMA as per the PyTorch conda installation instructions:
```
# Add LAPACK support for the GPU
conda install -c soumith magma-cuda80 # or magma-cuda75 if CUDA 7.5
```
as it seems that the versions mentioned there are incompatible with CUDA 9.0.
Edit: last output from dmesg:
```
[14531.790543] python[2191]: segfault at 100324c2400 ip 00007f165177a04a sp 00007f15c1c28c98 error 4 in libcuda.so.384.90[7f16515b2000+b1f000]
```
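(For reference, the CUDA and cuDNN versions a PyTorch build is actually using can be confirmed from Python; this snippet is illustrative and not part of the original report.)

```python
# Illustrative sanity check (not from the original report): confirm which
# CUDA and cuDNN versions the installed PyTorch build is actually using.
import torch

print("PyTorch:", torch.__version__)
print("CUDA runtime:", torch.version.cuda)        # expected "9.0" here
print("cuDNN:", torch.backends.cudnn.version())   # e.g. 7003 for cuDNN 7.0.3
print("GPU:", torch.cuda.get_device_name(0))      # Tesla V100 on a p3.2xlarge
```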
Top GitHub Comments
For the first epoch, the batches are sampled in increasing sequence-length order, so you progressively need more and more memory. What happens is that before the failing cuDNN call there is very little memory left on the GPU (about 300 KB in my tests); you can confirm this by printing the free device memory right before the cudnnRNNForwardTraining call. CUDA requires some free memory for its own internal operation (stack space for kernels, device memory for events, etc.), and when so little memory is left, something is not handled correctly. (FWIW, the same thing can happen on the CPU when you are pushing the memory limits: an allocation can be reported as successful, and when you try to use it, you hang or segfault.) In my output for the code modified as above, run on a single GPU with a batch size of 64, you can ignore the "reserve size" lines; I was only checking that the allocations for reserve and workspace look OK. So, there are a few things to blame for this outcome.
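The check described above was added in PyTorch's C++ cuDNN backend. A rough Python-side equivalent, assuming a recent PyTorch that exposes `torch.cuda.mem_get_info()` (a wrapper around cudaMemGetInfo), would look like this:

```python
# Rough Python-side analogue of the check described above; the original check
# was added in the C++ backend right before the cudnnRNNForwardTraining call.
# torch.cuda.mem_get_info() wraps cudaMemGetInfo and requires a reasonably
# recent PyTorch build.
import torch

def log_free_gpu_memory(tag=""):
    free_bytes, total_bytes = torch.cuda.mem_get_info()
    print(f"{tag} free: {free_bytes / 2**20:.1f} MiB / {total_bytes / 2**20:.1f} MiB")

# Illustrative usage inside a training loop (lstm, batch, hidden are placeholders):
# log_free_gpu_memory(f"before batch {i}")
# output, hidden = lstm(batch, hidden)
```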
Emptying the memory allocator's cache was recently exposed in upstream PyTorch; doing that, or collecting garbage before each iteration, may let you get a bit further. So may using a smaller batch size or a smaller LSTM hidden state: in short, all the standard things you would do to try to fit your problem in memory.
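A minimal sketch of those workarounds, assuming a PyTorch build that exposes `torch.cuda.empty_cache()` (the model, batches, optimizer, and criterion names below are placeholders, not taken from the reporter's `ted.py`):

```python
# Minimal sketch of the suggested workarounds, assuming a PyTorch build that
# exposes torch.cuda.empty_cache(); model/batches/optimizer/criterion are
# placeholders, not names from the original training script.
import gc
import torch

def train_one_epoch(model, batches, optimizer, criterion):
    model.train()
    for inputs, targets in batches:
        # Drop dead Python references and return cached allocator blocks
        # before each iteration, so the subsequent cuDNN call sees as much
        # free device memory as possible.
        gc.collect()
        torch.cuda.empty_cache()

        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
```

Note that emptying the cache on every iteration trades throughput for headroom, since the caching allocator exists precisely to avoid repeated cudaMalloc/cudaFree calls; treat it as a diagnostic or stopgap rather than a fix.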
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.