Segmentation fault when using 4 GPUs for training
Specs:
Python version: 3.6.8
PyTorch version: 1.4.0
4 V100 GPUs
CUDA version: 10.1
Nvidia Driver Version: 418.87.00
I added the following line in dlrm_s_pytorch.py:
import faulthandler; faulthandler.enable()
and used the following command to run the code
python3 -X faulthandler dlrm_s_pytorch.py --arch-sparse-feature-size=16 --arch-mlp-bot="13-512-256-64-16" --arch-mlp-top="512-256-1" --data-generation=dataset --data-set=kaggle --raw-data-file=/path-to-data --processed-data-file=/path-to-npz-file --loss-function=bce --round-targets=True --learning-rate=0.1 --mini-batch-size=64 --test-freq 0 --print-freq=1024 --print-time --use-gpu
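For reference, the crash traceback can also be written to a file so it survives even if stderr output is lost. A minimal sketch (the log file name is hypothetical), placed near the top of dlrm_s_pytorch.py:
import faulthandler
# Hypothetical log file; any writable path works. all_threads=True (the default)
# also dumps the other threads, which helps with multi-GPU / DataParallel crashes.
fault_log = open("fault_traceback.log", "w")
faulthandler.enable(file=fault_log, all_threads=True)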
It runs for some iterations (the number varies between runs) and then fails with a segmentation fault. Here is a sample output:
Using 4 GPU(s)...
Reading pre-processed data=/users/ushmal/kaggleAdDisplayChallenge_processed.npz
Sparse features= 26, Dense features= 13
Reading pre-processed data=/users/ushmal/kaggleAdDisplayChallenge_processed.npz
Sparse features= 26, Dense features= 13
time/loss/accuracy (if enabled):
Finished training it 1024/613937 of epoch 0, 51.35 ms/it, loss 0.520202, accuracy 75.478 %
Finished training it 2048/613937 of epoch 0, 28.98 ms/it, loss 0.506464, accuracy 76.196 %
Finished training it 3072/613937 of epoch 0, 29.48 ms/it, loss 0.505029, accuracy 76.314 %
Finished training it 4096/613937 of epoch 0, 30.34 ms/it, loss 0.494111, accuracy 76.935 %
Finished training it 5120/613937 of epoch 0, 30.36 ms/it, loss 0.496054, accuracy 76.781 %
Finished training it 6144/613937 of epoch 0, 30.44 ms/it, loss 0.487835, accuracy 77.235 %
Finished training it 7168/613937 of epoch 0, 30.65 ms/it, loss 0.486214, accuracy 77.292 %
Fatal Python error: Segmentation fault
Thread 0x00007f64c1a25700 (most recent call first):
Thread 0x00007f64c2226700 (most recent call first):
Current thread 0x00007f64c2a27700 (most recent call first):
Thread 0x00007f64c3a29700 (most recent call first):
File "/users/ushmal/.local/lib/python3.6/site-packages/torch/cuda/comm.py", line 165 in gather
File "/users/ushmal/.local/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 68 in forward
File "/users/ushmal/.local/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 101 in backward
File "/users/ushmal/.local/lib/python3.6/site-packages/torch/autograd/function.py", line 77 in apply
Thread 0x00007f64c3228700 (most recent call first):
Thread 0x00007f65f70b2740 (most recent call first):
File "/users/ushmal/.local/lib/python3.6/site-packages/torch/autograd/__init__.py", line 99 in backward
File "/users/ushmal/.local/lib/python3.6/site-packages/torch/tensor.py", line 195 in backward
File "dlrm_s_pytorch.py", line 814 in <module>
Segmentation fault (core dumped)
When using 2 GPUs or a single GPU, the segmentation fault does not arise even after 100,000 iterations. Thanks!
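Since the crash only shows up with 4 GPUs, a possible interim workaround (a sketch; the device indices 0 and 1 are just an example) is to pin the run to two devices with the standard CUDA_VISIBLE_DEVICES environment variable, keeping the rest of the command unchanged:
CUDA_VISIBLE_DEVICES=0,1 python3 -X faulthandler dlrm_s_pytorch.py --arch-sparse-feature-size=16 --arch-mlp-bot="13-512-256-64-16" --arch-mlp-top="512-256-1" --data-generation=dataset --data-set=kaggle --raw-data-file=/path-to-data --processed-data-file=/path-to-npz-file --loss-function=bce --round-targets=True --learning-rate=0.1 --mini-batch-size=64 --test-freq 0 --print-freq=1024 --print-time --use-gpu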
Top GitHub Comments
Is this fix in the PyTorch 1.4.0 PyPI distribution? I am getting pretty much the same backtrace when running backward() on 4 GPUs with DataParallel. Is there a build that I should be using? I am having an unrelated issue training my model with PyTorch 1.5.0.

Thanks for solving the issue! I currently don’t have the resources to rerun the experiment, so I am closing the issue since it appears to be fixed.
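For anyone trying to narrow this down outside of DLRM: the traceback above ends in torch.cuda.comm.gather inside DataParallel’s backward pass. A minimal sketch of that multi-GPU pattern (a toy model and loop, not the DLRM code; the sizes are arbitrary):
import torch
import torch.nn as nn

# Toy model; the layer sizes are arbitrary and only serve to exercise
# DataParallel's scatter/gather machinery across the visible GPUs.
model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)  # replicate across all visible GPUs
model = model.cuda()

criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for step in range(10000):
    x = torch.randn(64, 16, device="cuda")
    y = torch.randint(0, 2, (64, 1), device="cuda").float()
    out = model(x)                  # forward scatters the batch and gathers the outputs
    loss = criterion(out, y)
    optimizer.zero_grad()
    loss.backward()                 # gradients flow back through the scatter/gather ops
    optimizer.step()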