Segmentation fault when using 4 GPUs for training
Specs:
Python version: 3.6.8
PyTorch version: 1.4.0
4 V100 GPUs
CUDA version: 10.1
Nvidia Driver Version: 418.87.00
I added the following line in dlrm_s_pytorch.py:
import faulthandler; faulthandler.enable()
and used the following command to run the code
python3 -X faulthandler dlrm_s_pytorch.py --arch-sparse-feature-size=16 --arch-mlp-bot="13-512-256-64-16" --arch-mlp-top="512-256-1" --data-generation=dataset --data-set=kaggle --raw-data-file=/path-to-data --processed-data-file=/path-to-npz-file --loss-function=bce --round-targets=True --learning-rate=0.1 --mini-batch-size=64 --test-freq 0 --print-freq=1024 --print-time --use-gpu
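For reference, the crash traceback can also be written to a file so it survives even if stderr output is lost. A minimal sketch (the log file name is hypothetical), placed near the top of dlrm_s_pytorch.py:
import faulthandler
# Hypothetical log file; any writable path works. all_threads=True (the default)
# also dumps the other threads, which helps with multi-GPU / DataParallel crashes.
fault_log = open("fault_traceback.log", "w")
faulthandler.enable(file=fault_log, all_threads=True)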
It runs for some iterations (the number varies between runs) and then fails with a segmentation fault. Here is a sample output:
Using 4 GPU(s)...
Reading pre-processed data=/users/ushmal/kaggleAdDisplayChallenge_processed.npz
Sparse features= 26, Dense features= 13
Reading pre-processed data=/users/ushmal/kaggleAdDisplayChallenge_processed.npz
Sparse features= 26, Dense features= 13
time/loss/accuracy (if enabled):
Finished training it 1024/613937 of epoch 0, 51.35 ms/it, loss 0.520202, accuracy 75.478 %
Finished training it 2048/613937 of epoch 0, 28.98 ms/it, loss 0.506464, accuracy 76.196 %
Finished training it 3072/613937 of epoch 0, 29.48 ms/it, loss 0.505029, accuracy 76.314 %
Finished training it 4096/613937 of epoch 0, 30.34 ms/it, loss 0.494111, accuracy 76.935 %
Finished training it 5120/613937 of epoch 0, 30.36 ms/it, loss 0.496054, accuracy 76.781 %
Finished training it 6144/613937 of epoch 0, 30.44 ms/it, loss 0.487835, accuracy 77.235 %
Finished training it 7168/613937 of epoch 0, 30.65 ms/it, loss 0.486214, accuracy 77.292 %
Fatal Python error: Segmentation fault
Thread 0x00007f64c1a25700 (most recent call first):
Thread 0x00007f64c2226700 (most recent call first):
Current thread 0x00007f64c2a27700 (most recent call first):
Thread 0x00007f64c3a29700 (most recent call first):
File "/users/ushmal/.local/lib/python3.6/site-packages/torch/cuda/comm.py", line 165 in gather
File "/users/ushmal/.local/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 68 in forward
File "/users/ushmal/.local/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 101 in backward
File "/users/ushmal/.local/lib/python3.6/site-packages/torch/autograd/function.py", line 77 in apply
Thread 0x00007f64c3228700 (most recent call first):
Thread 0x00007f65f70b2740 (most recent call first):
File "/users/ushmal/.local/lib/python3.6/site-packages/torch/autograd/__init__.py", line 99 in backward
File "/users/ushmal/.local/lib/python3.6/site-packages/torch/tensor.py", line 195 in backward
File "dlrm_s_pytorch.py", line 814 in <module>
Segmentation fault (core dumped)
When using 2 GPUs or a single GPU, the segmentation fault does not arise even after 100,000 iterations. Thanks!
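Since the crash only shows up with 4 GPUs, a possible interim workaround (a sketch; the device indices 0 and 1 are just an example) is to pin the run to two devices with the standard CUDA_VISIBLE_DEVICES environment variable, keeping the rest of the command unchanged:
CUDA_VISIBLE_DEVICES=0,1 python3 -X faulthandler dlrm_s_pytorch.py --arch-sparse-feature-size=16 --arch-mlp-bot="13-512-256-64-16" --arch-mlp-top="512-256-1" --data-generation=dataset --data-set=kaggle --raw-data-file=/path-to-data --processed-data-file=/path-to-npz-file --loss-function=bce --round-targets=True --learning-rate=0.1 --mini-batch-size=64 --test-freq 0 --print-freq=1024 --print-time --use-gpu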
Top GitHub Comments
Is this fix in the PyTorch 1.4.0 PyPI distribution? I am getting pretty much the same backtrace when running backward() on 4 GPUs with DataParallel. Is there a build that I should be using? I am having an unrelated issue training my model with PyTorch 1.5.0.

Thanks for solving the issue! I currently don’t have the resources to rerun the experiment, so I am closing the issue since it appears to be fixed.
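For anyone trying to narrow this down outside of DLRM: the traceback above ends in torch.cuda.comm.gather inside DataParallel’s backward pass. A minimal sketch of that multi-GPU pattern (a toy model and loop, not the DLRM code; the sizes are arbitrary):
import torch
import torch.nn as nn

# Toy model; the layer sizes are arbitrary and only serve to exercise
# DataParallel's scatter/gather machinery across the visible GPUs.
model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)  # replicate across all visible GPUs
model = model.cuda()

criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for step in range(10000):
    x = torch.randn(64, 16, device="cuda")
    y = torch.randint(0, 2, (64, 1), device="cuda").float()
    out = model(x)                  # forward scatters the batch and gathers the outputs
    loss = criterion(out, y)
    optimizer.zero_grad()
    loss.backward()                 # gradients flow back through the scatter/gather ops
    optimizer.step()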