
Segmentation fault when using 4 GPUs for training


Specs:

Python version: 3.6.8
PyTorch version: 1.4.0
GPUs: 4x V100
CUDA version: 10.1
NVIDIA driver version: 418.87.00

I added the following line to dlrm_s_pytorch.py so that Python dumps a per-thread traceback when it receives a fatal signal such as SIGSEGV:

import faulthandler; faulthandler.enable()

and used the following command to run the code:

python3 -X faulthandler dlrm_s_pytorch.py --arch-sparse-feature-size=16 --arch-mlp-bot="13-512-256-64-16" --arch-mlp-top="512-256-1" --data-generation=dataset --data-set=kaggle --raw-data-file=/path-to-data --processed-data-file=/path-to-npz-file --loss-function=bce --round-targets=True --learning-rate=0.1 --mini-batch-size=64 --test-freq 0 --print-freq=1024 --print-time --use-gpu

It runs for some number of iterations (the count varies from run to run) and then fails with a segmentation fault. Here is a sample output:

Using 4 GPU(s)...
Reading pre-processed data=/users/ushmal/kaggleAdDisplayChallenge_processed.npz
Sparse features= 26, Dense features= 13
Reading pre-processed data=/users/ushmal/kaggleAdDisplayChallenge_processed.npz
Sparse features= 26, Dense features= 13
time/loss/accuracy (if enabled):
Finished training it 1024/613937 of epoch 0, 51.35 ms/it, loss 0.520202, accuracy 75.478 %
Finished training it 2048/613937 of epoch 0, 28.98 ms/it, loss 0.506464, accuracy 76.196 %
Finished training it 3072/613937 of epoch 0, 29.48 ms/it, loss 0.505029, accuracy 76.314 %
Finished training it 4096/613937 of epoch 0, 30.34 ms/it, loss 0.494111, accuracy 76.935 %
Finished training it 5120/613937 of epoch 0, 30.36 ms/it, loss 0.496054, accuracy 76.781 %
Finished training it 6144/613937 of epoch 0, 30.44 ms/it, loss 0.487835, accuracy 77.235 %
Finished training it 7168/613937 of epoch 0, 30.65 ms/it, loss 0.486214, accuracy 77.292 %
Fatal Python error: Segmentation fault

Thread 0x00007f64c1a25700 (most recent call first):

Thread 0x00007f64c2226700 (most recent call first):

Current thread 0x00007f64c2a27700 (most recent call first):

Thread 0x00007f64c3a29700 (most recent call first):
  File "/users/ushmal/.local/lib/python3.6/site-packages/torch/cuda/comm.py", line 165 in gather
  File "/users/ushmal/.local/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 68 in forward
  File "/users/ushmal/.local/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 101 in backward
  File "/users/ushmal/.local/lib/python3.6/site-packages/torch/autograd/function.py", line 77 in apply

Thread 0x00007f64c3228700 (most recent call first):

Thread 0x00007f65f70b2740 (most recent call first):
  File "/users/ushmal/.local/lib/python3.6/site-packages/torch/autograd/__init__.py", line 99 in backward
  File "/users/ushmal/.local/lib/python3.6/site-packages/torch/tensor.py", line 195 in backward
  File "dlrm_s_pytorch.py", line 814 in <module>
Segmentation fault (core dumped)

With 2 GPUs or a single GPU, the segmentation fault does not occur even after 100,000 iterations. Thanks!
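
One way to confirm that the crash depends on the number of visible devices, without changing the training code, is to hide all but two GPUs from the process before CUDA is initialized. This is only a sketch built around the standard CUDA_VISIBLE_DEVICES environment variable; placing it at the top of dlrm_s_pytorch.py is an assumption, not something from the issue.

# Sketch: restrict the process to 2 of the 4 GPUs so that only 2 model
# replicas are created. Set the mask before torch initializes CUDA.
import os
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0,1")

import torch
print(torch.cuda.device_count())  # should now report 2

The same mask can be set on the command line instead, e.g. by prefixing the run command above with CUDA_VISIBLE_DEVICES=0,1.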

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 12 (5 by maintainers)

Top GitHub Comments

1 reaction
Adamits commented, May 22, 2020

Is this fix in the PyTorch 1.4.0 PyPI distribution? I am getting pretty much the same backtrace when running backward() on 4 GPUs with DataParallel. Is there a build I should be using? (I am having an unrelated issue training my model with PyTorch 1.5.0.)
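
For context, here is a minimal sketch (a toy model, not code from the issue or from DLRM) of the DataParallel-plus-backward() pattern this comment describes; its backward pass goes through the same Gather autograd function (torch/nn/parallel/_functions.py) that appears in the traceback above.

import torch
import torch.nn as nn

# Toy stand-in for the real model, replicated across 4 GPUs by DataParallel.
model = nn.DataParallel(nn.Linear(16, 1).cuda(), device_ids=[0, 1, 2, 3])
loss_fn = nn.MSELoss()

x = torch.randn(64, 16, device="cuda")
y = torch.randn(64, 1, device="cuda")

# forward() scatters the batch across the replicas and gathers the outputs;
# backward() then runs the corresponding Gather/Scatter autograd functions.
loss = loss_fn(model(x), y)
loss.backward()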

1 reaction
thumbe3 commented, Jan 16, 2020

Thanks for solving the issue! I currently don't have the resources to rerun the experiment. I am closing the issue since it seems to be fixed.
