
seg fault issue by gpu memory

See original GitHub issue

Hello,

I'm using a single GTX 1070 with 8 GB of GDDR5 and trying to train deepspeech.pytorch on the TEDLIUM corpus. However, several training runs have failed with a seg fault, which I suspect is caused by an out-of-memory (OOM) condition. I've already tried reducing the batch size, but I've now found another parameter, --num_workers. Which parameter is more effective for managing the OOM issue? Could you give me some guidance on this?

```
Epoch: [1][10823/11373] Time 0.504 (0.243)      Data 0.011 (0.021)      Loss 218.2781 (164.8313)
Epoch: [1][10824/11373] Time 0.512 (0.244)      Data 0.011 (0.021)      Loss 244.8923 (164.8387)
Epoch: [1][10825/11373] Time 0.503 (0.244)      Data 0.012 (0.021)      Loss 233.4698 (164.8451)
./train.sh: line 12: 35528 Segmentation fault      (core dumped) python train.py --train_manifest data/ted/ted_train_manifest.csv --val data/ted/ted_val_manifest.csv --sample_rate 8000 --augment --batch_size 8 --epochs 100 --cuda --checkpoint --save_folder models/20170823
```
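
For context, the two flags control different resources: --batch_size sets how large each batch moved to the GPU is, which is what usually drives CUDA out-of-memory failures, while --num_workers only sets how many CPU worker processes the PyTorch DataLoader uses to prepare batches, so it mainly affects host RAM and CPU load. A minimal sketch with placeholder tensors (not the actual deepspeech.pytorch data pipeline) showing which flag maps to which DataLoader argument:

```
# Sketch only: placeholder tensors instead of the real deepspeech.pytorch
# dataset/model, just to show which flag maps to which resource.
import torch
from torch.utils.data import DataLoader, TensorDataset

features = torch.randn(1024, 161)        # stand-in for spectrogram frames
labels = torch.randint(0, 29, (1024,))   # stand-in for transcript targets
dataset = TensorDataset(features, labels)

# --num_workers -> DataLoader(num_workers=...): number of CPU processes that
#   prepare batches; it affects host RAM/CPU, not GPU memory.
# --batch_size  -> DataLoader(batch_size=...): size of each batch moved to the
#   GPU; this is what usually causes CUDA out-of-memory failures.
loader = DataLoader(dataset, batch_size=8, num_workers=4, pin_memory=True)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(161, 29).to(device)

for x, y in loader:
    x = x.to(device, non_blocking=True)  # per-step GPU allocation grows with batch_size
    out = model(x)
    break
```

On an 8 GB card, lowering --batch_size is the change most likely to resolve a GPU OOM; lowering --num_workers mostly reduces host-side memory pressure.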

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

1 reaction
dlmacedo commented, Aug 26, 2017

If you enable warnings in PyTorch 0.2.0, you get something like this:

/home/dlm/code/deepspeech.pytorch/model.py:63: UserWarning: RNN module weights are not part of single contiguous chunk of memory. This means they need to be compacted at every call, possibly greatly increasing memory usage. To compact weights again call flatten_parameters().

The warning indicates that our code works, but that it is not optimal from a memory-allocation point of view.

I guess that if we change the code to follow the recommendation above, we will probably solve a lot of the out-of-memory problems.
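
A minimal sketch of what following that recommendation could look like, using a hypothetical wrapper module rather than the actual deepspeech.pytorch model: call flatten_parameters() on the RNN at the start of forward() so cuDNN keeps the weights in one contiguous buffer instead of compacting them on every call.

```
# Hypothetical RNN wrapper, not the deepspeech.pytorch implementation.
import torch
import torch.nn as nn

class RNNBlock(nn.Module):
    def __init__(self, input_size=161, hidden_size=400):
        super().__init__()
        self.rnn = nn.LSTM(input_size, hidden_size,
                           batch_first=True, bidirectional=True)

    def forward(self, x):
        # Without this call, PyTorch warns that the RNN weights are not a
        # single contiguous chunk and re-compacts them on every forward pass.
        self.rnn.flatten_parameters()
        out, _ = self.rnn(x)
        return out

model = RNNBlock()
x = torch.randn(8, 100, 161)              # (batch, time, features)
if torch.cuda.is_available():
    model, x = model.cuda(), x.cuda()
print(model(x).shape)                     # torch.Size([8, 100, 800])
```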

0 reactions
jinserk commented, Aug 30, 2017

Well, I have no baseline against which to notice any slowdown, since all of my previous training runs failed with a seg fault. My dataset is large enough that a single epoch takes almost half a day, and I don't feel that it is particularly slower than before.


Top Results From Across the Web

  • Segmentation fault when GPUs are already used #152 - GitHub
  • Segmentation Fault when using GPU - Google Groups
  • What causes this segmentation fault (core dumped) error at cudaMemcpy when copying to GPU?
  • Segmentation faults and illegal memory address accesses ...
  • Why do I get a segmentation fault for memory checking?
