
seg fault issue by gpu memory

See original GitHub issue

Hello,

I'm using a single GTX 1070 with 8 GB of GDDR5 and trying to train deepspeech.pytorch on the TEDLIUM corpus. However, several training runs have failed with a seg fault, which I suspect is caused by an out-of-memory (OOM) condition. I've already tried reducing the batch size, but I've now found another parameter, --num_workers. Which parameter is more effective for managing the OOM issue? Could you give me some guidance on this?

```
Epoch: [1][10823/11373] Time 0.504 (0.243)      Data 0.011 (0.021)      Loss 218.2781 (164.8313)
Epoch: [1][10824/11373] Time 0.512 (0.244)      Data 0.011 (0.021)      Loss 244.8923 (164.8387)
Epoch: [1][10825/11373] Time 0.503 (0.244)      Data 0.012 (0.021)      Loss 233.4698 (164.8451)
./train.sh: line 12: 35528 Segmentation fault      (core dumped) python train.py --train_manifest data/ted/ted_train_manifest.csv --val data/ted/ted_val_manifest.csv --sample_rate 8000 --augment --batch_size 8 --epochs 100 --cuda --checkpoint --save_folder models/20170823
```
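
For context, the two flags control different resources: --batch_size sets how large each batch moved to the GPU is, which is what usually drives CUDA out-of-memory failures, while --num_workers only sets how many CPU worker processes the PyTorch DataLoader uses to prepare batches, so it mainly affects host RAM and CPU load. A minimal sketch with placeholder tensors (not the actual deepspeech.pytorch data pipeline) showing which flag maps to which DataLoader argument:

```
# Sketch only: placeholder tensors instead of the real deepspeech.pytorch
# dataset/model, just to show which flag maps to which resource.
import torch
from torch.utils.data import DataLoader, TensorDataset

features = torch.randn(1024, 161)        # stand-in for spectrogram frames
labels = torch.randint(0, 29, (1024,))   # stand-in for transcript targets
dataset = TensorDataset(features, labels)

# --num_workers -> DataLoader(num_workers=...): number of CPU processes that
#   prepare batches; it affects host RAM/CPU, not GPU memory.
# --batch_size  -> DataLoader(batch_size=...): size of each batch moved to the
#   GPU; this is what usually causes CUDA out-of-memory failures.
loader = DataLoader(dataset, batch_size=8, num_workers=4, pin_memory=True)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(161, 29).to(device)

for x, y in loader:
    x = x.to(device, non_blocking=True)  # per-step GPU allocation grows with batch_size
    out = model(x)
    break
```

On an 8 GB card, lowering --batch_size is the change most likely to resolve a GPU OOM; lowering --num_workers mostly reduces host-side memory pressure.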

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

1 reaction
dlmacedo commented, Aug 26, 2017

If you enable warnings in PyTorch 0.2.0, you get something like this:

/home/dlm/code/deepspeech.pytorch/model.py:63: UserWarning: RNN module weights are not part of single contiguous chunk of memory. This means they need to be compacted at every call, possibly greatly increasing memory usage. To compact weights again call flatten_parameters().

The warning indicates that our code works, but that it is not optimal from a memory-allocation point of view.

I guess that if we change the code to follow the recommendation above, we will probably solve a lot of the out-of-memory problems.
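
A minimal sketch of what following that recommendation could look like, using a hypothetical wrapper module rather than the actual deepspeech.pytorch model: call flatten_parameters() on the RNN at the start of forward() so cuDNN keeps the weights in one contiguous buffer instead of compacting them on every call.

```
# Hypothetical RNN wrapper, not the deepspeech.pytorch implementation.
import torch
import torch.nn as nn

class RNNBlock(nn.Module):
    def __init__(self, input_size=161, hidden_size=400):
        super().__init__()
        self.rnn = nn.LSTM(input_size, hidden_size,
                           batch_first=True, bidirectional=True)

    def forward(self, x):
        # Without this call, PyTorch warns that the RNN weights are not a
        # single contiguous chunk and re-compacts them on every forward pass.
        self.rnn.flatten_parameters()
        out, _ = self.rnn(x)
        return out

model = RNNBlock()
x = torch.randn(8, 100, 161)              # (batch, time, features)
if torch.cuda.is_available():
    model, x = model.cuda(), x.cuda()
print(model(x).shape)                     # torch.Size([8, 100, 800])
```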

0 reactions
jinserk commented, Aug 30, 2017

Well, I have no baseline against which to notice any slowdown, since all of my previous training runs failed with a seg fault. My dataset is large enough that a single epoch takes almost half a day, and I don't feel that it is particularly slower than before.


Top Results From Across the Web

  • Segmentation fault when GPUs are already used #152 - GitHub
  • Segmentation Fault when using GPU - Google Groups
  • What causes this segmentation fault (core dumped) error at cudaMemcpy when copying to GPU?
  • Segmentation faults and illegal memory address accesses ...
  • Why do I get a segmentation fault for memory checking?
