SIGSEGV error while trying to train the Levenshtein transformer
I’m trying to train the Levenshtein transformer with the suggested dataset and settings (but with --max-tokens set to 4000) on one machine with 4 V100 32GB GPUs. I’m using PyTorch 1.2 and Python 3.6 on a Scientific Linux 7.6 distribution.
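For reference, the run is launched with roughly the following command. This is a sketch based on the Levenshtein transformer settings suggested in fairseq's examples/nonautoregressive_translation README, with --max-tokens lowered to 4000 and the data-bin / save-dir paths taken from the log below; the exact hyperparameters of my invocation may differ slightly.

    python fairseq_cli/train.py data-bin/joint-bpe-37k \
        --save-dir checkpoints/levt \
        --ddp-backend=no_c10d \
        --task translation_lev \
        --criterion nat_loss \
        --arch levenshtein_transformer \
        --noise random_delete \
        --share-all-embeddings \
        --optimizer adam --adam-betas '(0.9, 0.98)' \
        --lr 0.0005 --lr-scheduler inverse_sqrt \
        --warmup-updates 10000 --warmup-init-lr '1e-07' \
        --label-smoothing 0.1 --dropout 0.3 --weight-decay 0.01 \
        --apply-bert-init \
        --fixed-validation-seed 7 \
        --max-tokens 4000 \
        --max-update 300000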
The process always fails with:
| model levenshtein_transformer, criterion LabelSmoothedDualImitationCriterion
| num. model params: 66251776 (num. trained: 66251776)
| training on 4 GPUs
| max tokens per GPU = 4000 and max sentences per GPU = None
| no existing checkpoint found checkpoints/levt/checkpoint_last.pt
| loading train data for epoch 0
| loaded 3961179 examples from: data-bin/joint-bpe-37k/train.en-de.en
| loaded 3961179 examples from: data-bin/joint-bpe-37k/train.en-de.de
| data-bin/joint-bpe-37k train en-de 3961179 examples
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/appl/python/3.6.2/lib/python3.6/multiprocessing/spawn.py", line 105, in spawn_main
exitcode = _main(fd)
File "/appl/python/3.6.2/lib/python3.6/multiprocessing/spawn.py", line 115, in _main
self = reduction.pickle.load(from_parent)
_pickle.UnpicklingError: pickle data was truncated
Traceback (most recent call last):
File "fairseq_cli/train.py", line 342, in <module>
cli_main()
File "fairseq_cli/train.py", line 334, in cli_main
nprocs=args.distributed_world_size,
File "/zhome/60/6/124738/.local/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
while not spawn_context.join():
File "/zhome/60/6/124738/.local/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 107, in join
(error_index, name)
Exception: process 1 terminated with signal SIGSEGV
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/appl/python/3.6.2/lib/python3.6/multiprocessing/spawn.py", line 105, in spawn_main
exitcode = _main(fd)
File "/appl/python/3.6.2/lib/python3.6/multiprocessing/spawn.py", line 115, in _main
self = reduction.pickle.load(from_parent)
_pickle.UnpicklingError: pickle data was truncated
/appl/python/3.6.2/lib/python3.6/multiprocessing/semaphore_tracker.py:129: UserWarning: semaphore_tracker: There appear to be 34 leaked semaphores to clean up at shutdown
len(cache))
When I try to run it on 1 GPU with --distributed-world-size 1, I simply get a Segmentation fault without any traceback.
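For completeness, the single-GPU run is just the same command with --distributed-world-size 1 added. One generic way to get at least a Python-level stack out of a bare segfault (plain CPython faulthandler, nothing fairseq-specific, and something I am only suggesting as a sketch) would be to launch it like this:

    python -X faulthandler fairseq_cli/train.py data-bin/joint-bpe-37k \
        --task translation_lev --criterion nat_loss \
        --arch levenshtein_transformer --noise random_delete \
        --max-tokens 4000 \
        --distributed-world-size 1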
What could cause this issue?
Thank you!
Issue Analytics
- Created 4 years ago
- Comments: 8 (2 by maintainers)
Top GitHub Comments
Please refer to https://github.com/pytorch/fairseq/issues/1350#issuecomment-550476700
Thank you for the suggestions @gvskalyan. Unfortunately, the same error occurs with PyTorch 1.1.0 as well.