SIGSEGV error while trying to train the Levenshtein transformer
I’m trying to train the Levenshtein transformer with the suggested dataset and settings (but with --max-tokens set to 4000) on one machine with 4 V100 32GB GPUs. I’m using PyTorch 1.2 and Python 3.6 on a Scientific Linux 7.6 distribution.
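For reference, the run is launched with roughly the following command. This is a sketch based on the Levenshtein transformer settings suggested in fairseq's examples/nonautoregressive_translation README, with --max-tokens lowered to 4000 and the data-bin / save-dir paths taken from the log below; the exact hyperparameters of my invocation may differ slightly.

    python fairseq_cli/train.py data-bin/joint-bpe-37k \
        --save-dir checkpoints/levt \
        --ddp-backend=no_c10d \
        --task translation_lev \
        --criterion nat_loss \
        --arch levenshtein_transformer \
        --noise random_delete \
        --share-all-embeddings \
        --optimizer adam --adam-betas '(0.9, 0.98)' \
        --lr 0.0005 --lr-scheduler inverse_sqrt \
        --warmup-updates 10000 --warmup-init-lr '1e-07' \
        --label-smoothing 0.1 --dropout 0.3 --weight-decay 0.01 \
        --apply-bert-init \
        --fixed-validation-seed 7 \
        --max-tokens 4000 \
        --max-update 300000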
The process always fails with:
| model levenshtein_transformer, criterion LabelSmoothedDualImitationCriterion
| num. model params: 66251776 (num. trained: 66251776)
| training on 4 GPUs
| max tokens per GPU = 4000 and max sentences per GPU = None
| no existing checkpoint found checkpoints/levt/checkpoint_last.pt
| loading train data for epoch 0
| loaded 3961179 examples from: data-bin/joint-bpe-37k/train.en-de.en
| loaded 3961179 examples from: data-bin/joint-bpe-37k/train.en-de.de
| data-bin/joint-bpe-37k train en-de 3961179 examples
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/appl/python/3.6.2/lib/python3.6/multiprocessing/spawn.py", line 105, in spawn_main
exitcode = _main(fd)
File "/appl/python/3.6.2/lib/python3.6/multiprocessing/spawn.py", line 115, in _main
self = reduction.pickle.load(from_parent)
_pickle.UnpicklingError: pickle data was truncated
Traceback (most recent call last):
File "fairseq_cli/train.py", line 342, in <module>
cli_main()
File "fairseq_cli/train.py", line 334, in cli_main
nprocs=args.distributed_world_size,
File "/zhome/60/6/124738/.local/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
while not spawn_context.join():
File "/zhome/60/6/124738/.local/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 107, in join
(error_index, name)
Exception: process 1 terminated with signal SIGSEGV
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/appl/python/3.6.2/lib/python3.6/multiprocessing/spawn.py", line 105, in spawn_main
exitcode = _main(fd)
File "/appl/python/3.6.2/lib/python3.6/multiprocessing/spawn.py", line 115, in _main
self = reduction.pickle.load(from_parent)
_pickle.UnpicklingError: pickle data was truncated
/appl/python/3.6.2/lib/python3.6/multiprocessing/semaphore_tracker.py:129: UserWarning: semaphore_tracker: There appear to be 34 leaked semaphores to clean up at shutdown
len(cache))
When I try to run it on 1 GPU with --distributed-world-size 1, I simply get a Segmentation fault without any traceback.
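For completeness, the single-GPU run is just the same command with --distributed-world-size 1 added. One generic way to get at least a Python-level stack out of a bare segfault (plain CPython faulthandler, nothing fairseq-specific, and something I am only suggesting as a sketch) would be to launch it like this:

    python -X faulthandler fairseq_cli/train.py data-bin/joint-bpe-37k \
        --task translation_lev --criterion nat_loss \
        --arch levenshtein_transformer --noise random_delete \
        --max-tokens 4000 \
        --distributed-world-size 1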
What could cause this issue?
Thank you!
Issue Analytics
- Created 4 years ago
- Comments: 8 (2 by maintainers)
Top GitHub Comments
Please refer to https://github.com/pytorch/fairseq/issues/1350#issuecomment-550476700
Thank you for the suggestions @gvskalyan. Unfortunately, the same error occurs with PyTorch 1.1.0 as well.