SIGSEGV error while trying to train the Levenshtein transformer

I’m trying to train the Levenshtein transformer with the suggested dataset and settings (except that max-tokens is set to 4000) on one machine with four V100 32GB GPUs. I’m using PyTorch 1.2 and Python 3.6 on a Scientific Linux 7.6 distribution.

The process always fails with:

| model levenshtein_transformer, criterion LabelSmoothedDualImitationCriterion
| num. model params: 66251776 (num. trained: 66251776)
| training on 4 GPUs
| max tokens per GPU = 4000 and max sentences per GPU = None
| no existing checkpoint found checkpoints/levt/checkpoint_last.pt
| loading train data for epoch 0
| loaded 3961179 examples from: data-bin/joint-bpe-37k/train.en-de.en
| loaded 3961179 examples from: data-bin/joint-bpe-37k/train.en-de.de
| data-bin/joint-bpe-37k train en-de 3961179 examples
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/appl/python/3.6.2/lib/python3.6/multiprocessing/spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "/appl/python/3.6.2/lib/python3.6/multiprocessing/spawn.py", line 115, in _main
    self = reduction.pickle.load(from_parent)
_pickle.UnpicklingError: pickle data was truncated
Traceback (most recent call last):
  File "fairseq_cli/train.py", line 342, in <module>
    cli_main()
  File "fairseq_cli/train.py", line 334, in cli_main
    nprocs=args.distributed_world_size,
  File "/zhome/60/6/124738/.local/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/zhome/60/6/124738/.local/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 107, in join
    (error_index, name)
Exception: process 1 terminated with signal SIGSEGV
~/repos/fairseq
n-62-20-9(s172185) $ 
~/repos/fairseq
n-62-20-9(s172185) $ Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/appl/python/3.6.2/lib/python3.6/multiprocessing/spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "/appl/python/3.6.2/lib/python3.6/multiprocessing/spawn.py", line 115, in _main
    self = reduction.pickle.load(from_parent)
_pickle.UnpicklingError: pickle data was truncated
/appl/python/3.6.2/lib/python3.6/multiprocessing/semaphore_tracker.py:129: UserWarning: semaphore_tracker: There appear to be 34 leaked semaphores to clean up at shutdown
  len(cache))

When I run it on 1 GPU with the --distributed-world-size 1 setting, I simply get a segmentation fault without any traceback.

What could cause this issue?

Thank you!
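
A segmentation fault with no traceback, as described above, gives little to go on. One generic way to get more information (a suggestion that does not come from the thread itself) is Python's standard-library `faulthandler` module, which prints the Python-level stack when the interpreter receives SIGSEGV. The sketch below is a minimal, self-contained demonstration, not fairseq code: the child script and the `run_crashing_child` helper are hypothetical names for illustration, the crash is triggered deliberately with `ctypes.string_at(0)`, and Linux behaviour is assumed.

```python
import subprocess
import sys

# Hypothetical child script: enables faulthandler, then crashes on purpose
# by reading from address 0 through ctypes (assumes Linux behaviour).
CHILD = """
import ctypes
import faulthandler

faulthandler.enable()   # dump the Python stack to stderr on SIGSEGV and similar signals
ctypes.string_at(0)     # deliberate invalid read, stands in for the real crash
"""


def run_crashing_child():
    """Run the crashing script in a separate interpreter and capture its stderr."""
    return subprocess.run(
        [sys.executable, "-c", CHILD],
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode stderr as text (works on Python 3.6)
    )


result = run_crashing_child()
print("exit status:", result.returncode)
print(result.stderr)  # the faulthandler report, including the Python stack
```

The same handler can be switched on without editing any code, either with `python -X faulthandler fairseq_cli/train.py ...` or by setting the `PYTHONFAULTHANDLER=1` environment variable before launching. The environment variable is inherited by child processes, so it may also cover the spawned workers in the multi-GPU run. The resulting stack only shows which Python call was active when the crash happened; it does not by itself identify the underlying cause.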

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 8 (2 by maintainers)

Top GitHub Comments

2 reactions
danhorvath commented, Nov 1, 2019

Thank you for the suggestions @gvskalyan. Unfortunately, the same error occurs with PyTorch 1.1.0 as well.
