
fairseq stuck during training

See original GitHub issue

With recent fairseq versions, training a transformer_vaswani_wmt_en_de_big gets stuck partway through, usually after an OOM batch but not necessarily.

It is reproducible with PyTorch 1.0.1, 1.1.0 and today's nightly, with either CUDA 9 or CUDA 10, and the latest master of fairseq (39cd4ce2723d550dd54f6b14b0ed2878e10427f8).

This is the command-line invocation I'm using:

fairseq-train $DATA_DIR \
  --tensorboard-logdir $CHECKPOINTS_DIR/tb \
  -s en -t de \
  --arch transformer_vaswani_wmt_en_de_big \
  --share-all-embeddings \
  --optimizer adam --adam-betas '(0.9, 0.98)' \
  --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
  --lr 0.0007 --min-lr 1e-09 \
  --clip-norm 0.0 \
  --update-freq 8 \
  --dropout 0.3 --weight-decay 0.0 \
  --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
  --max-tokens 3000 \
  --save-dir $CHECKPOINTS_DIR

The problem happens with multiple GPUs (I reproduced it with 4 GPUs and with 2 GPUs).

The Python version is 3.6. The OS is Ubuntu 16.04.2 on one machine and 18.04 on the other. The GPUs are 1080 Tis.

After it gets stuck for a while with no new log lines, I Ctrl+C it and get this stack trace:

WARNING: ran out of memory with exception: CUDA out of memory. Tried to allocate 354.00 MiB (GPU 0; 10.91 GiB total capacity; 9.27 GiB already allocated; 207.38 MiB free; 913.54 MiB cached);
 Skipping batch

^CTraceback (most recent call last):                                                                                                     
  File "/mnt/md0/home/noe/miniconda3/envs/nlp_pytorch/bin/fairseq-train", line 11, in <module>                                           
    load_entry_point('fairseq', 'console_scripts', 'fairseq-train')()                                                                    
  File "/mnt/md0/home/noe/devel/nmt-word-subword/3party/fairseq/fairseq_cli/train.py", line 419, in cli_main                             
    nprocs=args.distributed_world_size,                                                                                                  
  File "/mnt/md0/home/noe/miniconda3/envs/nlp_pytorch/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 167, in spawn    
    while not spawn_context.join():                                                                                                      
  File "/mnt/md0/home/noe/miniconda3/envs/nlp_pytorch/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 73, in join      
    timeout=timeout,                                                                                                                     
  File "/mnt/md0/home/noe/miniconda3/envs/nlp_pytorch/lib/python3.6/multiprocessing/connection.py", line 911, in wait                    
    ready = selector.select(timeout)                                                          
  File "/mnt/md0/home/noe/miniconda3/envs/nlp_pytorch/lib/python3.6/selectors.py", line 376, in select                                   
    fd_event_list = self._poll.poll(timeout)                                                                                             
KeyboardInterrupt
^CError in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/mnt/md0/home/noe/miniconda3/envs/nlp_pytorch/lib/python3.6/multiprocessing/popen_fork.py", line 28, in poll
    pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt

After Ctrl+C, I systematically have to kill the child processes manually, since they keep occupying GPU memory.

When I run with --ddp-backend no_c10d, the process does not get stuck, but it crashes with the following stack trace:

WARNING: ran out of memory with exception: CUDA out of memory. Tried to allocate 332.00 MiB (GPU 0; 10.91 GiB total capacity; 9.33 GiB already allocated; 299.38 MiB free; 756.70 MiB cached);
 Skipping batch
/mnt/md0/home/noe/miniconda3/envs/nlp_pytorch/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 1 leaked semaphores to clean up at shutdown
  len(cache))
/mnt/md0/home/noe/miniconda3/envs/nlp_pytorch/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 1 leaked semaphores to clean up at shutdown
  len(cache))
/mnt/md0/home/noe/miniconda3/envs/nlp_pytorch/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 1 leaked semaphores to clean up at shutdown
  len(cache))
Traceback (most recent call last):        
  File "/mnt/md0/home/noe/miniconda3/envs/nlp_pytorch/bin/fairseq-train", line 11, in <module>                                                                             
    load_entry_point('fairseq', 'console_scripts', 'fairseq-train')()                
  File "/mnt/md0/home/noe/devel/nmt-word-subword/3party/fairseq/fairseq_cli/train.py", line 439, in cli_main                                                               
    nprocs=args.distributed_world_size,   
  File "/mnt/md0/home/noe/miniconda3/envs/nlp_pytorch/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 167, in spawn                                      
    while not spawn_context.join():       
  File "/mnt/md0/home/noe/miniconda3/envs/nlp_pytorch/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 114, in join                                       
    raise Exception(msg)                  
Exception:                                

-- Process 0 terminated with the following error:                                    
Traceback (most recent call last):        
  File "/mnt/md0/home/noe/devel/nmt-word-subword/3party/fairseq/fairseq/distributed_utils.py", line 169, in all_gather_list                                                
    result.append(pickle.loads(bytes(out_buffer[2 : size + 2].tolist())))            
_pickle.UnpicklingError: pickle data was truncated                                   

During handling of the above exception, another exception occurred:                  

Traceback (most recent call last):        
  File "/mnt/md0/home/noe/miniconda3/envs/nlp_pytorch/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap                                       
    fn(i, *args)                          
  File "/mnt/md0/home/noe/devel/nmt-word-subword/3party/fairseq/fairseq_cli/train.py", line 406, in distributed_main                                                       
    main(args, init_distributed=True)     
  File "/mnt/md0/home/noe/devel/nmt-word-subword/3party/fairseq/fairseq_cli/train.py", line 100, in main                                                                   
    train(args, trainer, task, epoch_itr) 
  File "/mnt/md0/home/noe/devel/nmt-word-subword/3party/fairseq/fairseq_cli/train.py", line 159, in train                                                                  
    log_output = trainer.train_step(samples)                                         
  File "/mnt/md0/home/noe/devel/nmt-word-subword/3party/fairseq/fairseq/trainer.py", line 253, in train_step                                                               
    [logging_outputs, sample_sizes, ooms, self._prev_grad_norm],                     
  File "/mnt/md0/home/noe/devel/nmt-word-subword/3party/fairseq/fairseq/distributed_utils.py", line 173, in all_gather_list                                                
    'Unable to unpickle data from other workers. all_gather_list requires all '      
Exception: Unable to unpickle data from other workers. all_gather_list requires all workers to enter the function together, so this error usually indicates that the workers have fallen out of sync somehow. Workers can fall out of sync if one of them runs out of memory, or if there are other conditions in your training script that can cause one worker to finish an epoch while other workers are still iterating over their portions of the data.

The last message is clear:

    Unable to unpickle data from other workers. all_gather_list requires all workers to enter the function together, so this error usually indicates that the workers have fallen out of sync somehow. Workers can fall out of sync if one of them runs out of memory, or if there are other conditions in your training script that can cause one worker to finish an epoch while other workers are still iterating over their portions of the data.

So, if a batch causes an OOM, is distributed training doomed? This wasn't happening a few weeks ago.
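
To make that failure mode concrete, here is a simplified, self-contained sketch of gathering a pickled object through a fixed-size byte buffer, which is roughly what all_gather_list does. This is not fairseq's actual implementation; the buffer size, the 2-byte header encoding and the write_message/read_message helpers are illustrative. It shows why a header and payload that no longer match (for example because one worker skipped a batch and the ranks stopped entering the collective together) surface as "pickle data was truncated":

import pickle

BUFFER_SIZE = 4096  # fixed per-worker buffer, as a raw-byte all_gather would need

def write_message(buf, obj):
    # Pickle the object and store it after a 2-byte length header.
    payload = pickle.dumps(obj)
    size = len(payload)
    buf[0] = size // 255
    buf[1] = size % 255
    buf[2:2 + size] = payload

def read_message(buf):
    # Recover the length from the header, then unpickle exactly that many bytes.
    size = buf[0] * 255 + buf[1]
    return pickle.loads(bytes(buf[2:2 + size]))

buf = bytearray(BUFFER_SIZE)

# Workers in sync: header and payload agree, unpickling succeeds.
write_message(buf, {"loss": 3.21, "ntokens": 3000})
print(read_message(buf))

# Workers out of sync: the payload in the buffer no longer corresponds to the
# header this rank reads (e.g. a peer skipped its batch), so the pickle stream
# is cut short.
buf[0], buf[1] = 0, 10  # header now claims a 10-byte message
try:
    read_message(buf)
except (pickle.UnpicklingError, EOFError) as e:
    print(type(e).__name__, e)  # typically: pickle data was truncated

Under the c10d backend the same desynchronization happens inside the gradient all-reduce during the backward pass instead, which would explain why the symptom there is a hang rather than a crash (see the maintainer comments below).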

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Reactions: 4
  • Comments: 12 (7 by maintainers)

Top GitHub Comments

4 reactions
huihuifan commented, May 13, 2019

We try to catch the OOM by skipping the batch, but sometimes that doesn't work (often in the multi-GPU case). Usually this causes training to get stuck when the workers are no longer in sync.
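
The pattern being described looks roughly like the minimal sketch below; it is illustrative rather than fairseq's actual trainer code, and model, criterion, optimizer and batch are placeholders:

import torch

def train_step(model, criterion, optimizer, batch):
    try:
        optimizer.zero_grad()
        loss = criterion(model(batch["input"]), batch["target"])
        loss.backward()
        optimizer.step()
        return loss.item()
    except RuntimeError as e:
        if "out of memory" in str(e):
            # Skip the batch: drop the partial gradients and release cached
            # blocks so the next batch has a chance to fit.
            print("WARNING: ran out of memory, skipping batch")
            optimizer.zero_grad()
            torch.cuda.empty_cache()
            return None
        raise

In a multi-GPU run each worker makes this decision on its own, which is exactly how one rank can end up skipping a batch that its peers still process, i.e. how the workers fall out of sync.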

2 reactions
myleott commented, May 15, 2019

what happens to the “troublesome OOMs” in that catch block?

If you’re using --ddp-backend=c10d then troublesome OOMs can cause hangs. This is because the c10d DistributedDataParallel module communicates gradients during the backward pass, so we can’t really recover from an OOM during the backward pass. The no_c10d backend is more robust since it only communicates at the end of the backward pass, but there are still limits to this kind of recovery.

The solution is usually to reduce batch size (and possibly compensate for this with --update-freq).
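
To put rough numbers on that for the command above, assuming the effective batch per update is approximately num_gpus × max_tokens × update_freq:

def effective_tokens(num_gpus, max_tokens, update_freq):
    # Rough upper bound on the tokens contributing to each optimizer update.
    return num_gpus * max_tokens * update_freq

# Original settings from the issue, on the 2-GPU machine.
print(effective_tokens(2, 3000, 8))    # 48000
# Halving --max-tokens and doubling --update-freq keeps the effective batch
# the same while shrinking each forward/backward pass that must fit in memory.
print(effective_tokens(2, 1500, 16))   # 48000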

Read more comments on GitHub >

