fairseq stuck during training
With recent fairseq versions, during training of a transformer_vaswani_wmt_en_de_big
the process gets stuck, usually after an OOM batch but not necessarily.
It is reproducible with pytorch 1.0.1, 1.1.0 and nightly as of today, all with either CUDA 9 or CUDA 10, and the latest master of fairseq (39cd4ce2723d550dd54f6b14b0ed2878e10427f8).
This is the command-line invocation I’m using:
fairseq-train $DATA_DIR \
--tensorboard-logdir $CHECKPOINTS_DIR/tb \
-s en -t de \
--arch transformer_vaswani_wmt_en_de_big \
--share-all-embeddings \
--optimizer adam --adam-betas '(0.9, 0.98)' \
--lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
--lr 0.0007 --min-lr 1e-09 \
--clip-norm 0.0 \
--update-freq 8 \
--dropout 0.3 --weight-decay 0.0 \
--criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--max-tokens 3000 \
--save-dir $CHECKPOINTS_DIR
The problem happens with multiple GPUs (I reproduced it with 4 GPUs and with 2 GPUs).
Python version is 3.6. The OS is Ubuntu 16.04.2 on one machine and 18.04 on the other. The GPUs are 1080 Tis.
After getting stuck for a while with no new log lines, I press Ctrl+C and get this stack trace:
WARNING: ran out of memory with exception: CUDA out of memory. Tried to allocate 354.00 MiB (GPU 0; 10.91 GiB total capacity; 9.27 GiB already allocated; 207.38 MiB free; 913.54 MiB cached);
Skipping batch
^CTraceback (most recent call last):
File "/mnt/md0/home/noe/miniconda3/envs/nlp_pytorch/bin/fairseq-train", line 11, in <module>
load_entry_point('fairseq', 'console_scripts', 'fairseq-train')()
File "/mnt/md0/home/noe/devel/nmt-word-subword/3party/fairseq/fairseq_cli/train.py", line 419, in cli_main
nprocs=args.distributed_world_size,
File "/mnt/md0/home/noe/miniconda3/envs/nlp_pytorch/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 167, in spawn
while not spawn_context.join():
File "/mnt/md0/home/noe/miniconda3/envs/nlp_pytorch/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 73, in join
timeout=timeout,
File "/mnt/md0/home/noe/miniconda3/envs/nlp_pytorch/lib/python3.6/multiprocessing/connection.py", line 911, in wait
ready = selector.select(timeout)
File "/mnt/md0/home/noe/miniconda3/envs/nlp_pytorch/lib/python3.6/selectors.py", line 376, in select
fd_event_list = self._poll.poll(timeout)
KeyboardInterrupt
^CError in atexit._run_exitfuncs:
Traceback (most recent call last):
File "/mnt/md0/home/noe/miniconda3/envs/nlp_pytorch/lib/python3.6/multiprocessing/popen_fork.py", line 28, in poll
pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt
After Ctrl+C, I systematically need to kill the child processes manually, since they are still occupying GPU memory.
When I run with --ddp-backend no_c10d, the process does not get stuck but crashes with the following stack trace:
WARNING: ran out of memory with exception: CUDA out of memory. Tried to allocate 332.00 MiB (GPU 0; 10.91 GiB total capacity; 9.33 GiB already allocated; 299.38 MiB free; 756.70 MiB cached);
Skipping batch
/mnt/md0/home/noe/miniconda3/envs/nlp_pytorch/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 1 leaked semaphores to clean up at shutdown
len(cache))
/mnt/md0/home/noe/miniconda3/envs/nlp_pytorch/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 1 leaked semaphores to clean up at shutdown
len(cache))
/mnt/md0/home/noe/miniconda3/envs/nlp_pytorch/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 1 leaked semaphores to clean up at shutdown
len(cache))
Traceback (most recent call last):
File "/mnt/md0/home/noe/miniconda3/envs/nlp_pytorch/bin/fairseq-train", line 11, in <module>
load_entry_point('fairseq', 'console_scripts', 'fairseq-train')()
File "/mnt/md0/home/noe/devel/nmt-word-subword/3party/fairseq/fairseq_cli/train.py", line 439, in cli_main
nprocs=args.distributed_world_size,
File "/mnt/md0/home/noe/miniconda3/envs/nlp_pytorch/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 167, in spawn
while not spawn_context.join():
File "/mnt/md0/home/noe/miniconda3/envs/nlp_pytorch/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 114, in join
raise Exception(msg)
Exception:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/mnt/md0/home/noe/devel/nmt-word-subword/3party/fairseq/fairseq/distributed_utils.py", line 169, in all_gather_list
result.append(pickle.loads(bytes(out_buffer[2 : size + 2].tolist())))
_pickle.UnpicklingError: pickle data was truncated
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/mnt/md0/home/noe/miniconda3/envs/nlp_pytorch/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/mnt/md0/home/noe/devel/nmt-word-subword/3party/fairseq/fairseq_cli/train.py", line 406, in distributed_main
main(args, init_distributed=True)
File "/mnt/md0/home/noe/devel/nmt-word-subword/3party/fairseq/fairseq_cli/train.py", line 100, in main
train(args, trainer, task, epoch_itr)
File "/mnt/md0/home/noe/devel/nmt-word-subword/3party/fairseq/fairseq_cli/train.py", line 159, in train
log_output = trainer.train_step(samples)
File "/mnt/md0/home/noe/devel/nmt-word-subword/3party/fairseq/fairseq/trainer.py", line 253, in train_step
[logging_outputs, sample_sizes, ooms, self._prev_grad_norm],
File "/mnt/md0/home/noe/devel/nmt-word-subword/3party/fairseq/fairseq/distributed_utils.py", line 173, in all_gather_list
'Unable to unpickle data from other workers. all_gather_list requires all '
Exception: Unable to unpickle data from other workers. all_gather_list requires all workers to enter the function together, so this error usually indicates that the workers have fallen out of sync somehow. Workers can fall out of sync if one of them runs out of memory, or if there are other conditions in your training script that can cause one worker to finish an epoch while other workers are still iterating over their portions of the data.
The last message is clear:
'Unable to unpickle data from other workers. all_gather_list requires all '
'workers to enter the function together, so this error usually indicates '
'that the workers have fallen out of sync somehow. Workers can fall out of '
'sync if one of them runs out of memory, or if there are other conditions '
'in your training script that can cause one worker to finish an epoch '
'while other workers are still iterating over their portions of the data.'
So if a batch causes an OOM, the distributed training is doomed? This wasn’t happening a few weeks ago.
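To make the failure mode concrete, here is a minimal, fairseq-independent sketch (my own code, not fairseq’s) of how one rank silently skipping a step desynchronizes a collective call. It uses torch.distributed with the gloo backend on CPU tensors and deliberately never finishes cleanly:

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    for step in range(3):
        if rank == 0 and step == 1:
            # pretend rank 0 hit an OOM here and silently skipped the batch
            continue
        t = torch.tensor([float(step)])
        dist.all_reduce(t)  # every rank must make the same sequence of collective calls
        print(f"rank {rank} step {step} -> {t.item()}")
    # rank 1 never gets past its third all_reduce: it has no partner on rank 0,
    # so it hangs (or dies with a connection error once rank 0 exits)

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)

Rank 1’s second all_reduce ends up paired with rank 0’s later call and sums the wrong values, and its third call has no partner at all, which looks like exactly the kind of hang I’m seeing.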
Top GitHub Comments
We try to catch OOMs by skipping the batch, but sometimes it doesn’t work (often in the multi-GPU case). When that happens the workers fall out of sync and training usually gets stuck.
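Roughly, the skip-on-OOM idea looks like this (a simplified sketch, not the actual trainer code; model, criterion, optimizer and the sample dict layout are generic placeholders):

import torch

def train_step_skipping_oom(model, criterion, optimizer, sample):
    # Simplified sketch: catch a CUDA OOM raised for a single batch,
    # free what we can, and move on instead of crashing.
    try:
        loss = criterion(model(**sample["net_input"]), sample["target"])
        loss.backward()
        return loss.item()
    except RuntimeError as e:
        if "out of memory" in str(e):
            print("WARNING: ran out of memory, skipping batch")
            optimizer.zero_grad()      # drop any partial gradients
            torch.cuda.empty_cache()   # return cached blocks to the allocator
            return None                # the other ranks don't know this one skipped
        raise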
If you’re using --ddp-backend=c10d then troublesome OOMs can cause hangs. This is because the c10d DistributedDataParallel module communicates gradients during the backward pass, so we can’t really recover from an OOM during the backward pass. The no_c10d backend is more robust since it only communicates at the end of the backward pass, but there are still limits to this kind of recovery. The solution is usually to reduce the batch size (and possibly compensate for this with --update-freq).
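For the command above, that means something like halving --max-tokens (from 3000 to 1500) and doubling --update-freq (from 8 to 16): each forward/backward then runs on a smaller batch, while the effective batch per update stays roughly the same, since it is approximately max_tokens × update_freq × number of GPUs (about 3000 × 8 × 4 = 96,000 tokens in the 4-GPU run).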