
OOM occurred during the middle of training

See original GitHub issue

I can fine-tune the model at first; it even trains through all of epoch 1. However, it runs out of memory in epoch 2, at around 4517/21194. I tried changing settings like total_num_updates and update_freq several times, but it didn't help. Do you have any idea why the OOM occurs in the middle of training, and could you give me some tips? Looking forward to your help. The log looks like this:

2020-11-06 22:55:35 | WARNING | fairseq.trainer | OOM: Ran out of memory with exception: CUDA out of memory. Tried to allocate 28.00 MiB (GPU 1; 10.92 GiB total capacity; 10.13 GiB already allocated; 13.38 MiB free; 10.33 GiB reserved in total by PyTorch)
...
2020-11-06 22:55:35 | WARNING | fairseq.trainer | attempting to recover from OOM in forward/backward pass
Traceback (most recent call last):
  File "/data/rwd/anaconda3/envs/fairseq/bin/fairseq-train", line 33, in <module>
    sys.exit(load_entry_point('fairseq', 'console_scripts', 'fairseq-train')())
  File "/data/rwd/fairseq/fairseq_cli/train.py", line 352, in cli_main
    distributed_utils.call_main(args, main)
  File "/data/rwd/fairseq/fairseq/distributed_utils.py", line 254, in call_main
    nprocs=args.distributed_num_procs,
  File "/data/rwd/anaconda3/envs/fairseq/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/data/rwd/anaconda3/envs/fairseq/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception: 
-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/data/rwd/fairseq/fairseq/distributed_utils.py", line 339, in all_gather_list
    result.append(pickle.loads(bytes(out_buffer[header_size:header_size + enc_size].tolist())))
_pickle.UnpicklingError: unpickling stack underflow
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/data/rwd/anaconda3/envs/fairseq/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/data/rwd/fairseq/fairseq/distributed_utils.py", line 238, in distributed_main
    main(args, **kwargs)
  File "/data/rwd/fairseq/fairseq_cli/train.py", line 125, in main
    valid_losses, should_stop = train(args, trainer, task, epoch_itr)
  File "/data/rwd/anaconda3/envs/fairseq/lib/python3.6/contextlib.py", line 52, in inner
    return func(*args, **kwds)
  File "/data/rwd/fairseq/fairseq_cli/train.py", line 208, in train
    log_output = trainer.train_step(samples)
  File "/data/rwd/anaconda3/envs/fairseq/lib/python3.6/contextlib.py", line 52, in inner
    return func(*args, **kwds)
  File "/data/rwd/fairseq/fairseq/trainer.py", line 531, in train_step
    logging_outputs, sample_size, ooms, train_time, ignore=is_dummy_batch,
  File "/data/rwd/fairseq/fairseq/trainer.py", line 885, in _aggregate_logging_outputs
    logging_outputs, *extra_stats_to_sum, ignore=ignore
  File "/data/rwd/fairseq/fairseq/trainer.py", line 906, in _all_gather_list_sync
    group=self.data_parallel_process_group,
  File "/data/rwd/fairseq/fairseq/distributed_utils.py", line 343, in all_gather_list
    'Unable to unpickle data from other workers. all_gather_list requires all '
Exception: Unable to unpickle data from other workers. all_gather_list requires all workers to enter the function together, so this error usually indicates that the workers have fallen out of sync somehow. Workers can fall out of sync if one of them runs out of memory, or if there are other conditions in your training script that can cause one worker to finish an epoch while other workers are still iterating over their portions of the data. Try rerunning with --ddp-backend=no_c10d and see if that helps.
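
For context on the "attempting to recover from OOM in forward/backward pass" warning above, the usual recovery pattern looks roughly like the sketch below. This is a hedged illustration with made-up function and argument names, not fairseq's actual trainer code.

import torch

def train_step_with_oom_recovery(model, batch, optimizer):
    # Hedged sketch of an OOM-recovery pattern, not fairseq's real trainer code.
    try:
        loss = model(**batch)              # hypothetical forward pass returning a scalar loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
        return loss.item()
    except RuntimeError as err:
        if "out of memory" in str(err):
            # Drop the half-built graph and cached blocks, then skip this batch.
            optimizer.zero_grad(set_to_none=True)
            torch.cuda.empty_cache()
            return None                    # this rank now did less work than its peers
        raise

A rank that takes the except branch skips a batch and no longer lines up with the other ranks at collective calls such as all_gather_list, which is exactly the desynchronization the final exception message describes.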

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 8 (3 by maintainers)

Top GitHub Comments

1 reaction
monologue1107 commented on Nov 12, 2020

The semaphore thing comes from the dataloader processes that load data asynchronously. That's the message you get if the process is killed somehow (e.g. when your OOM crashes it).

Thanks for your answer. It seems to be solved when I add --ddp-backend=no_c10d and decrease update_freq to 1.
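
As a side note on the update_freq part of that fix: fairseq's --update-freq controls gradient accumulation, i.e. how many batches are processed before each optimizer step. A rough sketch of the pattern, with illustrative names rather than fairseq's actual implementation:

import torch

def accumulate_and_step(model, batches, optimizer, update_freq=1):
    # Hedged sketch of gradient accumulation, as controlled by --update-freq.
    optimizer.zero_grad(set_to_none=True)
    for i, batch in enumerate(batches):
        loss = model(**batch) / update_freq      # hypothetical forward; scale to average over the group
        loss.backward()                          # gradients accumulate in .grad across batches
        if (i + 1) % update_freq == 0:
            optimizer.step()                     # one parameter update per update_freq batches
            optimizer.zero_grad(set_to_none=True)

With update_freq set to 1, as in the fix reported here, every batch triggers its own optimizer step.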

0 reactions
liaohit commented on May 12, 2022

I just tried adding --ddp-backend=no_c10d and decreasing update_freq to 1, but the warning still exists.

| WARNING | fairseq.trainer | OOM: Ran out of memory with exception: CUDA out of memory. Tried to allocate 28.00 MiB (GPU 1; 10.92 GiB total capacity; 10.13 GiB already allocated; 13.38 MiB free; 10.33 GiB reserved in total by PyTorch)
| WARNING | fairseq.trainer | attempting to recover from OOM in forward/backward pass
Traceback
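
When the warning persists even with those flags, it can help to check whether GPU memory actually creeps up across epochs or whether the run is simply sitting too close to the card's ~10.92 GiB capacity. A small, hedged monitoring helper (the function name is illustrative; the torch.cuda queries are standard PyTorch) could be called at logging intervals:

import torch

def log_gpu_memory(tag, device=0):
    # Print current and peak allocations to spot growth across epochs.
    gib = 1024 ** 3
    allocated = torch.cuda.memory_allocated(device) / gib
    peak = torch.cuda.max_memory_allocated(device) / gib
    print(f"[{tag}] allocated={allocated:.2f} GiB, peak={peak:.2f} GiB")
    torch.cuda.reset_peak_memory_stats(device)   # start a fresh peak window for the next interval

A steadily climbing peak suggests something is being retained between updates, while a flat peak just under the card's capacity usually means the per-GPU batch is simply too large (for example, a lower --max-tokens may be needed).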

Read more comments on GitHub >

Top Results From Across the Web

  • OOM occurred during the middle of training · Issue #2867
    I can fine-tune the model at first, even it can train entirely in epoch 1. However, it will become OOM in epoch 2...
  • How to make sure the training phase won't be facing an OOM?
    The second scenario that I'm facing OOM is when the training process starts, and it goes on for some time. Maybe even a...
  • Solving Out Of Memory (OOM) Errors on Keras and ... - LinkedIn
    OOM (Out Of Memory) errors can occur when building and training a neural network model on the GPU. The size of the model...
  • [resolved] Out of memory in the medium of training, always ...
    I am wondering if something are saved because of this line. Many thanks in advance. Following is my code: def train(epoch): netD.train() netG....
  • Dumps occur because of OOM condition in SQL Server 2019 ...
    Symptoms. When Out-Of-Memory (OOM) condition occurs while running OPENROWSET query that reads a parquet file in SQL Server 2019, dumps may occur.
