
OOM occurred during the middle of training

See original GitHub issue

I can fine-tune the model at first; it even trains through all of epoch 1. However, it runs out of memory in epoch 2, at around 4517/21194. I tried changing settings like total_num_updates and update_freq several times, but it didn't help. Do you have any idea why the OOM occurs in the middle of training, and could you give me some tips? Looking forward to your help. The log looks like this:

2020-11-06 22:55:35 | WARNING | fairseq.trainer | OOM: Ran out of memory with exception: CUDA out of memory. Tried to allocate 28.00 MiB (GPU 1; 10.92 GiB total capacity; 10.13 GiB already allocated; 13.38 MiB free; 10.33 GiB reserved in total by PyTorch)
...
2020-11-06 22:55:35 | WARNING | fairseq.trainer | attempting to recover from OOM in forward/backward pass
Traceback (most recent call last):
  File "/data/rwd/anaconda3/envs/fairseq/bin/fairseq-train", line 33, in <module>
    sys.exit(load_entry_point('fairseq', 'console_scripts', 'fairseq-train')())
  File "/data/rwd/fairseq/fairseq_cli/train.py", line 352, in cli_main
    distributed_utils.call_main(args, main)
  File "/data/rwd/fairseq/fairseq/distributed_utils.py", line 254, in call_main
    nprocs=args.distributed_num_procs,
  File "/data/rwd/anaconda3/envs/fairseq/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/data/rwd/anaconda3/envs/fairseq/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception: 
-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/data/rwd/fairseq/fairseq/distributed_utils.py", line 339, in all_gather_list
    result.append(pickle.loads(bytes(out_buffer[header_size:header_size + enc_size].tolist())))
_pickle.UnpicklingError: unpickling stack underflow
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/data/rwd/anaconda3/envs/fairseq/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/data/rwd/fairseq/fairseq/distributed_utils.py", line 238, in distributed_main
    main(args, **kwargs)
  File "/data/rwd/fairseq/fairseq_cli/train.py", line 125, in main
    valid_losses, should_stop = train(args, trainer, task, epoch_itr)
  File "/data/rwd/anaconda3/envs/fairseq/lib/python3.6/contextlib.py", line 52, in inner
    return func(*args, **kwds)
  File "/data/rwd/fairseq/fairseq_cli/train.py", line 208, in train
    log_output = trainer.train_step(samples)
  File "/data/rwd/anaconda3/envs/fairseq/lib/python3.6/contextlib.py", line 52, in inner
    return func(*args, **kwds)
  File "/data/rwd/fairseq/fairseq/trainer.py", line 531, in train_step
    logging_outputs, sample_size, ooms, train_time, ignore=is_dummy_batch,
  File "/data/rwd/fairseq/fairseq/trainer.py", line 885, in _aggregate_logging_outputs
    logging_outputs, *extra_stats_to_sum, ignore=ignore
  File "/data/rwd/fairseq/fairseq/trainer.py", line 906, in _all_gather_list_sync
    group=self.data_parallel_process_group,
  File "/data/rwd/fairseq/fairseq/distributed_utils.py", line 343, in all_gather_list
    'Unable to unpickle data from other workers. all_gather_list requires all '
Exception: Unable to unpickle data from other workers. all_gather_list requires all workers to enter the function together, so this error usually indicates that the workers have fallen out of sync somehow. Workers can fall out of sync if one of them runs out of memory, or if there are other conditions in your training script that can cause one worker to finish an epoch while other workers are still iterating over their portions of the data. Try rerunning with --ddp-backend=no_c10d and see if that helps.
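
For context on the "attempting to recover from OOM in forward/backward pass" warning above, the usual recovery pattern looks roughly like the sketch below. This is a hedged illustration with made-up function and argument names, not fairseq's actual trainer code.

import torch

def train_step_with_oom_recovery(model, batch, optimizer):
    # Hedged sketch of an OOM-recovery pattern, not fairseq's real trainer code.
    try:
        loss = model(**batch)              # hypothetical forward pass returning a scalar loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
        return loss.item()
    except RuntimeError as err:
        if "out of memory" in str(err):
            # Drop the half-built graph and cached blocks, then skip this batch.
            optimizer.zero_grad(set_to_none=True)
            torch.cuda.empty_cache()
            return None                    # this rank now did less work than its peers
        raise

A rank that takes the except branch skips a batch and no longer lines up with the other ranks at collective calls such as all_gather_list, which is exactly the desynchronization the final exception message describes.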

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 8 (3 by maintainers)

Top GitHub Comments

1 reaction
monologue1107 commented on Nov 12, 2020

The semaphore thing comes from the dataloader processes that load data asynchronously. That's the message you get if the process is killed somehow (e.g. when your OOM crashes it).

Thanks for your answer. It seems to be solved when I add --ddp-backend=no_c10d and decrease update_freq to 1.
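
As a side note on the update_freq part of that fix: fairseq's --update-freq controls gradient accumulation, i.e. how many batches are processed before each optimizer step. A rough sketch of the pattern, with illustrative names rather than fairseq's actual implementation:

import torch

def accumulate_and_step(model, batches, optimizer, update_freq=1):
    # Hedged sketch of gradient accumulation, as controlled by --update-freq.
    optimizer.zero_grad(set_to_none=True)
    for i, batch in enumerate(batches):
        loss = model(**batch) / update_freq      # hypothetical forward; scale to average over the group
        loss.backward()                          # gradients accumulate in .grad across batches
        if (i + 1) % update_freq == 0:
            optimizer.step()                     # one parameter update per update_freq batches
            optimizer.zero_grad(set_to_none=True)

With update_freq set to 1, as in the fix reported here, every batch triggers its own optimizer step.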

0 reactions
liaohit commented on May 12, 2022

I just tried adding --ddp-backend=no_c10d and decreasing update_freq to 1, but the warning still exists.

| WARNING | fairseq.trainer | OOM: Ran out of memory with exception: CUDA out of memory. Tried to allocate 28.00 MiB (GPU 1; 10.92 GiB total capacity; 10.13 GiB already allocated; 13.38 MiB free; 10.33 GiB reserved in total by PyTorch)
| WARNING | fairseq.trainer | attempting to recover from OOM in forward/backward pass
Traceback
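
When the warning persists even with those flags, it can help to check whether GPU memory actually creeps up across epochs or whether the run is simply sitting too close to the card's ~10.92 GiB capacity. A small, hedged monitoring helper (the function name is illustrative; the torch.cuda queries are standard PyTorch) could be called at logging intervals:

import torch

def log_gpu_memory(tag, device=0):
    # Print current and peak allocations to spot growth across epochs.
    gib = 1024 ** 3
    allocated = torch.cuda.memory_allocated(device) / gib
    peak = torch.cuda.max_memory_allocated(device) / gib
    print(f"[{tag}] allocated={allocated:.2f} GiB, peak={peak:.2f} GiB")
    torch.cuda.reset_peak_memory_stats(device)   # start a fresh peak window for the next interval

A steadily climbing peak suggests something is being retained between updates, while a flat peak just under the card's capacity usually means the per-GPU batch is simply too large (for example, a lower --max-tokens may be needed).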

Read more comments on GitHub >

Top Results From Across the Web

  • OOM occurred during the middle of training · Issue #2867
    I can fine-tune the model at first, even it can train entirely in epoch 1. However, it will become OOM in epoch 2...
  • How to make sure the training phase won't be facing an OOM?
    The second scenario that I'm facing OOM is when the training process starts, and it goes on for some time. Maybe even a...
  • Solving Out Of Memory (OOM) Errors on Keras and ... - LinkedIn
    OOM (Out Of Memory) errors can occur when building and training a neural network model on the GPU. The size of the model...
  • [resolved] Out of memory in the medium of training, always ...
    I am wondering if something are saved because of this line. Many thanks in advance. Following is my code: def train(epoch): netD.train() netG....
  • Dumps occur because of OOM condition in SQL Server 2019 ...
    Symptoms. When Out-Of-Memory (OOM) condition occurs while running OPENROWSET query that reads a parquet file in SQL Server 2019, dumps may occur.
