OOM occurred during the middle of training
See original GitHub issue.
I can fine-tune the model at first; it even trains through all of epoch 1. However, it runs out of memory (OOM) in epoch 2, at around update 4517/21194. I tried changing settings such as total_num_updates and update_freq several times, but it didn't help. Do you have any idea why the OOM occurs in the middle of training, and could you give me some tips? Looking forward to your kind help. The log is shown below:
2020-11-06 22:55:35 | WARNING | fairseq.trainer | OOM: Ran out of memory with exception: CUDA out of memory. Tried to allocate 28.00 MiB (GPU 1; 10.92 GiB total capacity; 10.13 GiB already allocated; 13.38 MiB free; 10.33 GiB reserved in total by PyTorch)
...
2020-11-06 22:55:35 | WARNING | fairseq.trainer | attempting to recover from OOM in forward/backward pass
Traceback (most recent call last):
File "/data/rwd/anaconda3/envs/fairseq/bin/fairseq-train", line 33, in <module>
sys.exit(load_entry_point('fairseq', 'console_scripts', 'fairseq-train')())
File "/data/rwd/fairseq/fairseq_cli/train.py", line 352, in cli_main
distributed_utils.call_main(args, main)
File "/data/rwd/fairseq/fairseq/distributed_utils.py", line 254, in call_main
nprocs=args.distributed_num_procs,
File "/data/rwd/anaconda3/envs/fairseq/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
while not spawn_context.join():
File "/data/rwd/anaconda3/envs/fairseq/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 118, in join
raise Exception(msg)
Exception:
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/data/rwd/fairseq/fairseq/distributed_utils.py", line 339, in all_gather_list
result.append(pickle.loads(bytes(out_buffer[header_size:header_size + enc_size].tolist())))
_pickle.UnpicklingError: unpickling stack underflow
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/data/rwd/anaconda3/envs/fairseq/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/data/rwd/fairseq/fairseq/distributed_utils.py", line 238, in distributed_main
main(args, **kwargs)
File "/data/rwd/fairseq/fairseq_cli/train.py", line 125, in main
valid_losses, should_stop = train(args, trainer, task, epoch_itr)
File "/data/rwd/anaconda3/envs/fairseq/lib/python3.6/contextlib.py", line 52, in inner
return func(*args, **kwds)
File "/data/rwd/fairseq/fairseq_cli/train.py", line 208, in train
log_output = trainer.train_step(samples)
File "/data/rwd/anaconda3/envs/fairseq/lib/python3.6/contextlib.py", line 52, in inner
return func(*args, **kwds)
File "/data/rwd/fairseq/fairseq/trainer.py", line 531, in train_step
logging_outputs, sample_size, ooms, train_time, ignore=is_dummy_batch,
File "/data/rwd/fairseq/fairseq/trainer.py", line 885, in _aggregate_logging_outputs
logging_outputs, *extra_stats_to_sum, ignore=ignore
File "/data/rwd/fairseq/fairseq/trainer.py", line 906, in _all_gather_list_sync
group=self.data_parallel_process_group,
File "/data/rwd/fairseq/fairseq/distributed_utils.py", line 343, in all_gather_list
'Unable to unpickle data from other workers. all_gather_list requires all '
Exception: Unable to unpickle data from other workers. all_gather_list requires all workers to enter the function together, so this error usually indicates that the workers have fallen out of sync somehow. Workers can fall out of sync if one of them runs out of memory, or if there are other conditions in your training script that can cause one worker to finish an epoch while other workers are still iterating over their portions of the data. Try rerunning with --ddp-backend=no_c10d and see if that helps.
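The second exception is a symptom rather than the root cause: as the message explains, once one worker runs out of memory the workers can fall out of sync, the buffers they exchange in all_gather_list then fail to unpickle, and the suggested first step is to rerun with --ddp-backend=no_c10d. The earlier "attempting to recover from OOM in forward/backward pass" warning corresponds to a catch-and-skip pattern along the lines of the sketch below (hypothetical model/criterion/optimizer/batch arguments; this illustrates the general PyTorch recovery idiom, not fairseq's actual trainer code):

    # Illustrative sketch only, not fairseq's trainer code: catch a CUDA OOM
    # raised during forward/backward, drop the gradients, release cached
    # allocator blocks, and skip the offending batch.
    import torch

    def train_step_with_oom_recovery(model, criterion, optimizer, batch):
        try:
            loss = criterion(model(batch["input"]), batch["target"])
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            return loss.item()
        except RuntimeError as e:
            if "out of memory" not in str(e):
                raise                    # unrelated error, re-raise it
            optimizer.zero_grad()        # discard any partial gradients
            torch.cuda.empty_cache()     # return cached blocks to the allocator pool
            return None                  # signal that this batch was skipped

Skipping a batch keeps training alive on the rank that OOMed, but it is exactly this divergence in behaviour across ranks that can leave the workers out of step in the subsequent collective call, which is why lowering per-GPU memory pressure (for example a smaller --max-tokens, or --fp16) is usually the more robust fix than relying on recovery.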
Issue Analytics
- Created: 3 years ago
- Comments: 8 (3 by maintainers)
Top GitHub Comments
Thanks for your answer. It seems to be solved now that I have added --ddp-backend=no_c10d and decreased update_freq to 1.
I just tried adding --ddp-backend=no_c10d and decreasing update_freq to 1, but the warning still exists:
| WARNING | fairseq.trainer | OOM: Ran out of memory with exception: CUDA out of memory. Tried to allocate 28.00 MiB (GPU 1; 10.92 GiB total capacity; 10.13 GiB already allocated; 13.38 MiB free; 10.33 GiB reserved in total by PyTorch)
| WARNING | fairseq.trainer | attempting to recover from OOM in forward/backward pass
Traceback
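If the warning keeps recurring even with --ddp-backend=no_c10d and update_freq=1, it is worth checking whether per-GPU memory is genuinely creeping upward across the epoch (for example when unusually long sentences end up batched together late in the data order) or whether the allocator is simply running close to the 10.92 GiB limit; note that the failed allocation above is only 28 MiB, with 10.13 GiB already allocated. A small diagnostic sketch (a hypothetical helper, not part of fairseq) that can be called from a training loop:

    # Hypothetical diagnostic helper, not part of fairseq: print CUDA
    # allocator statistics every `every` steps so a gradual rise in peak
    # usage is visible before the OOM actually happens.
    import torch

    def log_gpu_memory(step, device=0, every=100):
        if step % every != 0 or not torch.cuda.is_available():
            return
        gib = 2 ** 30
        allocated = torch.cuda.memory_allocated(device) / gib
        reserved = torch.cuda.memory_reserved(device) / gib
        peak = torch.cuda.max_memory_allocated(device) / gib
        print(f"step {step}: allocated {allocated:.2f} GiB, "
              f"reserved {reserved:.2f} GiB, peak {peak:.2f} GiB")

If peak usage climbs steadily, lowering --max-tokens (so the largest batches shrink) or enabling --fp16 reduces the per-step footprint more directly than changing update_freq, which mainly controls how many batches are accumulated before each optimizer update.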