MB-MelGAN training runs out of memory when starting evaluation
I started fine-tuning the multiband_melgan.v1_24k Universal Vocoder with the following command:
CUDA_VISIBLE_DEVICES=0 python examples/multiband_melgan/train_multiband_melgan.py \
--train-dir ./dump_LibriTTSFormatted/train/ \
--dev-dir ./dump_LibriTTSFormatted/valid/ \
--outdir ./outdir/MBMELGAN/MBMelgan-Tune-Experiment1 \
--config ./models/multiband_melgan.v1_24k.yaml \
--use-norm 1 \
--pretrained ./models/libritts_24k.h5
After 5000 steps it starts evaluation; at this point it runs out of memory and stops.
2020-11-05 18:51:28,699 (base_trainer:138) INFO: (Steps: 4864) Finished 19 epoch training (256 steps per epoch).
[train]: 0%|▏ | 5000/4000000 [12:05<161:16:34, 6.88it/s]2020-11-05 18:51:48,400 (base_trainer:566) INFO: (Step: 5000) train_adversarial_loss = 0.0000.
2020-11-05 18:51:48,401 (base_trainer:566) INFO: (Step: 5000) train_subband_spectral_convergence_loss = 0.8443.
2020-11-05 18:51:48,401 (base_trainer:566) INFO: (Step: 5000) train_subband_log_magnitude_loss = 0.8513.
2020-11-05 18:51:48,401 (base_trainer:566) INFO: (Step: 5000) train_fullband_spectral_convergence_loss = 0.8818.
2020-11-05 18:51:48,402 (base_trainer:566) INFO: (Step: 5000) train_fullband_log_magnitude_loss = 0.9599.
2020-11-05 18:51:48,402 (base_trainer:566) INFO: (Step: 5000) train_gen_loss = 1.7687.
2020-11-05 18:51:48,402 (base_trainer:566) INFO: (Step: 5000) train_real_loss = 0.0000.
2020-11-05 18:51:48,403 (base_trainer:566) INFO: (Step: 5000) train_fake_loss = 0.0000.
2020-11-05 18:51:48,403 (base_trainer:566) INFO: (Step: 5000) train_dis_loss = 0.0000.
2020-11-05 18:51:48,411 (base_trainer:418) INFO: (Steps: 5000) Start evaluation.
2020-11-05 18:51:52.343952: W tensorflow/core/common_runtime/bfc_allocator.cc:246] Allocator (GPU_0_bfc) ran out of memory trying to allocate 449.88MiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-11-05 18:51:52.381361: E tensorflow/stream_executor/cuda/cuda_fft.cc:249] failed to allocate work area.
2020-11-05 18:51:52.381371: E tensorflow/stream_executor/cuda/cuda_fft.cc:426] Initialize Params: rank: 1 elem_count: 683 input_embed: 683 input_stride: 1 input_distance: 683 output_embed: 342 output_stride: 1 output_distance: 342 batch_count: 86272
2020-11-05 18:51:52.381377: F tensorflow/stream_executor/cuda/cuda_fft.cc:435] failed to initialize batched cufft plan with customized allocator:
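The cuFFT failure above appears to come from the STFT-based losses computed during evaluation. One mitigation I have seen suggested for this kind of bfc_allocator failure is enabling GPU memory growth, so TensorFlow allocates device memory on demand instead of reserving the whole GPU up front. A minimal sketch, assuming TensorFlow 2.x and run before any op touches the GPU (this is illustrative, not part of train_multiband_melgan.py, and I have not confirmed it solves this case):

import tensorflow as tf

# Request on-demand GPU memory allocation instead of pre-allocating the
# whole device; must be called before the GPU is initialized.
for gpu in tf.config.list_physical_devices("GPU"):
    try:
        tf.config.experimental.set_memory_growth(gpu, True)
    except RuntimeError as e:
        # Raised if the GPU has already been initialized.
        print(e)

The same behaviour can also be requested without code changes by exporting TF_FORCE_GPU_ALLOW_GROWTH=true before launching training.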
Any ideas?
I tested this and training now passes the evaluation stage without running out of memory, so the fix worked 😃 Thanks
@peter05010402 Thanks for checking this
I will create a new issue and try to do some debugging 😃
I would have tried debugging with breakpoints already, but I’ve never done this before on a headless SSH session, and I can’t reproduce the same problem on my own Windows machine. But I’ll investigate how to do this soon.
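For what it's worth, a minimal sketch of breakpoint debugging that works over a plain SSH terminal, using Python's built-in pdb (the placement is illustrative, not a known location of the bug):

import pdb

# Pause execution at the point of interest (e.g. just before the evaluation
# loop); the interactive (Pdb) prompt appears directly in the SSH terminal.
pdb.set_trace()

# Useful commands at the (Pdb) prompt:
#   p <expr>   print a value
#   n          execute the next line
#   c          continue until the next breakpoint

On Python 3.7+ the built-in breakpoint() call does the same thing, and the whole script can also be started under the debugger with python -m pdb <script>.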