
MB-MelGAN training runs out of memory when starting evaluation

See original GitHub issue

I started fine-tuning the multiband_melgan.v1_24k Universal Vocoder with the command:

CUDA_VISIBLE_DEVICES=0 python examples/multiband_melgan/train_multiband_melgan.py \
  --train-dir ./dump_LibriTTSFormatted/train/ \
  --dev-dir ./dump_LibriTTSFormatted/valid/ \
  --outdir ./outdir/MBMELGAN/MBMelgan-Tune-Experiment1 \
  --config ./models/multiband_melgan.v1_24k.yaml \
  --use-norm 1 \
  --pretrained ./models/libritts_24k.h5

After 5000 steps it starts evaluation; at this point it runs out of memory and stops.

2020-11-05 18:51:28,699 (base_trainer:138) INFO: (Steps: 4864) Finished 19 epoch training (256 steps per epoch).
[train]:   0%|▏                    | 5000/4000000 [12:05<161:16:34,  6.88it/s]
2020-11-05 18:51:48,400 (base_trainer:566) INFO: (Step: 5000) train_adversarial_loss = 0.0000.
2020-11-05 18:51:48,401 (base_trainer:566) INFO: (Step: 5000) train_subband_spectral_convergence_loss = 0.8443.
2020-11-05 18:51:48,401 (base_trainer:566) INFO: (Step: 5000) train_subband_log_magnitude_loss = 0.8513.
2020-11-05 18:51:48,401 (base_trainer:566) INFO: (Step: 5000) train_fullband_spectral_convergence_loss = 0.8818.
2020-11-05 18:51:48,402 (base_trainer:566) INFO: (Step: 5000) train_fullband_log_magnitude_loss = 0.9599.
2020-11-05 18:51:48,402 (base_trainer:566) INFO: (Step: 5000) train_gen_loss = 1.7687.
2020-11-05 18:51:48,402 (base_trainer:566) INFO: (Step: 5000) train_real_loss = 0.0000.
2020-11-05 18:51:48,403 (base_trainer:566) INFO: (Step: 5000) train_fake_loss = 0.0000.
2020-11-05 18:51:48,403 (base_trainer:566) INFO: (Step: 5000) train_dis_loss = 0.0000.
2020-11-05 18:51:48,411 (base_trainer:418) INFO: (Steps: 5000) Start evaluation.
2020-11-05 18:51:52.343952: W tensorflow/core/common_runtime/bfc_allocator.cc:246] Allocator (GPU_0_bfc) ran out of memory trying to allocate 449.88MiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-11-05 18:51:52.381361: E tensorflow/stream_executor/cuda/cuda_fft.cc:249] failed to allocate work area.
2020-11-05 18:51:52.381371: E tensorflow/stream_executor/cuda/cuda_fft.cc:426] Initialize Params: rank: 1 elem_count: 683 input_embed: 683 input_stride: 1 input_distance: 683 output_embed: 342 output_stride: 1 output_distance: 342 batch_count: 86272
2020-11-05 18:51:52.381377: F tensorflow/stream_executor/cuda/cuda_fft.cc:435] failed to initialize batched cufft plan with customized allocator:

Any ideas?
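
For reference, a common first mitigation for this kind of evaluation-time cuFFT workspace allocation failure is to let TensorFlow grow GPU memory on demand instead of pre-allocating the whole device. The following is a minimal sketch of that workaround, not necessarily the fix that resolved this issue:

import tensorflow as tf

# Must run before any GPU op executes, e.g. at the top of the training script.
gpus = tf.config.list_physical_devices("GPU")
for gpu in gpus:
    # Allocate GPU memory incrementally rather than grabbing it all up front,
    # leaving headroom for the cuFFT work area created at evaluation time.
    tf.config.experimental.set_memory_growth(gpu, True)

The same behaviour can be enabled without code changes by exporting TF_FORCE_GPU_ALLOW_GROWTH=true before launching the training command.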

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 16 (12 by maintainers)

Top GitHub Comments

1 reaction
OscarVanL commented, Nov 6, 2020

I tested this and training now passes the evaluation stage without running out of memory, so the fix worked 😃 Thanks

0 reactions
OscarVanL commented, Nov 7, 2020

@peter05010402 Thanks for checking this

I will create a new issue and try and do some debugging 😃

I would have tried debugging with breakpoints already, but I’ve never done this before on a headless SSH session, and I can’t reproduce the same problem on my own Windows machine. But I’ll investigate how to do this soon.
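
Regarding breakpoint debugging over a headless SSH session: one option (the tool choice here is an assumption, not something mentioned in the thread) is remote attach with debugpy, which VS Code and other IDEs can reach through an SSH port forward such as ssh -L 5678:localhost:5678 user@host. A minimal sketch:

import debugpy

# Open a debug server on the remote (headless) machine.
debugpy.listen(("0.0.0.0", 5678))
print("Waiting for a debugger to attach on port 5678...")
debugpy.wait_for_client()   # blocks until the IDE connects over the forwarded port

# From here on, breakpoints set in the IDE (or debugpy.breakpoint()) take effect.
debugpy.breakpoint()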

Read more comments on GitHub >

Top Results From Across the Web

Solving Out Of Memory (OOM) Errors on Keras and ... - LinkedIn
OOM (Out Of Memory) errors can occur when building and training a neural network model on the GPU. The size of the model...
Read more >
Out of memory error during evaluation but training works fine
I think it fails during validation because the volatile flag is now deprecated and has no effect. Starting from 0.4.0, to avoid the...
Read more >
PPO trainer eating up memory - RLlib - Ray
Hi there, I'm trying to train a PPO agent via self play in my multi-agent env. At the moment it can manage about...
Read more >
Memory Hygiene With TensorFlow During Model Training and ...
The above video clearly shows the out of memory error. TensorFlow aggressively occupies the full GPU memory even though it actually doesn't need... (a sketch of capping this behaviour appears after these results).
Read more >
RuntimeError: CUDA out of memory + gpu ... - PyTorch Forums
After the first epoch of training colab throws error related to RuntimeError: CUDA out of memory for batch_size:[8,16,32,64], just before running evaluation ......
Read more >
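
Following up on the "Memory Hygiene With TensorFlow" result above: TensorFlow's default behaviour of claiming the whole GPU can be reined in either with memory growth (sketched earlier) or with an explicit per-process memory cap. A minimal sketch, where the 4096 MB figure is a placeholder rather than a value taken from this issue:

import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
if gpus:
    # Create a logical device capped at ~4 GB so the rest of the card stays free
    # for other processes and for transient work areas such as cuFFT plans.
    tf.config.set_logical_device_configuration(
        gpus[0],
        [tf.config.LogicalDeviceConfiguration(memory_limit=4096)],  # in MB
    )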
