
MB-MelGAN training runs out of memory when starting evaluation

See original GitHub issue

I started fine-tuning the multiband_melgan.v1_24k Universal Vocoder with the command:

CUDA_VISIBLE_DEVICES=0 python examples/multiband_melgan/train_multiband_melgan.py \
  --train-dir ./dump_LibriTTSFormatted/train/ \
  --dev-dir ./dump_LibriTTSFormatted/valid/ \
  --outdir ./outdir/MBMELGAN/MBMelgan-Tune-Experiment1 \
  --config ./models/multiband_melgan.v1_24k.yaml \
  --use-norm 1 \
  --pretrained ./models/libritts_24k.h5

After 5000 steps it starts evaluation; at this point it runs out of memory and stops.

2020-11-05 18:51:28,699 (base_trainer:138) INFO: (Steps: 4864) Finished 19 epoch training (256 steps per epoch).
[train]:   0%|▏                    | 5000/4000000 [12:05<161:16:34,  6.88it/s]
2020-11-05 18:51:48,400 (base_trainer:566) INFO: (Step: 5000) train_adversarial_loss = 0.0000.
2020-11-05 18:51:48,401 (base_trainer:566) INFO: (Step: 5000) train_subband_spectral_convergence_loss = 0.8443.
2020-11-05 18:51:48,401 (base_trainer:566) INFO: (Step: 5000) train_subband_log_magnitude_loss = 0.8513.
2020-11-05 18:51:48,401 (base_trainer:566) INFO: (Step: 5000) train_fullband_spectral_convergence_loss = 0.8818.
2020-11-05 18:51:48,402 (base_trainer:566) INFO: (Step: 5000) train_fullband_log_magnitude_loss = 0.9599.
2020-11-05 18:51:48,402 (base_trainer:566) INFO: (Step: 5000) train_gen_loss = 1.7687.
2020-11-05 18:51:48,402 (base_trainer:566) INFO: (Step: 5000) train_real_loss = 0.0000.
2020-11-05 18:51:48,403 (base_trainer:566) INFO: (Step: 5000) train_fake_loss = 0.0000.
2020-11-05 18:51:48,403 (base_trainer:566) INFO: (Step: 5000) train_dis_loss = 0.0000.
2020-11-05 18:51:48,411 (base_trainer:418) INFO: (Steps: 5000) Start evaluation.
2020-11-05 18:51:52.343952: W tensorflow/core/common_runtime/bfc_allocator.cc:246] Allocator (GPU_0_bfc) ran out of memory trying to allocate 449.88MiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-11-05 18:51:52.381361: E tensorflow/stream_executor/cuda/cuda_fft.cc:249] failed to allocate work area.
2020-11-05 18:51:52.381371: E tensorflow/stream_executor/cuda/cuda_fft.cc:426] Initialize Params: rank: 1 elem_count: 683 input_embed: 683 input_stride: 1 input_distance: 683 output_embed: 342 output_stride: 1 output_distance: 342 batch_count: 86272
2020-11-05 18:51:52.381377: F tensorflow/stream_executor/cuda/cuda_fft.cc:435] failed to initialize batched cufft plan with customized allocator:

Any ideas?
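
For reference, a common first mitigation for this kind of evaluation-time cuFFT workspace allocation failure is to let TensorFlow grow GPU memory on demand instead of pre-allocating the whole device. The following is a minimal sketch of that workaround, not necessarily the fix that resolved this issue:

import tensorflow as tf

# Must run before any GPU op executes, e.g. at the top of the training script.
gpus = tf.config.list_physical_devices("GPU")
for gpu in gpus:
    # Allocate GPU memory incrementally rather than grabbing it all up front,
    # leaving headroom for the cuFFT work area created at evaluation time.
    tf.config.experimental.set_memory_growth(gpu, True)

The same behaviour can be enabled without code changes by exporting TF_FORCE_GPU_ALLOW_GROWTH=true before launching the training command.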

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 16 (12 by maintainers)

Top GitHub Comments

1 reaction
OscarVanL commented, Nov 6, 2020

I tested this and training now passes the evaluation stage without running out of memory, so the fix worked 😃 Thanks

0 reactions
OscarVanL commented, Nov 7, 2020

@peter05010402 Thanks for checking this

I will create a new issue and try and do some debugging 😃

I would have tried debugging with breakpoints already, but I’ve never done this before on a headless SSH session, and I can’t reproduce the same problem on my own Windows machine. But I’ll investigate how to do this soon.
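
Regarding breakpoint debugging over a headless SSH session: one option (the tool choice here is an assumption, not something mentioned in the thread) is remote attach with debugpy, which VS Code and other IDEs can reach through an SSH port forward such as ssh -L 5678:localhost:5678 user@host. A minimal sketch:

import debugpy

# Open a debug server on the remote (headless) machine.
debugpy.listen(("0.0.0.0", 5678))
print("Waiting for a debugger to attach on port 5678...")
debugpy.wait_for_client()   # blocks until the IDE connects over the forwarded port

# From here on, breakpoints set in the IDE (or debugpy.breakpoint()) take effect.
debugpy.breakpoint()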

Read more comments on GitHub >

Top Results From Across the Web

Solving Out Of Memory (OOM) Errors on Keras and ... - LinkedIn
OOM (Out Of Memory) errors can occur when building and training a neural network model on the GPU. The size of the model...
Read more >
Out of memory error during evaluation but training works fine
I think it fails during validation because the volatile flag is now deprecated and has no effect. Starting from 0.4.0, to avoid the...
Read more >
PPO trainer eating up memory - RLlib - Ray
Hi there, I'm trying to train a PPO agent via self play in my multi-agent env. At the moment it can manage about...
Read more >
Memory Hygiene With TensorFlow During Model Training and ...
The above video clearly shows the out of memory error. TensorFlow aggressively occupies the full GPU memory even though it actually doesn't need... (a sketch of capping this behaviour appears after these results).
Read more >
RuntimeError: CUDA out of memory + gpu ... - PyTorch Forums
After the first epoch of training colab throws error related to RuntimeError: CUDA out of memory for batch_size:[8,16,32,64], just before running evaluation ......
Read more >
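
Following up on the "Memory Hygiene With TensorFlow" result above: TensorFlow's default behaviour of claiming the whole GPU can be reined in either with memory growth (sketched earlier) or with an explicit per-process memory cap. A minimal sketch, where the 4096 MB figure is a placeholder rather than a value taken from this issue:

import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
if gpus:
    # Create a logical device capped at ~4 GB so the rest of the card stays free
    # for other processes and for transient work areas such as cuFFT plans.
    tf.config.set_logical_device_configuration(
        gpus[0],
        [tf.config.LogicalDeviceConfiguration(memory_limit=4096)],  # in MB
    )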
