
Training hangs on parallel training

See original GitHub issue

Hi!

I’ve been trying to run the CommonVoice Transducer recipe on French with the --data_parallel_backend option. It starts running correctly, but after ~30 minutes the training hangs and stops iterating over training examples. I am running this model on a GPU cluster with 4 TitanRTX GPUs.

I am not sure whether this is a problem with the hardware/CUDA setup, a batch size that is too large, or something going wrong internally in SpeechBrain’s model training. Has anyone encountered this error?

Here are the training logs. Training step 4177/11783 has been stuck like this for the past 20 hours.

speechbrain.tokenizers.SentencePiece - ==== Loading Tokenizer ===
speechbrain.tokenizers.SentencePiece - Tokenizer path: results/cv_transducer/2137/save/1000_unigram.model
speechbrain.tokenizers.SentencePiece - Tokenizer vocab_size: 1000
speechbrain.tokenizers.SentencePiece - Tokenizer type: unigram
speechbrain.core - Info: ckpt_interval_minutes arg from hparam file is used
speechbrain.core - 142.4M trainable parameters in ASR
speechbrain.utils.checkpoints - Would load a checkpoint here, but none found yet.
speechbrain.utils.epoch_loop - Going into epoch 1
  0%|          | 0/11783 [00:00<?, ?it/s]/sw/arch/Debian10/EB_production/2021/software/Anaconda3/2021.05/lib/python3.8/site-packages/numba/cuda/envvars.py:17: NumbaWarning: 
Environment variables with the 'NUMBAPRO' prefix are deprecated and consequently ignored, found use of NUMBAPRO_NVVM=/usr/lib/x86_64-linux-gnu/libnvvm.so.

For more information about alternatives visit: ('https://numba.pydata.org/numba-doc/latest/cuda/overview.html', '#cudatoolkit-lookup')
  warnings.warn(errors.NumbaWarning(msg))
numba.cuda.cudadrv.driver - init         
 35%|###5      | 4177/11783 [1:01:47<2:13:36,  1.05s/it, train_loss=0.856]
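
One way to narrow down where a job like this is blocked (a CUDA call, the dataloader, or the data-parallel gather) is to dump the Python stack traces of the running process. Below is a minimal sketch using the standard-library faulthandler module, assuming you are willing to add a few lines near the top of train.py; this is my own addition, not part of the recipe:

# Hypothetical addition near the top of train.py (not part of the recipe):
# every 30 minutes, dump every thread's stack trace to stderr (it will end
# up in the SLURM output file). If the job hangs, the repeated dumps show
# where each thread is blocked.
import faulthandler
import sys

faulthandler.dump_traceback_later(timeout=1800, repeat=True, file=sys.stderr)

If py-spy happens to be installed on the cluster, running py-spy dump --pid <PID> against the hung process (with the job’s process id) gives the same information without modifying the script.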

This is how I run the code on the computing cluster that I use (it uses SLURM).

#!/bin/bash

#SBATCH --partition=gpu_titanrtx
#SBATCH --gres=gpu:4
#SBATCH --job-name=TestJob
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=3
#SBATCH --time=72:15:00
#SBATCH --mem=32000M
#SBATCH --output=slurm_output_%A.out

module purge
module load 2019
module load 2021
module load Anaconda3/2021.05
module load cuDNN/7.3.1-CUDA-10.0.130

#Activate your environment
source activate sb
export NUMBAPRO_NVVM='/usr/lib/x86_64-linux-gnu/libnvvm.so'
## Run your code
## DataParallel
CUDA_VISIBLE_DEVICES=0,1,2,3 python -u ~/repos/speechbrain/recipes/CommonVoice/ASR/transducer/train.py /home/kubara/repos/speechbrain/recipes/CommonVoice/ASR/transducer/hparams/train_fr.yaml --data_parallel_backend
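
For context on why batch size is the main lever here: as far as I understand (this is an assumption about SpeechBrain internals, not something stated in the thread), --data_parallel_backend wraps the modules in torch.nn.DataParallel, which scatters each batch of 32 across the 4 GPUs (8 examples each) but assembles the input batch and gathers all outputs on cuda:0, so that device needs the most memory. A minimal standalone sketch of that behaviour with a toy model; nothing below comes from the recipe:

import torch
import torch.nn as nn

# Toy stand-in for the encoder; the real CRDNN is far larger.
model = nn.Sequential(nn.Linear(80, 1024), nn.ReLU(), nn.Linear(1024, 1000)).cuda()

# DataParallel scatters each batch across the visible GPUs
# (batch_size=32 -> 8 examples per GPU on 4 GPUs) and gathers the
# outputs back onto cuda:0, so cuda:0 needs the most headroom.
dp_model = nn.DataParallel(model, device_ids=[0, 1, 2, 3])

x = torch.randn(32, 80, device="cuda:0")  # the full batch lives on cuda:0 before the scatter
y = dp_model(x)                           # the output is gathered back on cuda:0
print(y.shape)                            # torch.Size([32, 1000])

This asymmetry is also why the OOM hypothesis discussed in the comments below seems plausible: cuda:0 can run out of memory while the other three GPUs still have headroom.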

Here is also the train_fr.yaml hparams setup. I only changed the training and validation batch sizes and set the correct paths to the data.

# ############################################################################
# Model: E2E ASR with Transducer
# Encoder: CRDNN model
# Decoder: GRU + beamsearch + RNNLM
# Tokens: BPE with unigram
# losses: Transducer
# Training: CommonVoice (French)
# Authors:  Abdel HEBA, Mirco Ravanelli, Sung-Lin Yeh 2020
# ############################################################################

# Seed needs to be set at top of yaml, before objects with parameters are made
seed: 2137
__set_seed: !!python/object/apply:torch.manual_seed [!ref <seed>]
output_folder: !ref results/cv_transducer/<seed>
wer_file: !ref <output_folder>/wer.txt
save_folder: !ref <output_folder>/save
train_log: !ref <output_folder>/train_log.txt

# Data files
data_folder: !ref /home/kubara/data/cv-corpus-7.0-2021-07-21/fr  # e.g., /localscratch/cv-corpus-5.1-2020-06-22/fr
train_tsv_file: !ref <data_folder>/train.tsv  # Standard CommonVoice .tsv files
dev_tsv_file: !ref <data_folder>/dev.tsv  # Standard CommonVoice .tsv files
test_tsv_file: !ref <data_folder>/test.tsv  # Standard CommonVoice .tsv files
accented_letters: True
language: fr # use 'it' for Italian, 'rw' for Kinyarwanda, 'en' for English
train_csv: !ref <save_folder>/train.csv
valid_csv: !ref <save_folder>/dev.csv
test_csv: !ref <save_folder>/test.csv
skip_prep: False # Skip data preparation

# We remove utterances longer than 10s in the train/dev/test sets, as
# longer sentences certainly correspond to "open microphones".
avoid_if_longer_than: 10.0

# Training parameters
number_of_epochs: 40
batch_size: 32
batch_size_valid: 16
lr: 1.0
sorting: ascending
ckpt_interval_minutes: 15 # save checkpoint every N min
# MTL for encoder with CTC (uncomment enc_lin layer)
#number_of_ctc_epochs: 2
#ctc_weight: 0.33
# MTL for decoder with CE (uncomment dec_lin layer)
#number_of_ce_epochs: 2
#ce_weight: 0.33

# Feature parameters
sample_rate: 16000
n_fft: 400
n_mels: 80

opt_class: !name:torch.optim.Adadelta
   lr: !ref <lr>
   rho: 0.95
   eps: 1.e-8

# BPE parameters
token_type: unigram  # ["unigram", "bpe", "char"]
character_coverage: 1.0

# Dataloader options
train_dataloader_opts:
   batch_size: !ref <batch_size>

valid_dataloader_opts:
   batch_size: !ref <batch_size_valid>

test_dataloader_opts:
   batch_size: !ref <batch_size_valid>

# Model parameters
activation: !name:torch.nn.LeakyReLU
dropout: 0.15
cnn_blocks: 3
cnn_channels: (128, 200, 256)
inter_layer_pooling_size: (2, 2, 2)
cnn_kernelsize: (3, 3)
time_pooling_size: 4
rnn_class: !name:speechbrain.nnet.RNN.LSTM
rnn_layers: 5
rnn_neurons: 1024
rnn_bidirectional: True
dnn_blocks: 2
dnn_neurons: 1024
dec_neurons: 1024
output_neurons: 1000  # index(blank/eos/bos) = 0
joint_dim: 1024
blank_index: 0

# Decoding parameters
beam_size: 4
nbest: 1
# by default {state,expand}_beam = 2.3, as mentioned in the paper
# https://arxiv.org/abs/1904.02619
state_beam: 2.3
expand_beam: 2.3

epoch_counter: !new:speechbrain.utils.epoch_loop.EpochCounter
   limit: !ref <number_of_epochs>

normalize: !new:speechbrain.processing.features.InputNormalization
   norm_type: global

compute_features: !new:speechbrain.lobes.features.Fbank
   sample_rate: !ref <sample_rate>
   n_fft: !ref <n_fft>
   n_mels: !ref <n_mels>

# Frequency domain SpecAugment
augmentation: !new:speechbrain.lobes.augment.SpecAugment
   time_warp: True
   time_warp_window: 5
   time_warp_mode: bicubic
   freq_mask: True
   n_freq_mask: 2
   time_mask: True
   n_time_mask: 2
   replace_with_zero: False
   freq_mask_width: 30
   time_mask_width: 40

enc: !new:speechbrain.lobes.models.CRDNN.CRDNN
   input_shape: [null, null, !ref <n_mels>]
   activation: !ref <activation>
   dropout: !ref <dropout>
   cnn_blocks: !ref <cnn_blocks>
   cnn_channels: !ref <cnn_channels>
   cnn_kernelsize: !ref <cnn_kernelsize>
   inter_layer_pooling_size: !ref <inter_layer_pooling_size>
   time_pooling: True
   using_2d_pooling: False
   time_pooling_size: !ref <time_pooling_size>
   rnn_class: !ref <rnn_class>
   rnn_layers: !ref <rnn_layers>
   rnn_neurons: !ref <rnn_neurons>
   rnn_bidirectional: !ref <rnn_bidirectional>
   rnn_re_init: True
   dnn_blocks: !ref <dnn_blocks>
   dnn_neurons: !ref <dnn_neurons>

# For MTL CTC over the encoder
# enc_lin: !new:speechbrain.nnet.linear.Linear
#     input_size: !ref <dnn_neurons>
#     n_neurons: !ref <joint_dim>
#
# ctc_cost: !name:speechbrain.nnet.ctc_loss
#    blank_index: !ref <blank_index>

emb: !new:speechbrain.nnet.embedding.Embedding
   num_embeddings: !ref <output_neurons>
   consider_as_one_hot: True
   blank_id: !ref <blank_index>

dec: !new:speechbrain.nnet.RNN.GRU
   input_shape: [null, null, !ref <output_neurons> - 1]
   hidden_size: !ref <dec_neurons>
   num_layers: 1
   re_init: True

# For MTL with LM over the decoder
# dec_lin: !new:speechbrain.nnet.linear.Linear
#     input_size: !ref <dec_neurons>
#     n_neurons: !ref <joint_dim>
#     bias: False
#
# ce_cost: !name:speechbrain.nnet.nll_loss
#    label_smoothing: 0.1

Tjoint: !new:speechbrain.nnet.transducer.transducer_joint.Transducer_joint
   joint: sum # joint [sum | concat]
   nonlinearity: !ref <activation>

transducer_lin: !new:speechbrain.nnet.linear.Linear
   input_size: !ref <joint_dim>
   n_neurons: !ref <output_neurons>
   bias: False

log_softmax: !new:speechbrain.nnet.activations.Softmax
   apply_log: True

transducer_cost: !name:speechbrain.nnet.losses.transducer_loss
   blank_index: !ref <blank_index>

# for MTL
# update model if any HEAD module is added
modules:
   enc: !ref <enc>
   emb: !ref <emb>
   dec: !ref <dec>
   Tjoint: !ref <Tjoint>
   transducer_lin: !ref <transducer_lin>
   normalize: !ref <normalize>
   augmentation: !ref <augmentation>

# for MTL
# update model if any HEAD module is added
model: !new:torch.nn.ModuleList
   - [!ref <enc>, !ref <emb>, !ref <dec>, !ref <transducer_lin>]

greedy_searcher: !new:speechbrain.decoders.transducer.TransducerBeamSearcher
   decode_network_lst: [!ref <emb>, !ref <dec>]
   tjoint: !ref <Tjoint>
   classifier_network: [!ref <transducer_lin>]
   blank_id: !ref <blank_index>
   beam_size: 1
   nbest: 1

beam_searcher: !new:speechbrain.decoders.transducer.TransducerBeamSearcher
   decode_network_lst: [!ref <emb>, !ref <dec>]
   tjoint: !ref <Tjoint>
   classifier_network: [!ref <transducer_lin>]
   blank_id: !ref <blank_index>
   beam_size: !ref <beam_size>
   nbest: !ref <nbest>
   state_beam: !ref <state_beam>
   expand_beam: !ref <expand_beam>

lr_annealing: !new:speechbrain.nnet.schedulers.NewBobScheduler
   initial_value: !ref <lr>
   improvement_threshold: 0.0025
   annealing_factor: 0.8
   patient: 0

checkpointer: !new:speechbrain.utils.checkpoints.Checkpointer
   checkpoints_dir: !ref <save_folder>
   recoverables:
      model: !ref <model>
      scheduler: !ref <lr_annealing>
      normalizer: !ref <normalize>
      counter: !ref <epoch_counter>

train_logger: !new:speechbrain.utils.train_logger.FileTrainLogger
   save_file: !ref <train_log>

error_rate_computer: !name:speechbrain.utils.metric_stats.ErrorRateStats

cer_computer: !name:speechbrain.utils.metric_stats.ErrorRateStats
   split_tokens: True

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 5

Top GitHub Comments

1 reaction
mravanelli commented, Dec 7, 2021

Yes, you might wanna play with the batch size a bit.

On Tue, 7 Dec 2021 at 08:23, Kacper Kubara @.***> wrote:

My guess is that it might be related to a silent CUDA OOM failure. Once I reverse sorting: ascending to sorting: descending, I get an OOM straight away.


0 reactions
KacperKubara commented, Dec 13, 2021

Ok great, thanks. I decreased the batch size even further and I am now able to train it.
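
For anyone hitting the same issue: the sorting: descending trick quoted above is a quick way to surface a hidden OOM, because the largest batches are fed first. A more targeted variant is to push a single worst-case batch through the model before submitting a multi-hour SLURM job. The sketch below is a generic illustration; the model argument and the frame-rate arithmetic are my assumptions, not part of the recipe:

import torch

def smoke_test(model, batch_size=16, max_seconds=10.0, n_mels=80):
    """Run one forward/backward pass on a worst-case batch so an OOM
    surfaces immediately rather than an hour into training.

    `model` is any module taking (batch, time, n_mels) features; this is a
    generic sketch, not the recipe's Brain class.
    """
    # ~10 s of audio at a 10 ms frame hop is roughly 1000 feature frames,
    # matching avoid_if_longer_than: 10.0 in the hparams.
    n_frames = int(max_seconds * 100)
    feats = torch.randn(batch_size, n_frames, n_mels, device="cuda")
    model(feats).sum().backward()
    print(f"peak GPU memory: {torch.cuda.max_memory_allocated() / 1e9:.1f} GB")

If this worst-case batch fits (with some margin for the decoder, joiner, and optimizer state), a given batch_size is much more likely to survive a full epoch.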

