[Question] Possible problems with weight initialization in NeMo ASR

Describe your question
- Weight initialization in the `ConvASREncoder` and `ConvASRDecoder` defaults to `xavier_uniform`, but the architectures use ReLU, which does best with Kaiming initialization. Why was Xavier initialization chosen?
- `ConvASREncoder` and `ConvASRDecoder` have an `init_mode` argument that delegates to `nemo.collections.asr.parts.jasper.init_weights`, which returns different results than PyTorch's `nn.init` (the weights have a different initial standard deviation) and results in significantly worse training during transfer learning in my experiments. Why aren't PyTorch defaults used?
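As a point of reference, here is a minimal sketch of how the decoder could be re-initialized with PyTorch's defaults after construction. The `asr_model` handle and the `reset_to_pytorch_defaults` helper are illustrative assumptions, not NeMo API.

```python
import torch.nn as nn

def reset_to_pytorch_defaults(module: nn.Module) -> None:
    """Re-run PyTorch's built-in default init (kaiming_uniform_ with a=sqrt(5) for convs)."""
    if isinstance(module, (nn.Conv1d, nn.Linear)):
        module.reset_parameters()

# Applied to the decoder of an already-constructed model (the `asr_model` name is assumed):
# asr_model.decoder.apply(reset_to_pytorch_defaults)
```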
Experimental Results
I tried transfer learning from QuartzNet to a dataset with a different vocab, experimenting with 1 or 2 linear layers in my ASR decoder (decoder layer code included at the end of this post). I tried initializing the decoder weights using NeMo's `xavier_uniform`, NeMo's `kaiming_uniform`, and the PyTorch defaults (kaiming uniform is the default for 1d convs). I ran 12 trials for each: 6 for 2 epochs and 6 for 1 epoch, with LR = 1e-3 (0.001). Mean loss after 1 and 2 epochs is shown below.
| Decoder init | Mean loss, 1 epoch | Mean loss, 2 epochs |
| --- | --- | --- |
| PyTorch kaiming uniform | 289.9 | 179.9 |
| NeMo kaiming_uniform | 515.5 | 581.6 |
| NeMo xavier_uniform (default) | 566.8 | 408.0 |
Standard Deviation of weights after initialization (the tables for the 2-layer and 1-layer decoders are not reproduced here)
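The kind of measurement behind those tables can be reproduced with a few lines of PyTorch. This is only a sketch: the feature and class sizes below are illustrative, not the ones used in the experiments.

```python
import math
import torch.nn as nn

FEAT_IN, NUM_CLASSES = 1024, 29  # illustrative sizes, not from the issue

def weight_std(init_fn) -> float:
    """Build a 1x1 Conv1d decoder layer, apply an init, and report the weight std."""
    conv = nn.Conv1d(FEAT_IN, NUM_CLASSES, kernel_size=1, bias=True)
    init_fn(conv.weight)
    return conv.weight.std().item()

print("PyTorch default (kaiming_uniform_, a=sqrt(5)):",
      weight_std(lambda w: nn.init.kaiming_uniform_(w, a=math.sqrt(5))))
print("NeMo-style kaiming_uniform (relu gain):",
      weight_std(lambda w: nn.init.kaiming_uniform_(w, nonlinearity="relu")))
print("NeMo default xavier_uniform:",
      weight_std(lambda w: nn.init.xavier_uniform_(w)))
```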
Environment Details
- Colab pip install: `pip install nemo-toolkit[all]==1.0.0b1`
- Python 3.6.9
- PyTorch 1.7
- OS: Ubuntu 18.04.5 LTS
Additional Details
Definition of decoders
2 Layer Decoder

```python
N_HIDDEN = 256
self.decoder_layers = torch.nn.Sequential(
    torch.nn.Conv1d(self._feat_in, N_HIDDEN, kernel_size=1, bias=True),
    torch.nn.ReLU(),
    torch.nn.Conv1d(N_HIDDEN, self._num_classes, kernel_size=1, bias=True),
)
```
1 Layer Decoder

```python
self.decoder_layers = torch.nn.Sequential(
    torch.nn.Conv1d(self._feat_in, self._num_classes, kernel_size=1, bias=True),
)
```
Awesome work @titu1994! Good to see that it isn’t a problem for training from scratch. I’m not sure but maybe batchnorm lessens the importance of init since they all seem to end up in the same place. I will keep experimenting with transfer learning and report back. I just switched my training from English to Spanish with a totally different dataset and vocab, so I will try several inits on the new set and see if it is similar to what I experienced before, or if it was just a fluke. I should be able to report back early next week.
I think I might have an idea as to why applying the default init gives different results than applying `kaiming_uniform`.

This is the default initialization PyTorch uses for all ConvNd layers. Note the `a=sqrt(5)` parameter and the default `nonlinearity` value of `leaky_relu`.
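For reference, PyTorch 1.7's `_ConvNd.reset_parameters` looks roughly like the following (a from-memory paraphrase of `torch/nn/modules/conv.py`, not a verbatim quote):

```python
import math
from torch.nn import init

# Paraphrase of torch.nn.modules.conv._ConvNd.reset_parameters (PyTorch 1.7):
# kaiming_uniform_ with a=sqrt(5) and the default nonlinearity='leaky_relu'.
def reset_parameters(self) -> None:
    init.kaiming_uniform_(self.weight, a=math.sqrt(5))
    if self.bias is not None:
        fan_in, _ = init._calculate_fan_in_and_fan_out(self.weight)
        bound = 1 / math.sqrt(fan_in)
        init.uniform_(self.bias, -bound, bound)
```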
For the `kaiming_uniform` mode in NeMo, we compute the gain using the `relu` activation, as expected. If we dive deeper into how that gain value is actually computed, it resolves to `a=0` and a different `nonlinearity` than PyTorch's default. Herein lies the difference in the gain computation.
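The relevant branches of `torch.nn.init.calculate_gain` behave roughly as follows (a paraphrased sketch of the PyTorch 1.7 source, with the other nonlinearity branches omitted):

```python
import math

def calculate_gain_excerpt(nonlinearity, param=None):
    # 'relu'       -> sqrt(2.0)
    # 'leaky_relu' -> sqrt(2.0 / (1 + negative_slope ** 2)); negative_slope defaults to 0.01
    if nonlinearity == 'relu':
        return math.sqrt(2.0)
    elif nonlinearity == 'leaky_relu':
        negative_slope = 0.01 if param is None else param
        return math.sqrt(2.0 / (1 + negative_slope ** 2))
    raise ValueError(f"unsupported nonlinearity {nonlinearity}")
```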
Now, let's manually compute the output of the `calculate_gain` method for PyTorch's `default` init and NeMo's `kaiming_uniform` init_mode:

- `default` = `calculate_gain('leaky_relu', param=sqrt(5))` = `sqrt(2.0 / (1. + 5.))` = `sqrt(1./3.)` ≈ 0.577
- `kaiming_uniform` = `calculate_gain('relu', param=0)` = `sqrt(2.0)` ≈ 1.414

This is the reason the `default` value does not match `kaiming_uniform`.
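These numbers can be checked directly against `torch.nn.init.calculate_gain`; the snippet below only verifies the arithmetic above.

```python
import math
from torch.nn import init

# Gain used by PyTorch's default Conv init (leaky_relu with a=sqrt(5)).
default_gain = init.calculate_gain('leaky_relu', math.sqrt(5))
# Gain used by NeMo's kaiming_uniform init_mode (relu).
nemo_kaiming_gain = init.calculate_gain('relu')

print(default_gain)                       # sqrt(1/3) ~= 0.577
print(nemo_kaiming_gain)                  # sqrt(2)   ~= 1.414
print(nemo_kaiming_gain / default_gain)   # sqrt(6)   ~= 2.45x larger gain
```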