[Question] Possible problems with weight initialization in NeMo ASR

Describe your question
- Weight initialization in the `ConvASREncoder` and `ConvASRDecoder` defaults to `xavier_uniform`, but the architectures use ReLU, which does best with Kaiming initialization. Why was Xavier initialization chosen?
- `ConvASREncoder` and `ConvASRDecoder` have an `init_mode` argument that delegates to `nemo.collections.asr.parts.jasper.init_weights`, which returns different results than PyTorch's `nn.init` (the weights have a different initial standard deviation) and results in significantly worse training during transfer learning in my experiments. Why aren't PyTorch defaults used?
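As a point of reference, here is a minimal sketch of how the decoder could be re-initialized with PyTorch's defaults after construction. The `asr_model` handle and the `reset_to_pytorch_defaults` helper are illustrative assumptions, not NeMo API.

```python
import torch.nn as nn

def reset_to_pytorch_defaults(module: nn.Module) -> None:
    """Re-run PyTorch's built-in default init (kaiming_uniform_ with a=sqrt(5) for convs)."""
    if isinstance(module, (nn.Conv1d, nn.Linear)):
        module.reset_parameters()

# Applied to the decoder of an already-constructed model (the `asr_model` name is assumed):
# asr_model.decoder.apply(reset_to_pytorch_defaults)
```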
Experimental Results
I tried transfer learning from QuartzNet to a dataset with a different vocab, experimenting with 1 or 2 linear layers in my ASR decoder (decoder layer code included at the end of this post). I tried initializing the decoder weights using NeMo's `xavier_uniform`, NeMo's `kaiming_uniform`, and the PyTorch defaults (kaiming uniform is the default for 1d convs). I ran 12 trials for each: 6 for 2 epochs and 6 for 1 epoch, with LR = 1e-3 (0.001). Mean loss after 1 and 2 epochs is shown below.
| Decoder init | Mean loss, 1 epoch | Mean loss, 2 epochs |
| --- | --- | --- |
| PyTorch kaiming uniform | 289.9 | 179.9 |
| NeMo kaiming_uniform | 515.5 | 581.6 |
| NeMo xavier_uniform (default) | 566.8 | 408.0 |
Standard Deviation of weights after initialization (the tables for the 2-layer and 1-layer decoders are not reproduced here)
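The kind of measurement behind those tables can be reproduced with a few lines of PyTorch. This is only a sketch: the feature and class sizes below are illustrative, not the ones used in the experiments.

```python
import math
import torch.nn as nn

FEAT_IN, NUM_CLASSES = 1024, 29  # illustrative sizes, not from the issue

def weight_std(init_fn) -> float:
    """Build a 1x1 Conv1d decoder layer, apply an init, and report the weight std."""
    conv = nn.Conv1d(FEAT_IN, NUM_CLASSES, kernel_size=1, bias=True)
    init_fn(conv.weight)
    return conv.weight.std().item()

print("PyTorch default (kaiming_uniform_, a=sqrt(5)):",
      weight_std(lambda w: nn.init.kaiming_uniform_(w, a=math.sqrt(5))))
print("NeMo-style kaiming_uniform (relu gain):",
      weight_std(lambda w: nn.init.kaiming_uniform_(w, nonlinearity="relu")))
print("NeMo default xavier_uniform:",
      weight_std(lambda w: nn.init.xavier_uniform_(w)))
```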
Environment Details
- Colab pip install: `pip install nemo-toolkit[all]==1.0.0b1`
- Python 3.6.9
- PyTorch 1.7
- OS: Ubuntu 18.04.5 LTS
Additional Details
Definition of decoders
2 Layer Decoder

```python
N_HIDDEN = 256
self.decoder_layers = torch.nn.Sequential(
    torch.nn.Conv1d(self._feat_in, N_HIDDEN, kernel_size=1, bias=True),
    torch.nn.ReLU(),
    torch.nn.Conv1d(N_HIDDEN, self._num_classes, kernel_size=1, bias=True),
)
```
1 Layer Decoder

```python
self.decoder_layers = torch.nn.Sequential(
    torch.nn.Conv1d(self._feat_in, self._num_classes, kernel_size=1, bias=True),
)
```
Awesome work @titu1994! Good to see that it isn’t a problem for training from scratch. I’m not sure but maybe batchnorm lessens the importance of init since they all seem to end up in the same place. I will keep experimenting with transfer learning and report back. I just switched my training from English to Spanish with a totally different dataset and vocab, so I will try several inits on the new set and see if it is similar to what I experienced before, or if it was just a fluke. I should be able to report back early next week.
I think I might have an idea as to why applying the default init gives different results than applying `kaiming_uniform`.

This is the default initialization PyTorch uses for all ConvNd layers. Note the `a=sqrt(5)` parameter and the default `nonlinearity` value of `leaky_relu`.
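For reference, PyTorch 1.7's `_ConvNd.reset_parameters` looks roughly like the following (a from-memory paraphrase of `torch/nn/modules/conv.py`, not a verbatim quote):

```python
import math
from torch.nn import init

# Paraphrase of torch.nn.modules.conv._ConvNd.reset_parameters (PyTorch 1.7):
# kaiming_uniform_ with a=sqrt(5) and the default nonlinearity='leaky_relu'.
def reset_parameters(self) -> None:
    init.kaiming_uniform_(self.weight, a=math.sqrt(5))
    if self.bias is not None:
        fan_in, _ = init._calculate_fan_in_and_fan_out(self.weight)
        bound = 1 / math.sqrt(fan_in)
        init.uniform_(self.bias, -bound, bound)
```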
For the `kaiming_uniform` mode in NeMo, we compute the gain using the `relu` activation, as expected. If we dive deeper into how that gain value is actually computed, it resolves to `a=0` and a different `nonlinearity` than PyTorch's default. Herein lies the difference in the gain computation.
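The relevant branches of `torch.nn.init.calculate_gain` behave roughly as follows (a paraphrased sketch of the PyTorch 1.7 source, with the other nonlinearity branches omitted):

```python
import math

def calculate_gain_excerpt(nonlinearity, param=None):
    # 'relu'       -> sqrt(2.0)
    # 'leaky_relu' -> sqrt(2.0 / (1 + negative_slope ** 2)); negative_slope defaults to 0.01
    if nonlinearity == 'relu':
        return math.sqrt(2.0)
    elif nonlinearity == 'leaky_relu':
        negative_slope = 0.01 if param is None else param
        return math.sqrt(2.0 / (1 + negative_slope ** 2))
    raise ValueError(f"unsupported nonlinearity {nonlinearity}")
```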
Now, let's manually compute the output of the `calculate_gain` method for PyTorch's `default` init and NeMo's `kaiming_uniform` init_mode:

- `default` = `calculate_gain('leaky_relu', param=sqrt(5))` = `sqrt(2.0 / (1. + 5.))` = `sqrt(1./3.)` ≈ 0.577
- `kaiming_uniform` = `calculate_gain('relu', param=0)` = `sqrt(2.0)` ≈ 1.414

This is the reason the `default` value does not match `kaiming_uniform`.
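These numbers can be checked directly against `torch.nn.init.calculate_gain`; the snippet below only verifies the arithmetic above.

```python
import math
from torch.nn import init

# Gain used by PyTorch's default Conv init (leaky_relu with a=sqrt(5)).
default_gain = init.calculate_gain('leaky_relu', math.sqrt(5))
# Gain used by NeMo's kaiming_uniform init_mode (relu).
nemo_kaiming_gain = init.calculate_gain('relu')

print(default_gain)                       # sqrt(1/3) ~= 0.577
print(nemo_kaiming_gain)                  # sqrt(2)   ~= 1.414
print(nemo_kaiming_gain / default_gain)   # sqrt(6)   ~= 2.45x larger gain
```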