
there is no need to rewrite the 'class LayerNorm(nn.Module)'


The reason to rewrite 'class LayerNorm(nn.Module)' is that you believe the LayerNorm provided by PyTorch only supports the 'channels_last' format (batch_size, height, width, channels), so you wrote a new version to support the 'channels_first' format (batch_size, channels, height, width). However, I found that F.layer_norm and nn.LayerNorm do not require a particular order of channels, height, and width, because F.layer_norm derives the dimensions it reduces over from the last dimensions of the input, using 'normalized_shape' to compute the mean and variance.
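
For illustration, a minimal sketch of my own (not from the issue) showing how 'normalized_shape' selects the trailing dimensions that F.layer_norm reduces over; the shapes are made up for the example:

import torch
import torch.nn.functional as F

x = torch.randn(2, 56, 56, 96)  # hypothetical channels_last tensor (N, H, W, C)

# normalized_shape=(96,): statistics are computed over the last dim only,
# i.e. over channels at each (n, h, w) position
y1 = F.layer_norm(x, normalized_shape=(96,))

# normalized_shape=(56, 56, 96): statistics are computed over the last three
# dims, i.e. over every value of each sample
y2 = F.layer_norm(x, normalized_shape=(56, 56, 96))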

Specifically, the PyTorch implementation uses every value in an image to compute a single mean and variance, and every value in that image is normalized with these two numbers. Your implementation, in contrast, uses the values across channels at each spatial position, so it computes a separate mean and variance for every spatial position.
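
To make the difference concrete, a hedged sketch of my own (assuming a channels_first tensor of shape (N, C, H, W)) of the two kinds of statistics described above:

import torch

x = torch.randn(4, 96, 7, 7)  # (N, C, H, W)

# one mean/variance per sample, computed over every value in (C, H, W)
u_image = x.mean(dim=[1, 2, 3], keepdim=True)                 # shape (N, 1, 1, 1)
s_image = x.var(dim=[1, 2, 3], unbiased=False, keepdim=True)

# one mean/variance per spatial position, over channels only (the repo's LayerNorm)
u_pos = x.mean(dim=1, keepdim=True)                           # shape (N, 1, H, W)
s_pos = x.var(dim=1, unbiased=False, keepdim=True)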

When I changed the following code in convnext.py, I found that it does the same thing as 'F.layer_norm' or 'nn.LayerNorm' in PyTorch. https://github.com/facebookresearch/ConvNeXt/blob/d1fa8f6fef0a165b27399986cc2bdacc92777e40/models/convnext.py#L119

u = x.mean([1, 2, 3], keepdim=True)
# u = x.mean(1, keepdim=True)  # original code
s = (x - u).pow(2).mean([1, 2, 3], keepdim=True)
# s = (x - u).pow(2).mean(1, keepdim=True)  # original code
x = (x - u) / torch.sqrt(s + self.eps)  # unchanged line from the original forward()
x = self.weight[None, :] * x + self.bias[None, :]
# x = self.weight[:, None, None] * x + self.bias[:, None, None]  # original code
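
As a rough check of the claimed equivalence (my own sketch, not from the issue): the statistics over dims [1, 2, 3] match F.layer_norm when normalized_shape covers all non-batch dimensions and no affine transform is applied:

import torch
import torch.nn.functional as F

x = torch.randn(2, 96, 14, 14)
eps = 1e-6

u = x.mean([1, 2, 3], keepdim=True)
s = (x - u).pow(2).mean([1, 2, 3], keepdim=True)
manual = (x - u) / torch.sqrt(s + eps)

# normalized_shape = all non-batch dims, no weight/bias
reference = F.layer_norm(x, normalized_shape=x.shape[1:], eps=eps)
print(torch.allclose(manual, reference, atol=1e-5))  # expected: True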

There is no need to rewrite 'class LayerNorm(nn.Module)'; it is just a misunderstanding about the LayerNorm implementation.

Issue Analytics

  • State: open
  • Created: a year ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

1 reaction
ppwwyyxx commented, Sep 1, 2022

FYI, the LayerNorm paper's section 6.7 talks about CNNs. Although it does not clearly say how it is applied to (N, C, H, W), the wording does give some hints:

With fully connected layers, all the hidden units in a layer tend to make similar contributions to the final prediction and re-centering and rescaling the summed inputs to a layer works well. However, the assumption of similar contributions is no longer true for convolutional neural networks. The large number of the hidden units whose receptive fields lie near the boundary of the image are rarely turned on and thus have very different statistics from the rest of the hidden units within the same layer.

My reading of it is that the “original” LayerNorm does normalize over (C, H, W) (and they think this might not be a good idea).

Although from today's Transformer point of view, H and W become the "sequence", and it then becomes natural to normalize only over the C dimension. And by the way, "positional normalization" (https://arxiv.org/pdf/1907.04312.pdf) seems to be the first work to formally name such an operation for CNNs.
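
For context, a small sketch of my own of that "normalize only on C" view for a channels_first feature map, done by moving C to the last position so nn.LayerNorm(C) can be used directly:

import torch
import torch.nn as nn

C = 96
ln = nn.LayerNorm(C)
x = torch.randn(2, C, 14, 14)                        # (N, C, H, W)
y = ln(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)    # normalize over C, back to (N, C, H, W)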

0 reactions
liuzhuang13 commented, Jul 7, 2022

If I understand correctly, normalizing over all C, H, W dimensions is equivalent to a GroupNorm with #groups=1. We haven't had a chance to try this, though. The PoolFormer paper uses this as their default.
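
A quick sketch (my own, with assumed shapes) of the equivalence mentioned above: GroupNorm with num_groups=1 normalizes each sample over (C, H, W), which matches LayerNorm over all non-batch dims when affine parameters are disabled:

import torch
import torch.nn as nn
import torch.nn.functional as F

C = 96
x = torch.randn(2, C, 14, 14)

y_gn = nn.GroupNorm(num_groups=1, num_channels=C, affine=False)(x)
y_ln = F.layer_norm(x, normalized_shape=x.shape[1:])

print(torch.allclose(y_gn, y_ln, atol=1e-5))  # expected: True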


