
Why does ALBERT use einsum in the PyTorch implementation while the TF one does not?

See original GitHub issue

❓ Questions & Help

I wanted to learn the internals of the ALBERT model from your implementation (which is, BTW, really clean compared to the original one - good job!), but I’ve stumbled upon a weird-looking part in AlbertAttention: https://github.com/huggingface/transformers/blob/6af3306a1da0322f58861b1fbb62ce5223d97b8a/src/transformers/modeling_albert.py#L258

Why does the PyTorch version use an einsum-based notation to calculate the hidden state (with manual use of the dense layer’s weights), while the TensorFlow version just reshapes the context_layer and does a standard “forward” on the dense layer?

https://github.com/huggingface/transformers/blob/6af3306a1da0322f58861b1fbb62ce5223d97b8a/src/transformers/modeling_tf_albert.py#L296

I would really like to know the explanation for this implementation - @LysandreJik, could you shed some light here?
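
For context, here is a minimal sketch (made-up sizes, not the library code) of what that einsum contraction computes; the letters are usually read as b = batch, f = from-sequence length, n = attention heads, d = head size, h = hidden size:

    import torch

    b, f, n, d = 2, 5, 4, 16     # batch, sequence length, heads, head size (made up)
    h = n * d                    # hidden size

    context_layer = torch.randn(b, f, n, d)   # per-head attention output
    w = torch.randn(n, d, h)                  # dense weight viewed per head
    bias = torch.randn(h)

    # Sum over the n and d axes: every output feature mixes all heads.
    projected = torch.einsum("bfnd,ndh->bfh", context_layer, w) + bias
    print(projected.shape)  # torch.Size([2, 5, 64])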

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 8 (6 by maintainers)

Top GitHub Comments

1 reaction
LysandreJik commented, Oct 7, 2020

For no particular reason, but it might not have been the best choice according to this thread on performance.
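
For anyone who wants to check the performance question themselves, here is a rough micro-benchmark sketch using torch.utils.benchmark (made-up sizes; not taken from the linked thread):

    import torch
    import torch.nn.functional as F
    from torch.utils import benchmark

    b, f, n, d = 8, 128, 12, 64    # made-up BERT-base-like sizes
    h = n * d
    x = torch.randn(b, f, n, d)
    w = torch.randn(h, h)          # dense weight, (out_features, in_features)
    bias = torch.randn(h)
    w_heads = w.t().view(n, d, h)  # per-head view used by the einsum form

    t_einsum = benchmark.Timer(
        stmt='torch.einsum("bfnd,ndh->bfh", x, w_heads) + bias',
        globals={"torch": torch, "x": x, "w_heads": w_heads, "bias": bias},
    )
    t_linear = benchmark.Timer(
        stmt="F.linear(x.reshape(b, f, h), w, bias)",
        globals={"F": F, "x": x, "w": w, "bias": bias, "b": b, "f": f, "h": h},
    )
    print(t_einsum.timeit(100))
    print(t_linear.timeit(100))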

1 reaction
ZhuBaohe commented, May 7, 2020

The two implementations are equivalent, but the PyTorch version is cumbersome. I think the code

        # context_layer: (batch, seq_len, num_heads, head_size) after the permute
        context_layer = context_layer.permute(0, 2, 1, 3).contiguous()

        # Should find a better way to do this
        # View the dense weight as (num_heads, head_size, hidden_size) so it can
        # be contracted against the per-head context directly.
        w = (
            self.dense.weight.t()
            .view(self.num_attention_heads, self.attention_head_size, self.hidden_size)
            .to(context_layer.dtype)
        )
        b = self.dense.bias.to(context_layer.dtype)

        # Contract over the head and head-size axes: (batch, seq_len, hidden_size)
        projected_context_layer = torch.einsum("bfnd,ndh->bfh", context_layer, w) + b

should be rewritten as

        context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
        # Merge the head and head-size axes back into hidden_size ...
        new_shape = context_layer.size()[:-2] + (-1,)
        context_layer = context_layer.view(*new_shape)

        # ... and let the dense layer do the projection, as in the TF version.
        projected_context_layer = self.dense(context_layer)
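
As a sanity check, here is a minimal, self-contained sketch (made-up sizes, with a standalone nn.Linear named dense standing in for self.dense) verifying that the einsum projection and the reshape-plus-dense projection produce the same result:

    import torch
    import torch.nn as nn

    b, f, n, d = 2, 5, 4, 16   # made-up sizes
    h = n * d
    dense = nn.Linear(h, h)    # stands in for self.dense
    context_layer = torch.randn(b, f, n, d)

    # Einsum form, as in the PyTorch snippet above
    w = dense.weight.t().view(n, d, h)
    via_einsum = torch.einsum("bfnd,ndh->bfh", context_layer, w) + dense.bias

    # Reshape + dense form, as in the proposed rewrite / the TF version
    via_dense = dense(context_layer.reshape(b, f, h))

    print(torch.allclose(via_einsum, via_dense, atol=1e-5))  # True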

