
Why does ALBERT use einsum in the PyTorch implementation while the TF one does not?

See original GitHub issue

❓ Questions & Help

I wanted to learn the internals of the ALBERT model from your implementation (which is, BTW, really clean compared to the original one - good job!), but I’ve stumbled upon a weird-looking part in AlbertAttention: https://github.com/huggingface/transformers/blob/6af3306a1da0322f58861b1fbb62ce5223d97b8a/src/transformers/modeling_albert.py#L258

Why does the PyTorch version use an einsum-based notation to calculate the hidden state (with manual use of the dense layer’s weights), while the TensorFlow version just reshapes the context_layer and does a standard “forward” on the dense layer?

https://github.com/huggingface/transformers/blob/6af3306a1da0322f58861b1fbb62ce5223d97b8a/src/transformers/modeling_tf_albert.py#L296

I would really like to know the explanation for this implementation - @LysandreJik, could you shed some light here?
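
For context, here is a minimal sketch (made-up sizes, not the library code) of what that einsum contraction computes; the letters are usually read as b = batch, f = from-sequence length, n = attention heads, d = head size, h = hidden size:

    import torch

    b, f, n, d = 2, 5, 4, 16     # batch, sequence length, heads, head size (made up)
    h = n * d                    # hidden size

    context_layer = torch.randn(b, f, n, d)   # per-head attention output
    w = torch.randn(n, d, h)                  # dense weight viewed per head
    bias = torch.randn(h)

    # Sum over the n and d axes: every output feature mixes all heads.
    projected = torch.einsum("bfnd,ndh->bfh", context_layer, w) + bias
    print(projected.shape)  # torch.Size([2, 5, 64])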

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 8 (6 by maintainers)

Top GitHub Comments

1 reaction
LysandreJik commented, Oct 7, 2020

For no particular reason, but it might not have been the best choice according to this thread on performance.
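
For anyone who wants to check the performance question themselves, here is a rough micro-benchmark sketch using torch.utils.benchmark (made-up sizes; not taken from the linked thread):

    import torch
    import torch.nn.functional as F
    from torch.utils import benchmark

    b, f, n, d = 8, 128, 12, 64    # made-up BERT-base-like sizes
    h = n * d
    x = torch.randn(b, f, n, d)
    w = torch.randn(h, h)          # dense weight, (out_features, in_features)
    bias = torch.randn(h)
    w_heads = w.t().view(n, d, h)  # per-head view used by the einsum form

    t_einsum = benchmark.Timer(
        stmt='torch.einsum("bfnd,ndh->bfh", x, w_heads) + bias',
        globals={"torch": torch, "x": x, "w_heads": w_heads, "bias": bias},
    )
    t_linear = benchmark.Timer(
        stmt="F.linear(x.reshape(b, f, h), w, bias)",
        globals={"F": F, "x": x, "w": w, "bias": bias, "b": b, "f": f, "h": h},
    )
    print(t_einsum.timeit(100))
    print(t_linear.timeit(100))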

1 reaction
ZhuBaohe commented, May 7, 2020

The two implementations are equivalent, but the PyTorch version is cumbersome. I think the code

        # context_layer: (batch, seq_len, num_heads, head_size) after the permute
        context_layer = context_layer.permute(0, 2, 1, 3).contiguous()

        # Should find a better way to do this
        # View the dense weight as (num_heads, head_size, hidden_size) so it can
        # be contracted against the per-head context directly.
        w = (
            self.dense.weight.t()
            .view(self.num_attention_heads, self.attention_head_size, self.hidden_size)
            .to(context_layer.dtype)
        )
        b = self.dense.bias.to(context_layer.dtype)

        # Contract over the head and head-size axes: (batch, seq_len, hidden_size)
        projected_context_layer = torch.einsum("bfnd,ndh->bfh", context_layer, w) + b

should be rewritten as

        context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
        # Merge the head and head-size axes back into hidden_size ...
        new_shape = context_layer.size()[:-2] + (-1,)
        context_layer = context_layer.view(*new_shape)

        # ... and let the dense layer do the projection, as in the TF version.
        projected_context_layer = self.dense(context_layer)
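
As a sanity check, here is a minimal, self-contained sketch (made-up sizes, with a standalone nn.Linear named dense standing in for self.dense) verifying that the einsum projection and the reshape-plus-dense projection produce the same result:

    import torch
    import torch.nn as nn

    b, f, n, d = 2, 5, 4, 16   # made-up sizes
    h = n * d
    dense = nn.Linear(h, h)    # stands in for self.dense
    context_layer = torch.randn(b, f, n, d)

    # Einsum form, as in the PyTorch snippet above
    w = dense.weight.t().view(n, d, h)
    via_einsum = torch.einsum("bfnd,ndh->bfh", context_layer, w) + dense.bias

    # Reshape + dense form, as in the proposed rewrite / the TF version
    via_dense = dense(context_layer.reshape(b, f, h))

    print(torch.allclose(via_einsum, via_dense, atol=1e-5))  # True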

