
Request for help with LSHSelfAttention()

See original GitHub issue

Hi @lucidrains, thank you for your excellent work (I starred it).

I am trying to use the LSHSelfAttention() layer in my network instead of my transformer encoder layer.

Pseudocode of what I am doing:

word_embeddings = self.word_embedding(input_ids)         # (batch, seq_len, emb_dim)
lsh_encoded = self.lsh_self_attention(word_embeddings)   # (batch, seq_len, emb_dim)

I continuously get a vector of NaN values. To avoid it I decreased my learning rate from 1e-3 to 1e-5, but nothing changed.

  1. Am I using the correct layer?
  2. Should I use Reformer() instead of LSHSelfAttention()? I tried to use Reformer(), but I get an error there telling me that my sequence length must be divisible by the number of buckets (I’m still working on it).
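
For reference, a minimal standalone sketch of LSHSelfAttention() along the lines of the reformer_pytorch README; the concrete sizes (dim, heads, bucket_size, seq_len) are only illustrative, and the sequence length is chosen to divide evenly into buckets, which, as far as I understand, is the constraint behind the error in question 2:

import torch
from reformer_pytorch import LSHSelfAttention

# Illustrative sizes; dim must match the embedding dimension of the inputs.
attn = LSHSelfAttention(
    dim = 128,         # emb_dim of the word embeddings
    heads = 8,
    bucket_size = 64,  # seq_len should be a multiple of 2 * bucket_size
    n_hashes = 8,
    causal = False
)

x = torch.randn(1, 1024, 128)  # (batch, seq_len, emb_dim)
out = attn(x)                  # (batch, seq_len, emb_dim)

If the NaNs persist even at lr = 1e-5, it is worth checking whether mixed precision (see the amp_level discussion in the comments below) rather than the attention layer itself is the culprit.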

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 22 (13 by maintainers)

Top GitHub Comments

1 reaction
lucidrains commented, Dec 11, 2020

Woohoo! Congrats 💯

1 reaction
andreabac3 commented, Dec 10, 2020

@lucidrains I am using pytorch_lightning==0.8.5, so, looking at the default parameters of the Trainer() class, I suppose it’s set to O2:

amp_level: str = 'O2',   # backward compatible, todo: remove in v1.0.0

which is located in $HOME/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py

PS: It works with lr = 1e-3 without problems.
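
For what it’s worth, a minimal sketch of how that default could be overridden explicitly, assuming the pytorch_lightning 0.8.x Trainer API quoted above (where, as I read it, Apex mixed precision only kicks in when precision=16):

from pytorch_lightning import Trainer

# Train in full fp32 precision, ruling out O2 mixed precision
# as the source of the NaNs (32 is the default for precision):
trainer = Trainer(precision=32)

# Or keep 16-bit training but with the more conservative O1 amp level:
# trainer = Trainer(precision=16, amp_level='O1')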

Read more comments on GitHub.

Top Results From Across the Web

Illustrated: Self-Attention
This article walks you through the mathematical operations in a self-attention module. Includes illustrations and code.

Complete Self-Attention from Scratch
This vignette describes how to implement the attention mechanism, which forms the basis of transformers, in the…

tf.keras.layers.Attention | TensorFlow v2.11.0
Python boolean indicating whether the layer should behave in training mode (adding dropout) or in inference mode (no dropout). use_causal_mask…

Attention? Attention! - Lil'Log
Self-attention, also known as intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a…

The Transformer Attention Mechanism
Self-attention, sometimes called intra-attention, is an attention mechanism relating different positions of a single sequence in order to…
