`use_linear_attn = True` produces noise and an unstable loss
See original GitHub issue. After moving from v0.0.60 to v0.1.10, I found the Imagen loss is unstable in the early training steps and the results are noisy from the early stages.
The problem goes away when I set `use_linear_attn = False`.
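For reference, a minimal sketch of the reporter's workaround: disable linear attention when constructing the Unet. Only `use_linear_attn` comes from the issue itself; the other keyword arguments and values are illustrative and may differ between imagen-pytorch versions.

```python
from imagen_pytorch import Unet

# Illustrative Unet construction; exact kwargs depend on your imagen-pytorch version.
unet = Unet(
    dim = 128,
    dim_mults = (1, 2, 4, 8),
    use_linear_attn = False,   # the flag reported to cause noisy samples and unstable loss when True
)
```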
Issue Analytics
- Created: a year ago
- Reactions: 1
- Comments: 5 (3 by maintainers)
Top Results From Across the Web
- How to correct unstable loss and accuracy during training: "Data augmentation - I would try this one first, mostly because of curiosity. As your features are continuous you may want to add..."
- Why does the loss/accuracy fluctuate during the training: "There are several reasons that can cause fluctuations in training loss over epochs. The main one though is the fact that almost all..."
- Interpreting Loss Curves | Machine Learning: "If training looks unstable, as in this plot, then reduce your learning rate to prevent the model from bouncing around in parameter space..."
- A Gentle Introduction to Exploding Gradients in Neural Networks: "The model is unable to get traction on your training data (e.g. poor loss). The model is unstable, resulting in large changes in..."
- Very volatile validation loss - Deep Learning - Fast.ai forums: "Hi, I'm training a dense CNN model and noticed that if I pick too high of a learning rate I get better validation..."
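Several of the results above converge on the same two levers: lower the learning rate and clip gradients. A generic PyTorch sketch of both follows; the toy model stands in for the real Imagen training loop and the values are illustrative, not prescribed by the issue.

```python
import torch
import torch.nn as nn

# Toy setup standing in for the actual training loop.
model = nn.Linear(16, 1)
criterion = nn.MSELoss()
# A smaller learning rate often smooths an oscillating loss curve.
optimizer = torch.optim.Adam(model.parameters(), lr = 1e-4)

x, y = torch.randn(32, 16), torch.randn(32, 1)

optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
# Clip the global gradient norm to damp exploding-gradient spikes.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm = 1.0)
optimizer.step()
```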
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
@ken012git thank you for the experiments! basically, in a lot of papers, researchers remove attention past a certain token length (1024 or 2048) since it is prohibitively expensive due to the quadratic compute. but i like to substitute them with linear attention, even if it is a bit weaker. my favorite linear attention remains https://arxiv.org/abs/1812.01243 , and here i am also giving it a depthwise conv recommended by the primer paper
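For context, a minimal sketch of the kind of block described in that comment: the efficient attention of https://arxiv.org/abs/1812.01243 (softmax over keys along the sequence dimension and over queries along the feature dimension, so compute scales linearly with sequence length), with a depthwise convolution on the q/k/v projections as recommended by the Primer paper. Shapes, kernel sizes, and defaults here are illustrative, not the exact imagen-pytorch implementation.

```python
import torch
import torch.nn as nn

class LinearAttention(nn.Module):
    """Sketch of efficient (linear) attention per arXiv:1812.01243,
    with a Primer-style depthwise conv on the q/k/v projections."""

    def __init__(self, dim, heads = 8, dim_head = 32):
        super().__init__()
        inner = heads * dim_head
        self.heads, self.dim_head = heads, dim_head
        self.to_qkv = nn.Conv1d(dim, inner * 3, 1, bias = False)
        # depthwise conv (kernel 3) over the sequence dimension, per Primer
        self.dw_conv = nn.Conv1d(inner * 3, inner * 3, 3, padding = 1, groups = inner * 3)
        self.to_out = nn.Conv1d(inner, dim, 1)

    def forward(self, x):                                  # x: (batch, dim, seq_len)
        b, _, n = x.shape
        qkv = self.dw_conv(self.to_qkv(x))
        q, k, v = qkv.chunk(3, dim = 1)
        # split heads -> (batch, heads, dim_head, seq_len)
        q, k, v = map(lambda t: t.reshape(b, self.heads, self.dim_head, n), (q, k, v))
        q = q.softmax(dim = -2)                            # softmax over the feature dimension
        k = k.softmax(dim = -1)                            # softmax over the sequence dimension
        # (dim_head x dim_head) context: cost is linear in seq_len
        context = torch.einsum('bhdn,bhen->bhde', k, v)
        out = torch.einsum('bhde,bhdn->bhen', context, q)  # back to (batch, heads, dim_head, seq_len)
        return self.to_out(out.reshape(b, -1, n))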
@ken012git forgot the residual 🤦 and also needed a feedforward after it anyways
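A hedged sketch of the fix described in that last comment: wrap the linear-attention block in a residual connection and follow it with a feedforward block. `LinearAttention` refers to the sketch above; `Residual` and `FeedForward` are generic helpers written here for illustration, not the exact imagen-pytorch modules.

```python
import torch.nn as nn

class Residual(nn.Module):
    def __init__(self, fn):
        super().__init__()
        self.fn = fn

    def forward(self, x):
        return self.fn(x) + x          # the residual connection that was originally missing

def FeedForward(dim, mult = 2):
    hidden = dim * mult
    return nn.Sequential(
        nn.Conv1d(dim, hidden, 1),
        nn.GELU(),
        nn.Conv1d(hidden, dim, 1),
    )

# Linear attention followed by a feedforward, both wrapped in residuals.
block = nn.Sequential(
    Residual(LinearAttention(dim = 128)),
    Residual(FeedForward(128)),
)
```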