Why inplace operation in `AdaptiveSoftmax`
I used log_output for training, but PyTorch complains about an in-place operation on backward.
For lines 204 and 205: https://github.com/pytorch/fairseq/blob/e6422528dae0b899848469efe2dc404c1e639ce9/fairseq/modules/adaptive_softmax.py#L200-L208
- why do a `copy_` on `tail_out`?
- why an in-place `add_`?
I changed it to:

tail_output = ...
... = self.lsm(tail_out) + tail_priors[idxs, i, None]

and it works for training.
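For context, here is a self-contained sketch of the two variants. The indexing expressions mirror the ones quoted in this thread (`self.tail[i](input[idxs])`, `self.lsm`, `tail_priors[idxs, i, None]`, `log_probs[idxs, start:end]`), but the surrounding names and sizes (`tail_i`, `hidden`, `vocab`, and so on) are made-up stand-ins, and the in-place half is only a paraphrase of the linked lines, not a verbatim quote:

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins; only the shapes matter here.
n, dim = 32, 64                        # batch of hidden states, model dimension
start, end, vocab = 100, 400, 1000     # span of one tail cluster inside the vocabulary
i = 0                                  # tail-cluster index
tail_i = nn.Linear(dim, end - start)   # stand-in for self.tail[i]
lsm = nn.LogSoftmax(dim=-1)            # stand-in for self.lsm
hidden = torch.randn(n, dim, requires_grad=True)   # stand-in for input
idxs = torch.rand(n) < 0.5             # stand-in for the boolean target mask
log_probs = torch.zeros(n, vocab)
tail_priors = torch.randn(n, 3)

# In-place pattern (a paraphrase of the lines referenced above):
tail_out = log_probs[idxs, start:end]
tail_out.copy_(tail_i(hidden[idxs]))
log_probs[idxs, start:end] = lsm(tail_out).add_(tail_priors[idxs, i, None])

# Out-of-place change that trains without the autograd complaint:
tail_out = tail_i(hidden[idxs])
log_probs[idxs, start:end] = lsm(tail_out) + tail_priors[idxs, i, None]
```

Only the forward pass is shown; the difference between the two variants only matters once backward is called.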
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
… training mode. In a `torch.no_grad` env, the memory overhead should be equal to the size of `tail_priors[idxs, i, None]` (temporarily, for the out-of-place `add`) (edit: plus the size of `log_probs[idxs, start:end]`, re-used as storage for `self.tail[i](input[idxs])`).

Actually, the `add` operation does not require saving its operands for future BP, but PyTorch does raise an error for any in-place operator, because using in-place operations is quite dangerous for chain-rule BP. I recommend changing to an out-of-place version; I can open a PR and give a simple comparison in terms of memory.
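To make the failure mode concrete, here is a standalone toy repro (plain PyTorch, not fairseq code; shapes are arbitrary). `add_` mutates the `LogSoftmax` output in place, and that output is exactly the tensor its backward formula needs, so autograd's saved-tensor version check trips:

```python
import torch

lsm = torch.nn.LogSoftmax(dim=-1)
x = torch.randn(8, 16, requires_grad=True)   # arbitrary logits
prior = torch.randn(8, 1)                    # arbitrary prior column, broadcast over the last dim

# In-place add on the LogSoftmax output: forward runs, backward trips the version check.
try:
    y = lsm(x).add_(prior)
    y.sum().backward()
except RuntimeError as err:
    print("in-place version fails in backward:", err)

# Out-of-place add: a fresh result tensor, backward is fine.
x.grad = None
y = lsm(x) + prior
y.sum().backward()
print("out-of-place version OK, grad norm:", x.grad.norm().item())
```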
If the memory saving is very critical, then I would suggest implementing a custom `Function` to bypass the in-place check mechanism.
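A rough sketch of that idea, with entirely hypothetical names and not taken from fairseq: fuse the log-softmax and the prior addition into a single `torch.autograd.Function`, so the in-place `add_` happens inside `forward()` (where autograd does not record it) and the backward formula is written by hand:

```python
import torch

class LogSoftmaxAddPrior(torch.autograd.Function):
    """Hypothetical fused op: log_softmax over the last dim, then an in-place add of a
    broadcast prior. The in-place add runs inside forward(), where autograd does not
    track it, so no saved-tensor version check can fail; backward is derived by hand."""

    @staticmethod
    def forward(ctx, x, prior):
        out = torch.log_softmax(x, dim=-1)
        out.add_(prior)                      # in-place, invisible to autograd
        ctx.save_for_backward(out, prior)    # pre-add value is recoverable as out - prior
        return out

    @staticmethod
    def backward(ctx, grad_out):
        out, prior = ctx.saved_tensors
        lsm_out = out - prior                # log_softmax output before the add
        # backward of log_softmax: g - softmax(x) * g.sum(-1, keepdim=True)
        grad_x = grad_out - lsm_out.exp() * grad_out.sum(dim=-1, keepdim=True)
        # prior broadcasts over the last dim, so its grad sums over that dim
        grad_prior = grad_out.sum(dim=-1, keepdim=True)
        return grad_x, grad_prior


# Quick numerical check of the handwritten backward (double precision for gradcheck):
x = torch.randn(5, 7, dtype=torch.double, requires_grad=True)
p = torch.randn(5, 1, dtype=torch.double, requires_grad=True)
torch.autograd.gradcheck(LogSoftmaxAddPrior.apply, (x, p))
```

Whether the extra maintenance is worth it compared with the out-of-place one-liner is exactly the memory trade-off discussed above.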
Closing this issue after a prolonged period of inactivity. If this issue is still present in the latest release, please create a new issue with up-to-date information. Thank you!