Why inplace operation in `AdaptiveSoftmax`
I used log_output for training, but PyTorch complains about an in-place operation on backward.
For lines 204 and 205: https://github.com/pytorch/fairseq/blob/e6422528dae0b899848469efe2dc404c1e639ce9/fairseq/modules/adaptive_softmax.py#L200-L208
- why do a `copy_` on `tail_out`?
- why an in-place `add_`?
I changed it to:

tail_output = ...
... = self.lsm(tail_out) + tail_priors[idxs, i, None]

and it works for training.
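For context, here is a self-contained sketch of the two variants. The indexing expressions mirror the ones quoted in this thread (`self.tail[i](input[idxs])`, `self.lsm`, `tail_priors[idxs, i, None]`, `log_probs[idxs, start:end]`), but the surrounding names and sizes (`tail_i`, `hidden`, `vocab`, and so on) are made-up stand-ins, and the in-place half is only a paraphrase of the linked lines, not a verbatim quote:

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins; only the shapes matter here.
n, dim = 32, 64                        # batch of hidden states, model dimension
start, end, vocab = 100, 400, 1000     # span of one tail cluster inside the vocabulary
i = 0                                  # tail-cluster index
tail_i = nn.Linear(dim, end - start)   # stand-in for self.tail[i]
lsm = nn.LogSoftmax(dim=-1)            # stand-in for self.lsm
hidden = torch.randn(n, dim, requires_grad=True)   # stand-in for input
idxs = torch.rand(n) < 0.5             # stand-in for the boolean target mask
log_probs = torch.zeros(n, vocab)
tail_priors = torch.randn(n, 3)

# In-place pattern (a paraphrase of the lines referenced above):
tail_out = log_probs[idxs, start:end]
tail_out.copy_(tail_i(hidden[idxs]))
log_probs[idxs, start:end] = lsm(tail_out).add_(tail_priors[idxs, i, None])

# Out-of-place change that trains without the autograd complaint:
tail_out = tail_i(hidden[idxs])
log_probs[idxs, start:end] = lsm(tail_out) + tail_priors[idxs, i, None]
```

Only the forward pass is shown; the difference between the two variants only matters once backward is called.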
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
… training mode. In a `torch.no_grad` env, the memory overhead should be equal to the size of `tail_priors[idxs, i, None]` (temporarily, for the out-of-place `add`) (edit: plus the size of `log_probs[idxs, start:end]`, re-used as storage for `self.tail[i](input[idxs])`).

Actually, the `add` operation does not require saving its operands for future BP, but PyTorch does raise an error for any in-place operator, because using in-place operations is quite dangerous for chain-rule BP. I recommend changing to an out-of-place version; I can open a PR and give a simple comparison in terms of memory.
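To make the failure mode concrete, here is a standalone toy repro (plain PyTorch, not fairseq code; shapes are arbitrary). `add_` mutates the `LogSoftmax` output in place, and that output is exactly the tensor its backward formula needs, so autograd's saved-tensor version check trips:

```python
import torch

lsm = torch.nn.LogSoftmax(dim=-1)
x = torch.randn(8, 16, requires_grad=True)   # arbitrary logits
prior = torch.randn(8, 1)                    # arbitrary prior column, broadcast over the last dim

# In-place add on the LogSoftmax output: forward runs, backward trips the version check.
try:
    y = lsm(x).add_(prior)
    y.sum().backward()
except RuntimeError as err:
    print("in-place version fails in backward:", err)

# Out-of-place add: a fresh result tensor, backward is fine.
x.grad = None
y = lsm(x) + prior
y.sum().backward()
print("out-of-place version OK, grad norm:", x.grad.norm().item())
```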
If the memory saving is very critical, then I would suggest implementing a custom `Function` to bypass the in-place check mechanism.
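A rough sketch of that idea, with entirely hypothetical names and not taken from fairseq: fuse the log-softmax and the prior addition into a single `torch.autograd.Function`, so the in-place `add_` happens inside `forward()` (where autograd does not record it) and the backward formula is written by hand:

```python
import torch

class LogSoftmaxAddPrior(torch.autograd.Function):
    """Hypothetical fused op: log_softmax over the last dim, then an in-place add of a
    broadcast prior. The in-place add runs inside forward(), where autograd does not
    track it, so no saved-tensor version check can fail; backward is derived by hand."""

    @staticmethod
    def forward(ctx, x, prior):
        out = torch.log_softmax(x, dim=-1)
        out.add_(prior)                      # in-place, invisible to autograd
        ctx.save_for_backward(out, prior)    # pre-add value is recoverable as out - prior
        return out

    @staticmethod
    def backward(ctx, grad_out):
        out, prior = ctx.saved_tensors
        lsm_out = out - prior                # log_softmax output before the add
        # backward of log_softmax: g - softmax(x) * g.sum(-1, keepdim=True)
        grad_x = grad_out - lsm_out.exp() * grad_out.sum(dim=-1, keepdim=True)
        # prior broadcasts over the last dim, so its grad sums over that dim
        grad_prior = grad_out.sum(dim=-1, keepdim=True)
        return grad_x, grad_prior


# Quick numerical check of the handwritten backward (double precision for gradcheck):
x = torch.randn(5, 7, dtype=torch.double, requires_grad=True)
p = torch.randn(5, 1, dtype=torch.double, requires_grad=True)
torch.autograd.gradcheck(LogSoftmaxAddPrior.apply, (x, p))
```

Whether the extra maintenance is worth it compared with the out-of-place one-liner is exactly the memory trade-off discussed above.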
Closing this issue after a prolonged period of inactivity. If this issue is still present in the latest release, please create a new issue with up-to-date information. Thank you!