Large DeepLift delta with BERT explicit softmax init

See original GitHub issue

Hi all,

Thanks for all your amazing work on captum!

Upon modifying a huggingface/transformers BERT model to explicitly initialise softmax in __init__, as per the suggestion in https://github.com/pytorch/captum/issues/347#issuecomment-616864035, I see a massive increase in the magnitude of the DeepLift delta (delta goes from -1.9306 to -12386754. on the same inputs).
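
For concreteness, the change amounts to something like the sketch below (my paraphrase of the suggestion in #347, not the exact transformers source; the class and attribute names are illustrative and single-head for brevity): move the softmax from a functional call inside forward to a submodule created in __init__, so that DeepLift's module hooks can see it.

    import torch
    import torch.nn as nn


    class SelfAttentionWithModuleSoftmax(nn.Module):
        # Illustrative stand-in for transformers' BertSelfAttention; only the
        # softmax-related change is shown.
        def __init__(self, hidden_size: int) -> None:
            super().__init__()
            self.scale = hidden_size ** 0.5
            self.query = nn.Linear(hidden_size, hidden_size)
            self.key = nn.Linear(hidden_size, hidden_size)
            self.value = nn.Linear(hidden_size, hidden_size)
            # Explicitly initialise softmax as a submodule instead of calling
            # nn.functional.softmax inside forward.
            self.softmax = nn.Softmax(dim=-1)

        def forward(self, hidden_states):
            q = self.query(hidden_states)
            k = self.key(hidden_states)
            v = self.value(hidden_states)
            scores = torch.matmul(q, k.transpose(-1, -2)) / self.scale
            probs = self.softmax(scores)  # was: nn.functional.softmax(scores, dim=-1)
            return torch.matmul(probs, v)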

I appreciate that there are other issues with this model (e.g. hidden activations not being initialised). I’m not sure whether these play a part in the issue. I was hoping to isolate just the softmax in the first instance.

I have created a notebook to demonstrate the issue that uses a fork of the transformers repo. I’m not sure if this is the best way to share/demonstrate. Please let me know if there’s a more convenient method. https://colab.research.google.com/drive/1OB4kkTP4I6R9t4XtQFB6braL8cP83nX5?usp=sharing

It’s also maybe worth noting that the actual attributions, both before and after this softmax change, are quite misleading (especially in contrast to Integrated Gradients), though that’s not entirely unexpected given the other issues mentioned.
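
As a point of reference, one way to put the two methods side by side is to request the convergence delta from both. This assumes a model, inputs and baselines as in the toy example further down; for a full BERT model one would typically attribute on the embedding layer with the Layer* variants instead.

    from captum.attr import DeepLift, IntegratedGradients

    # `model`, `inputs` and `baselines` are assumed to be defined as in the
    # toy example below.
    dl = DeepLift(model)
    ig = IntegratedGradients(model)

    dl_attr, dl_delta = dl.attribute(
        inputs=inputs, baselines=baselines, target=0, return_convergence_delta=True
    )
    ig_attr, ig_delta = ig.attribute(
        inputs=inputs, baselines=baselines, target=0, return_convergence_delta=True
    )

    print("DeepLift delta:", dl_delta, "IntegratedGradients delta:", ig_delta)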

Any advice that you could share would be appreciated!

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 13 (7 by maintainers)

Top GitHub Comments

1 reaction
NarineK commented, Dec 16, 2020

@lannelin, I tried to reproduce it on a smaller example:

import torch
import torch.nn as nn

from captum.attr import DeepLift


class ReLUDeepLiftModel(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.relu1 = nn.ReLU()
        self.relu2 = nn.ReLU()
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, x1, x2, x3=2):
        return self.softmax(2 * self.relu1(x1) + x3 * self.relu2(x2 - 1.5))


x1 = torch.tensor([[1.0, 1.0, 0.0]], requires_grad=True)
x2 = torch.tensor([[2.0, 2.0, -1.0]], requires_grad=True)

b1 = torch.tensor([[0.0, 0.0, 0.0]], requires_grad=True)
b2 = torch.tensor([[0.0, 0.0, 0.0]], requires_grad=True)

inputs = (x1, x2)
baselines = (b1, b2)

model = ReLUDeepLiftModel()

dl = DeepLift(model)
attr = dl.attribute(inputs=inputs, baselines=baselines, target=0)

# Summation-to-delta check: the change in the target output between inputs and
# baselines should match the sum of the attributions.
model(x1, x2)[:, 0] - model(b1, b2)[:, 0], attr[0].sum() + attr[1].sum()

Summation-to-delta seems to work only if I use the nonlinear rule for softmax.
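
For reference, here is roughly how one could try the generic nonlinear rule for softmax. It pokes at captum internals (the SUPPORTED_NON_LINEAR registry in captum.attr._core.deep_lift); those names are an assumption about the captum version in use and may change between releases, so treat this as a sketch rather than a supported API.

    import torch.nn as nn

    # Assumption: captum keeps its DeepLift module-to-rule mapping in this
    # internal registry; names may differ across captum versions.
    from captum.attr._core import deep_lift

    # Replace the dedicated softmax rule with the generic nonlinear rule before
    # constructing DeepLift, so nn.Softmax is handled like ReLU/Tanh/etc.
    deep_lift.SUPPORTED_NON_LINEAR[nn.Softmax] = deep_lift.nonlinear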

1 reaction
lannelin commented, Dec 8, 2020

The delta is much lower using the model described above together with the normalization method suggested in your comment, @NarineK (https://github.com/pytorch/captum/issues/519#issuecomment-738580948); the delta is -8.5291, where it was previously 1294020. with the textattack/bert-base-uncased-imdb model that has 12 hidden layers. The attributions also make much more sense and have a similar ranking to IntegratedGradients.

I also tried training a model in an identical fashion but with 3 hidden layers, and the delta increases. I suspect that the issue is magnified at each hidden layer the input passes through, probably due to the softmax?

I’ve made the 1-hidden-layer model available here if that’s useful.
