Large DeepLift delta with BERT explicit softmax init
Hi all,
Thanks for all your amazing work on captum!
Upon modifying a huggingface/transformers BERT model to explicitly initialise softmax in __init__, as per the suggestion in https://github.com/pytorch/captum/issues/347#issuecomment-616864035, I see a massive increase in the magnitude of the DeepLift delta (the delta goes from -1.9306 to -12386754. on the same inputs).
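To make the shape of that change concrete, here is a toy stand-in (purely illustrative, not the actual transformers fork): a softmax applied via F.softmax inside forward() is invisible to DeepLift's module hooks, whereas an nn.Softmax registered in __init__ is a module the hooks can attach to.

```python
import torch.nn as nn
import torch.nn.functional as F


class ToyAttentionFunctional(nn.Module):
    """Softmax applied functionally inside forward(): DeepLift's module
    hooks never see it, so no softmax-specific rule can be applied."""

    def __init__(self, dim):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)

    def forward(self, x):
        scores = self.query(x) @ self.key(x).transpose(-1, -2)
        return F.softmax(scores, dim=-1) @ x


class ToyAttentionModule(nn.Module):
    """Same computation, but the softmax is an nn.Softmax initialised in
    __init__, which is what the suggestion in #347 amounts to."""

    def __init__(self, dim):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, x):
        scores = self.query(x) @ self.key(x).transpose(-1, -2)
        return self.softmax(scores) @ x
```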
I appreciate that there are other issues with this model (e.g. hidden activations not being initialised). I’m not sure whether these play a part in the issue. I was hoping to isolate just the softmax in the first instance.
I have created a notebook to demonstrate the issue that uses a fork of the transformers repo. I’m not sure if this is the best way to share/demonstrate. Please let me know if there’s a more convenient method. https://colab.research.google.com/drive/1OB4kkTP4I6R9t4XtQFB6braL8cP83nX5?usp=sharing
It’s also maybe worth noting that the actual attributions, both before and after this softmax change, are quite misleading (especially in contrast to Integrated Gradients), though that is not entirely unexpected given the other issues mentioned above.
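For reference, the kind of side-by-side comparison I mean looks roughly like the sketch below, on a small stand-in model rather than the BERT setup from the notebook:

```python
# Rough sketch: compare DeepLift and Integrated Gradients attributions
# and convergence deltas on a toy model (illustrative only).
import torch
import torch.nn as nn
from captum.attr import DeepLift, IntegratedGradients

model = nn.Sequential(
    nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 4), nn.Softmax(dim=-1)
)
model.eval()

inputs = torch.randn(3, 16)
baselines = torch.zeros_like(inputs)

dl_attr, dl_delta = DeepLift(model).attribute(
    inputs, baselines, target=0, return_convergence_delta=True
)
ig_attr, ig_delta = IntegratedGradients(model).attribute(
    inputs, baselines, target=0, return_convergence_delta=True
)

# Compare per-feature rankings of the two methods for the first example.
print(dl_attr[0].argsort(descending=True))
print(ig_attr[0].argsort(descending=True))
print(dl_delta, ig_delta)
```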
Any advice that you could share would be appreciated!
@lannelin, I tried to reproduce it on a smaller example: summation to delta seems to work only if I use the nonlinear rule for the softmax.
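That override amounts to something like the sketch below. It pokes at the internal SUPPORTED_NON_LINEAR mapping in captum.attr._core.deep_lift, which is an implementation detail and may differ between Captum releases, so treat this as a sketch rather than a supported API:

```python
# Force DeepLift to treat nn.Softmax with the generic "nonlinear" rescale
# rule instead of its dedicated softmax rule (implementation-detail hack).
import torch
import torch.nn as nn
from captum.attr import DeepLift
from captum.attr._core import deep_lift as dl

# Toy model ending in a softmax registered as a module in __init__.
model = nn.Sequential(nn.Linear(8, 4), nn.Softmax(dim=-1))
model.eval()

inputs = torch.randn(2, 8)
baselines = torch.zeros_like(inputs)

# Swap the rule used for nn.Softmax, attribute, then restore the default.
original_rule = dl.SUPPORTED_NON_LINEAR[nn.Softmax]
dl.SUPPORTED_NON_LINEAR[nn.Softmax] = dl.nonlinear

attributions, delta = DeepLift(model).attribute(
    inputs, baselines, target=0, return_convergence_delta=True
)
print(delta)

dl.SUPPORTED_NON_LINEAR[nn.Softmax] = original_rule  # restore default rule
```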
The delta is much lower using the model described above together with the normalization method suggested in your comment @NarineK (https://github.com/pytorch/captum/issues/519#issuecomment-738580948): with the textattack/bert-base-uncased-imdb model, which has 12 hidden layers, the delta is now -8.5291 where it was previously 1294020. The attributions also make much more sense and have a similar ranking to IntegratedGradients. I also tried training a model in an identical fashion but with 3 hidden layers, and the delta increases. I suspect that the issue is magnified at each hidden layer it passes through, probably due to the softmax?
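One way to check that suspicion would be something like the following sketch, which stacks an increasing number of softmax-containing blocks (a toy stand-in, not the BERT models discussed above) and watches how the convergence delta behaves with depth:

```python
# Sketch: does the DeepLift convergence delta grow as more
# softmax-containing layers are stacked? (toy model, illustrative only)
import torch
import torch.nn as nn
from captum.attr import DeepLift


def make_block(dim):
    # A small block containing a softmax, standing in for one hidden layer.
    return nn.Sequential(nn.Linear(dim, dim), nn.Softmax(dim=-1))


dim = 16
inputs = torch.randn(4, dim)
baselines = torch.zeros_like(inputs)

for n_layers in (1, 3, 6, 12):
    model = nn.Sequential(
        *[make_block(dim) for _ in range(n_layers)], nn.Linear(dim, 2)
    )
    model.eval()
    _, delta = DeepLift(model).attribute(
        inputs, baselines, target=0, return_convergence_delta=True
    )
    print(n_layers, delta.abs().max().item())
```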
I’ve made the 1-hidden-layer model available here if that’s useful.