[Image captioning] pack_padded_sequence wrong lengths
In the Image Captioning tutorial, in the `DecoderRNN`:
```python
from torch.nn.utils.rnn import pack_padded_sequence
import torch

def forward(self, features, captions, lengths):
    embeddings = self.embed(captions)
    # Prepend the image features as the first timestep of each sequence
    embeddings = torch.cat((features.unsqueeze(1), embeddings), 1)
    packed = pack_padded_sequence(embeddings, lengths, batch_first=True)
    hiddens, _ = self.lstm(packed)
    # hiddens is a PackedSequence; hiddens[0] is its flattened data tensor
    outputs = self.linear(hiddens[0])
    return outputs
```
Shouldn't the `lengths` passed to `pack_padded_sequence` be `lengths + 1`, to account for the extra timestep added by concatenating the features?
e.g. (assuming the numbers are caption token indices, `e` is the embedding, `f` are the features, and `batch_size = 4`), if

```
embeds:
e(126)  e(1214)  e(14)    e(4033)
e(126)  e(6)     e(84)    e(4033)
e(126)  e(3002)  e(4033)  e(0)
e(126)  e(3002)  e(4033)  e(0)
```

has `lengths = [4, 4, 3, 3]`, then

```
embeds_cat:
f_0  e(126)  e(1214)  e(14)    e(4033)
f_1  e(126)  e(6)     e(84)    e(4033)
f_2  e(126)  e(3002)  e(4033)  e(0)
f_3  e(126)  e(3002)  e(4033)  e(0)
```

should have `lengths = [5, 5, 4, 4]`, right?
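
As a quick check, here is a minimal sketch (toy tensors with made-up values, not the tutorial's code) of what each choice of `lengths` makes `pack_padded_sequence` actually feed to the LSTM:

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence

# Toy batch: 4 sequences of 5 timesteps (feature + 4 caption embeddings),
# embedding size 1 so the packed data is easy to count.
embeds_cat = torch.randn(4, 5, 1)

packed_short = pack_padded_sequence(embeds_cat, [4, 4, 3, 3], batch_first=True)
packed_full = pack_padded_sequence(embeds_cat, [5, 5, 4, 4], batch_first=True)

print(packed_short.data.shape)  # torch.Size([14, 1]) -- last valid step dropped
print(packed_full.data.shape)   # torch.Size([18, 1]) -- all timesteps kept
```

With the unchanged lengths, the last valid timestep of every sequence (the `<end>` embedding) is silently dropped from the packed input; whether that is a bug or the intended behaviour is what the comments below settle.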
Top GitHub Comments
After some more testing I've decided that the lengths shouldn't be changed, since I don't want to use the last caption token as an input to the LSTM (the `<end>` token, 4033 in the example).
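
To make that resolution concrete, here is a hedged sketch of the resulting input/target alignment (token values taken from the example above; packing the raw captions as targets follows the tutorial's training step, but this toy is an illustration, not the tutorial's exact code):

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence

captions = torch.tensor([[126, 1214, 14, 4033]])  # <start> ... <end>
lengths = [4]

# Targets are the caption tokens themselves, packed with the unchanged lengths
targets = pack_padded_sequence(captions, lengths, batch_first=True).data
print(targets)  # tensor([ 126, 1214,   14, 4033])

# Inputs packed with the same lengths are [f, e(126), e(1214), e(14)]:
# step t of the inputs predicts token t of the targets,
#   f -> 126, e(126) -> 1214, e(1214) -> 14, e(14) -> 4033,
# so the <end> embedding e(4033) is never fed to the LSTM.
```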
Hi LapoFrati, thanks for your reply. Somehow GitHub eats my word `<s.o.s>`, which is the start-of-sentence token (in your case `caption_0`). I have modified the `<s.o.s>` in my original question. So you can see, features + zeros will always be used to predict `caption_0`, which is basically a starting token (similar to `<eos>`). Does it make sense?