[Image captioning] pack_padded_sequence wrong lengths
In the Image Captioning tutorial, in the `DecoderRNN`:
```python
from torch.nn.utils.rnn import pack_padded_sequence
import torch

def forward(self, features, captions, lengths):
    embeddings = self.embed(captions)
    # Prepend the image features as the first timestep of each sequence
    embeddings = torch.cat((features.unsqueeze(1), embeddings), 1)
    packed = pack_padded_sequence(embeddings, lengths, batch_first=True)
    hiddens, _ = self.lstm(packed)
    # hiddens is a PackedSequence; hiddens[0] is its flattened data tensor
    outputs = self.linear(hiddens[0])
    return outputs
```
Shouldn't the `lengths` passed to `pack_padded_sequence` be `lengths + 1`, to account for the extra timestep added by concatenating the features?
e.g. (assuming the numbers are caption token indices, `e` is the embedding, `f` are the features, and `batch_size = 4`), if

```
embeds:
e(126)  e(1214)  e(14)    e(4033)
e(126)  e(6)     e(84)    e(4033)
e(126)  e(3002)  e(4033)  e(0)
e(126)  e(3002)  e(4033)  e(0)
```

has `lengths = [4, 4, 3, 3]`, then

```
embeds_cat:
f_0  e(126)  e(1214)  e(14)    e(4033)
f_1  e(126)  e(6)     e(84)    e(4033)
f_2  e(126)  e(3002)  e(4033)  e(0)
f_3  e(126)  e(3002)  e(4033)  e(0)
```

should have `lengths = [5, 5, 4, 4]`, right?
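
As a quick check, here is a minimal sketch (toy tensors with made-up values, not the tutorial's code) of what each choice of `lengths` makes `pack_padded_sequence` actually feed to the LSTM:

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence

# Toy batch: 4 sequences of 5 timesteps (feature + 4 caption embeddings),
# embedding size 1 so the packed data is easy to count.
embeds_cat = torch.randn(4, 5, 1)

packed_short = pack_padded_sequence(embeds_cat, [4, 4, 3, 3], batch_first=True)
packed_full = pack_padded_sequence(embeds_cat, [5, 5, 4, 4], batch_first=True)

print(packed_short.data.shape)  # torch.Size([14, 1]) -- last valid step dropped
print(packed_full.data.shape)   # torch.Size([18, 1]) -- all timesteps kept
```

With the unchanged lengths, the last valid timestep of every sequence (the `<end>` embedding) is silently dropped from the packed input; whether that is a bug or the intended behaviour is what the comments below settle.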
Top GitHub Comments
After some more testing I've decided that the lengths shouldn't be changed, since I don't want to use the last caption token as an input to the LSTM (the `<end>` token, 4033 in the example).
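
To make that resolution concrete, here is a hedged sketch of the resulting input/target alignment (token values taken from the example above; packing the raw captions as targets follows the tutorial's training step, but this toy is an illustration, not the tutorial's exact code):

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence

captions = torch.tensor([[126, 1214, 14, 4033]])  # <start> ... <end>
lengths = [4]

# Targets are the caption tokens themselves, packed with the unchanged lengths
targets = pack_padded_sequence(captions, lengths, batch_first=True).data
print(targets)  # tensor([ 126, 1214,   14, 4033])

# Inputs packed with the same lengths are [f, e(126), e(1214), e(14)]:
# step t of the inputs predicts token t of the targets,
#   f -> 126, e(126) -> 1214, e(1214) -> 14, e(14) -> 4033,
# so the <end> embedding e(4033) is never fed to the LSTM.
```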
Hi LapoFrati, thanks for your reply. Somehow GitHub eats my word `<s.o.s>`, which is the start-of-sentence token (in your case `caption_0`). I have modified the `<s.o.s>` in my original question. So you can see, features + zeros will always be used to predict `caption_0`, which is basically a starting token (similar to `<eos>`). Does it make sense?