
Loss Calculation in Image Captioning


Hi, I'm new to PyTorch. In model.py you do:

embeddings = self.embed(captions)  # (batch, caption_len, embed_size)
embeddings = torch.cat((features.unsqueeze(1), embeddings), 1)  # prepend image feature: (batch, caption_len + 1, embed_size)

That means you are only passing the context (from the EncoderCNN) as input at the first time step, so your outputs will always have size captions + 1, since you run the LSTM for captions + 1 time steps in the decoder. But it looks like your targets and outputs are the same size. What am I missing here? Thanks in advance.
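A quick shape trace makes the observation concrete (a toy sketch with made-up sizes; batch_first LSTM assumed, variable names illustrative):

import torch
import torch.nn as nn

B, T, E, H = 2, 5, 8, 16                  # batch, caption length, embed, hidden (toy)
embed = nn.Embedding(20, E)
lstm = nn.LSTM(E, H, batch_first=True)

captions = torch.randint(0, 20, (B, T))
features = torch.randn(B, E)              # stand-in for the EncoderCNN output

embeddings = embed(captions)                                    # (B, T, E)
embeddings = torch.cat((features.unsqueeze(1), embeddings), 1)  # (B, T+1, E)
outputs, _ = lstm(embeddings)
print(outputs.shape)                      # torch.Size([2, 6, 16]) -- T+1 steps, as noted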

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Reactions: 1
  • Comments: 7

Top GitHub Comments

1 reaction
okanlv commented, Sep 26, 2018

The decoder should have the same number of elements in its input and output. The input in your example contains 7 words, whereas the output contains 5. Every word (including the feature) is used to predict the next word, so it looks like the following:

Input: feature <start> there is a cat

Output: <start> there is a cat <end>
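To make that alignment concrete, here is a minimal, self-contained sketch (toy sizes; slicing off the final embedding is an assumption standing in for the pack_padded_sequence call the tutorial uses, which keeps outputs and targets the same length):

import torch
import torch.nn as nn

V, E, H = 10, 8, 16                       # vocab, embed, hidden sizes (toy)
embed = nn.Embedding(V, E)
lstm = nn.LSTM(E, H, batch_first=True)
linear = nn.Linear(H, V)

captions = torch.tensor([[1, 4, 5, 6, 7, 2]])   # <start> there is a cat <end>  (T = 6)
features = torch.randn(1, E)                     # image feature from the EncoderCNN

# Inputs: [feature, <start>, there, is, a, cat] -- drop the last embedding so
# the LSTM runs exactly T steps and its outputs line up with the T targets.
embeddings = torch.cat((features.unsqueeze(1), embed(captions)), 1)  # (1, T+1, E)
inputs = embeddings[:, :-1, :]                                       # (1, T, E)

hiddens, _ = lstm(inputs)                 # (1, T, H)
logits = linear(hiddens)                  # (1, T, V)

# Step t predicts token t of the caption, so the targets are the caption itself.
loss = nn.CrossEntropyLoss()(logits.reshape(-1, V), captions.reshape(-1))
print(loss.item())

The output at the step that consumed the image feature is trained to emit <start>, and the step that consumed "cat" is trained to emit <end>, exactly as in the example above.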

0 reactions
fmehralian commented, Apr 7, 2020

Hey guys, the reason for concatenating the feature vector with the input (here) is not clear to me. Could you help me figure out the difference between this implementation and one that passes the feature vector as the initial hidden state of the first LSTM cell, as listed below?

  1. LSTM(cat(image_feature, input)) [link]
  2. LSTM(input, hidden=image_feature) [link]
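For concreteness, the two variants can be sketched side by side (a minimal illustration with made-up sizes, not either linked implementation; E == H is chosen so the same feature tensor fits both):

import torch
import torch.nn as nn

B, T, E, H, V = 2, 5, 8, 8, 10            # batch, steps, embed, hidden, vocab (toy)
embed = nn.Embedding(V, E)
captions = torch.randint(0, V, (B, T))
features = torch.randn(B, E)              # stand-in for the image feature

# Variant 1: prepend the feature as a pseudo-token; hidden state starts at zero.
lstm1 = nn.LSTM(E, H, batch_first=True)
inputs1 = torch.cat((features.unsqueeze(1), embed(captions)), 1)  # (B, T+1, E)
out1, _ = lstm1(inputs1)                  # (B, T+1, H): one extra step for the image

# Variant 2: feed only the words; the feature becomes the initial hidden state.
lstm2 = nn.LSTM(E, H, batch_first=True)
h0 = features.unsqueeze(0)                # (num_layers = 1, B, H)
c0 = torch.zeros_like(h0)
out2, _ = lstm2(embed(captions), (h0, c0))  # (B, T, H): no extra step

The practical difference: in variant 1 the image is consumed through the input gates like an ordinary word and costs one extra time step, while in variant 2 it enters through the recurrent state, must match the hidden size rather than the embedding size, and is gradually overwritten as the state updates. Both choices appear in published captioning models.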


