Loss Calculation in Image Captioning
Hi, I am new to PyTorch. In model.py you do:
```python
embeddings = self.embed(captions)
embeddings = torch.cat((features.unsqueeze(1), embeddings), 1)
```
That means you are only passing the context (from the EncoderCNN) as input at the first time step, so your outputs will always have size captions + 1, since you are running the LSTM for captions + 1 time steps in the decoder. Yet it looks like your targets and outputs have the same size. What am I missing here? Thanks in advance.
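For context, here is a minimal, self-contained sketch of the kind of decoder forward pass being described (the layer names `embed`, `lstm`, `linear`, the batch-first layout, and the shapes are assumptions for illustration, not taken verbatim from the repo). It shows why prepending the image feature yields one more output step than the caption length:

```python
import torch
import torch.nn as nn

class DecoderRNN(nn.Module):
    def __init__(self, embed_size, hidden_size, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.linear = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):
        # features: (batch, embed_size) from the EncoderCNN
        # captions: (batch, seq_len) token ids
        embeddings = self.embed(captions)                               # (batch, seq_len, embed_size)
        embeddings = torch.cat((features.unsqueeze(1), embeddings), 1)  # (batch, seq_len + 1, embed_size)
        hiddens, _ = self.lstm(embeddings)                              # (batch, seq_len + 1, hidden_size)
        return self.linear(hiddens)                                     # (batch, seq_len + 1, vocab_size)
```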
Issue Analytics
- State:
- Created 6 years ago
- Reactions: 1
- Comments: 7
Top Results From Across the Web
- Image Captioning in Deep Learning | by Pranoy Radhakrishnan: Image Captioning is the process of generating textual description of an image. It uses both Natural Language Processing and Computer Vision to generate...
- Image Captioning - Keras: Implement an image captioning model using a CNN and a ... and compute the loss as well as accuracy # for each...
- Image captioning with visual attention | TensorFlow Core: Use adapt to iterate over all captions, split the captions into words, and compute a vocabulary of the top words. Tokenize all captions...
- Contrastive Learning for Image Captioning - NIPS papers: A majority of image captioning models are learned by Maximum Likelihood Estimation (MLE), where the probabilities of training captions conditioned on ...
- Image Caption Generation with Recursive Neural Networks: word x_t and the hidden state h_{t-1} is used to compute the input image vector w_t. Figure 3: Depiction of the 14x14x512 CNN...
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
The decoder should have the same number of elements in its input and its output. The input in your example contains 7 words, whereas the output contains 5 words. Every word (including the feature) is used to predict the next word, so it looks like the following.
Input:
feature <start> there is a cat
Output:
<start> there is a cat <end>
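As a rough sketch of that alignment (the toy vocabulary, token ids, and layer setup below are made up for illustration), the image feature takes the place of the first input step and the trailing <end> token is never fed in, so the decoder output and the target caption end up the same length:

```python
import torch
import torch.nn as nn

# Toy vocabulary: 0=<start>, 1=there, 2=is, 3=a, 4=cat, 5=<end>
vocab_size, embed_size, hidden_size = 6, 256, 512
captions = torch.tensor([[0, 1, 2, 3, 4, 5]])        # (batch=1, 6): <start> there is a cat <end>
features = torch.randn(1, embed_size)                # (batch, embed_size) from the EncoderCNN

embed = nn.Embedding(vocab_size, embed_size)
lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
linear = nn.Linear(hidden_size, vocab_size)

# Input:  feature <start> there is a cat   (the trailing <end> is not fed in)
inputs = torch.cat((features.unsqueeze(1), embed(captions[:, :-1])), 1)  # (1, 6, embed_size)

hiddens, _ = lstm(inputs)
outputs = linear(hiddens)                            # (1, 6, vocab_size): same length as the caption

# Target: <start> there is a cat <end>  -- each input step predicts the next token
loss = nn.CrossEntropyLoss()(outputs.reshape(-1, vocab_size), captions.reshape(-1))
```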
Hey guys, the reason behind concatenating the feature vector with the input (here) is not clear to me. Would you please help me figure out the difference between this implementation and the alternative where we pass the feature vector as the initial hidden state of the first LSTM cell, i.e.:
1) the concatenation used here, versus
2) LSTM(input, hidden=image_feature)
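For what it is worth, here is a hedged sketch of the two variants being compared (the layer names, shapes, and the projection layer are assumptions for illustration). Option 1 feeds the image feature as an extra first input step, as in the snippet above; option 2 instead uses it to initialize the LSTM hidden state:

```python
import torch
import torch.nn as nn

batch, embed_size, hidden_size, vocab_size, seq_len = 4, 256, 512, 1000, 12
features = torch.randn(batch, embed_size)                  # EncoderCNN output
captions = torch.randint(0, vocab_size, (batch, seq_len))  # token ids

embed = nn.Embedding(vocab_size, embed_size)
lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)

# Option 1: feature concatenated as the first input step (this repo's approach)
inputs = torch.cat((features.unsqueeze(1), embed(captions)), 1)  # (batch, seq_len + 1, embed_size)
out1, _ = lstm(inputs)                                           # (batch, seq_len + 1, hidden_size)

# Option 2: feature as the initial hidden state, LSTM(input, hidden=image_feature)
# A projection is needed here because embed_size != hidden_size.
project = nn.Linear(embed_size, hidden_size)
h0 = project(features).unsqueeze(0)                              # (1, batch, hidden_size)
c0 = torch.zeros_like(h0)
out2, _ = lstm(embed(captions), (h0, c0))                        # (batch, seq_len, hidden_size)
```

In option 1 the image is treated like an extra input token and yields one extra output step; in option 2 the sequence length is unchanged and the image only conditions the initial state.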