Loss Calculation in Image Captioning
Hi, I am new to PyTorch. In model.py you do:
```python
embeddings = self.embed(captions)
embeddings = torch.cat((features.unsqueeze(1), embeddings), 1)
```
That means you are only passing the context (from the EncoderCNN) as input at the first time step, so your outputs will always have size captions + 1, since you are running the LSTM for captions + 1 time steps in the decoder. Yet it looks like your targets and outputs have the same size. What am I missing here? Thanks in advance.
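For context, here is a minimal, self-contained sketch of the kind of decoder forward pass being described (the layer names `embed`, `lstm`, `linear`, the batch-first layout, and the shapes are assumptions for illustration, not taken verbatim from the repo). It shows why prepending the image feature yields one more output step than the caption length:

```python
import torch
import torch.nn as nn

class DecoderRNN(nn.Module):
    def __init__(self, embed_size, hidden_size, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.linear = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):
        # features: (batch, embed_size) from the EncoderCNN
        # captions: (batch, seq_len) token ids
        embeddings = self.embed(captions)                               # (batch, seq_len, embed_size)
        embeddings = torch.cat((features.unsqueeze(1), embeddings), 1)  # (batch, seq_len + 1, embed_size)
        hiddens, _ = self.lstm(embeddings)                              # (batch, seq_len + 1, hidden_size)
        return self.linear(hiddens)                                     # (batch, seq_len + 1, vocab_size)
```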
Issue Analytics
- State:
- Created 6 years ago
- Reactions: 1
- Comments: 7
Top Results From Across the Web
- Image Captioning in Deep Learning | by Pranoy Radhakrishnan: Image Captioning is the process of generating textual description of an image. It uses both Natural Language Processing and Computer Vision to generate...
- Image Captioning - Keras: Implement an image captioning model using a CNN and a ... and compute the loss as well as accuracy # for each...
- Image captioning with visual attention | TensorFlow Core: Use adapt to iterate over all captions, split the captions into words, and compute a vocabulary of the top words. Tokenize all captions...
- Contrastive Learning for Image Captioning - NIPS papers: A majority of image captioning models are learned by Maximum Likelihood Estimation (MLE), where the probabilities of training captions conditioned on ...
- Image Caption Generation with Recursive Neural Networks: word x_t and the hidden state h_{t-1} is used to compute the input image vector w_t. Figure 3: Depiction of the 14x14x512 CNN...
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
The decoder should have the same number of elements in its input and its output. The input in your example contains 7 words, whereas the output contains 5 words. Every word (including the feature) is used to predict the next word, so it looks like the following.
Input:
feature <start> there is a cat
Output:
<start> there is a cat <end>
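As a rough sketch of that alignment (the toy vocabulary, token ids, and layer setup below are made up for illustration), the image feature takes the place of the first input step and the trailing <end> token is never fed in, so the decoder output and the target caption end up the same length:

```python
import torch
import torch.nn as nn

# Toy vocabulary: 0=<start>, 1=there, 2=is, 3=a, 4=cat, 5=<end>
vocab_size, embed_size, hidden_size = 6, 256, 512
captions = torch.tensor([[0, 1, 2, 3, 4, 5]])        # (batch=1, 6): <start> there is a cat <end>
features = torch.randn(1, embed_size)                # (batch, embed_size) from the EncoderCNN

embed = nn.Embedding(vocab_size, embed_size)
lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
linear = nn.Linear(hidden_size, vocab_size)

# Input:  feature <start> there is a cat   (the trailing <end> is not fed in)
inputs = torch.cat((features.unsqueeze(1), embed(captions[:, :-1])), 1)  # (1, 6, embed_size)

hiddens, _ = lstm(inputs)
outputs = linear(hiddens)                            # (1, 6, vocab_size): same length as the caption

# Target: <start> there is a cat <end>  -- each input step predicts the next token
loss = nn.CrossEntropyLoss()(outputs.reshape(-1, vocab_size), captions.reshape(-1))
```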
Hey guys, the reason behind concatenating the feature vector with the input (here) is not clear to me. Would you please help me figure out the difference between this implementation and the alternative where we pass the feature vector as the initial hidden state of the first LSTM cell, i.e.:
1) the concatenation used here, versus
2) LSTM(input, hidden=image_feature)
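For what it is worth, here is a hedged sketch of the two variants being compared (the layer names, shapes, and the projection layer are assumptions for illustration). Option 1 feeds the image feature as an extra first input step, as in the snippet above; option 2 instead uses it to initialize the LSTM hidden state:

```python
import torch
import torch.nn as nn

batch, embed_size, hidden_size, vocab_size, seq_len = 4, 256, 512, 1000, 12
features = torch.randn(batch, embed_size)                  # EncoderCNN output
captions = torch.randint(0, vocab_size, (batch, seq_len))  # token ids

embed = nn.Embedding(vocab_size, embed_size)
lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)

# Option 1: feature concatenated as the first input step (this repo's approach)
inputs = torch.cat((features.unsqueeze(1), embed(captions)), 1)  # (batch, seq_len + 1, embed_size)
out1, _ = lstm(inputs)                                           # (batch, seq_len + 1, hidden_size)

# Option 2: feature as the initial hidden state, LSTM(input, hidden=image_feature)
# A projection is needed here because embed_size != hidden_size.
project = nn.Linear(embed_size, hidden_size)
h0 = project(features).unsqueeze(0)                              # (1, batch, hidden_size)
c0 = torch.zeros_like(h0)
out2, _ = lstm(embed(captions), (h0, c0))                        # (batch, seq_len, hidden_size)
```

In option 1 the image is treated like an extra input token and yields one extra output step; in option 2 the sequence length is unchanged and the image only conditions the initial state.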