
[Feature Proposal] Add image captioning example


Hi fairseq team!

As mentioned in issues #90, #313, and #475, there are plenty of places where vision and language intersect (e.g., image captioning, VQA). I have written an image captioning example based on this excellent fairseq toolkit, and I would like to know whether there is a plan to add an image captioning / text recognition example.

My implementation is in my text-recognition branch; the current structure is only a CRNN with a CTCLoss criterion.
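For illustration, a minimal PyTorch sketch of that structure: a small conv stack that collapses the image height, a bidirectional LSTM over the width axis, and CTC loss on top. All layer sizes and the toy 11-symbol alphabet are made up here, not taken from the branch:

```python
import torch
import torch.nn as nn

class CRNN(nn.Module):
    def __init__(self, num_classes, hidden=64):
        super().__init__()
        # Conv stack collapses image height; width becomes the time axis.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),                 # halve height only
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),      # height -> 1
        )
        self.rnn = nn.LSTM(64, hidden, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, num_classes)  # class 0 = CTC blank

    def forward(self, images):                    # images: (B, 1, H, W)
        f = self.cnn(images).squeeze(2)           # (B, C, W)
        f = f.transpose(1, 2)                     # (B, W, C)
        out, _ = self.rnn(f)
        return self.fc(out).log_softmax(-1)       # (B, W, num_classes)

model = CRNN(num_classes=11)                      # toy alphabet: 10 symbols + blank
images = torch.randn(2, 1, 32, 40)
log_probs = model(images).transpose(0, 1)         # CTCLoss expects (T, B, C)
targets = torch.randint(1, 11, (2, 5))            # labels avoid blank index 0
loss = nn.CTCLoss(blank=0)(
    log_probs, targets,
    input_lengths=torch.full((2,), log_probs.size(0)),
    target_lengths=torch.full((2,), 5),
)
```

CTC is a natural fit for text recognition because the model never has to predict where each character sits along the width axis; the loss marginalizes over all alignments.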

My next plan is to add an attention module and a transformer module to the image captioning task, based on fairseq's official implementation modules.

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Reactions: 1
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

4 reactions
krasserm commented, Feb 25, 2020

@adrelino in fairseq-image-captioning you can already use pre-computed features extracted with a Faster-RCNN for training transformer-based captioning models, via the command-line option --features obj. These features are pre-computed for the MS-COCO dataset and split into the Karpathy train, validation, and test sets.

At the moment, I only use these pre-computed features. A later version of fairseq-image-captioning will use a Faster-RCNN directly; implementations from torchvision or detectron2 are good candidates. This will also allow fine-tuning the object detector together with the image captioning model (which will probably require a larger dataset than MS-COCO). I'm happy to collaborate on that or to accept pull requests.

At the moment, I'm implementing Self-critical Sequence Training for Image Captioning and already have promising results. It took me a while to implement because I had to rewrite the sequence generator so that it can also be used during training, i.e., it supports back-propagation (which the default sequence generator in fairseq does not). It should be on GitHub soon. Update, Feb 25, 2020: self-critical sequence training is now implemented.
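The self-critical objective itself is compact: REINFORCE where the reward of the greedy decode serves as the baseline, so only sampled captions that beat the greedy caption get reinforced. A hedged sketch (the function name and toy reward values are illustrative, not the repository's API):

```python
import torch

def scst_loss(sample_logprobs, sample_reward, greedy_reward):
    """Self-critical sequence training loss.

    sample_logprobs: (B, T) log-probs of the sampled caption's tokens
    sample_reward / greedy_reward: (B,) e.g. CIDEr scores per caption
    """
    advantage = sample_reward - greedy_reward     # greedy decode as baseline
    # Gradient flows only through the sampled tokens' log-probs;
    # rewards are treated as constants.
    return -(advantage.unsqueeze(1) * sample_logprobs).mean()

logp = torch.tensor([[-1.0, -2.0, -1.5, -0.5],
                     [-2.0, -1.0, -0.5, -1.5]], requires_grad=True)
loss = scst_loss(logp, torch.tensor([0.9, 0.3]), torch.tensor([0.5, 0.5]))
loss.backward()
```

This is why the sequence generator must support back-propagation: the sampled tokens' log-probabilities have to stay on the autograd graph so the policy gradient can reach the model parameters.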

Afterwards, I initially planned to implement M2: Meshed-Memory Transformer for Image Captioning, which requires some extensions to the transformer implementation in fairseq, but I'm also open to giving a Faster-RCNN integration higher priority if you are interested in a collaboration.

Regarding

This is then fed into fairseq's LSTM/Transformer-based decoder to generate the captions.

fairseq-image-captioning also supports feeding extracted features into a transformer encoder for self-attention on visual “tokens” and then feeding the encoder output into a transformer decoder. Using a transformer encoder can be enabled with the --arch default-captioning-arch command line option.
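Conceptually, that pipeline can be sketched in PyTorch as: project the extracted region features to the model dimension, run self-attention over them in a transformer encoder, then cross-attend from the caption decoder. The dimensions (36 regions of size 2048, Faster-RCNN style) and layer names are illustrative assumptions, not fairseq's actual modules:

```python
import torch
import torch.nn as nn

d_model, vocab = 128, 100
project = nn.Linear(2048, d_model)                # region features -> model dim
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
embed = nn.Embedding(vocab, d_model)
out_proj = nn.Linear(d_model, vocab)

regions = torch.randn(1, 36, 2048)                # pre-computed object features
memory = encoder(project(regions))                # self-attention over visual "tokens"
prev_tokens = torch.randint(0, vocab, (1, 7))     # shifted caption tokens
logits = out_proj(decoder(embed(prev_tokens), memory))  # (1, 7, vocab)
```

Compared with feeding the raw features straight to the decoder, the encoder lets the region "tokens" attend to each other before the decoder's cross-attention, much like source tokens in machine translation.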

3 reactions
krasserm commented, Nov 28, 2019

Together with @cstub, I started working on an image captioning extension for fairseq. It is still at an early stage, but you can already train transformer-based image captioning models. There's also a simple demo and a pre-trained model available. More details are in the project's README.
