[Feature Proposal] Add image captioning example
Hi fairseq team!
As mentioned in issues #90, #313 and #475, there are plenty of places where vision and language intersect (e.g., image captioning, VQA), and I have written an image captioning example based on this excellent fairseq toolkit. Is there a plan to add an image captioning / text recognition example?
My implementation is in my text-recognition branch; the current structure is just a CRNN with a CTCLoss criterion.
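For reference, here is a minimal sketch of a CRNN-with-CTCLoss setup in plain PyTorch (not code from the linked branch); the layer sizes, the 37-class alphabet and the input resolution are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CRNN(nn.Module):
    """Convolutional feature extractor followed by a recurrent head,
    emitting per-timestep character logits for CTC decoding."""
    def __init__(self, num_classes, hidden_size=256):
        super().__init__()
        # The CNN collapses the height dimension; width becomes the time axis.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),  # height -> 1, width preserved
        )
        self.rnn = nn.LSTM(256, hidden_size, bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden_size, num_classes)  # classes include the CTC blank

    def forward(self, images):                    # images: (B, 1, H, W)
        feats = self.cnn(images)                  # (B, 256, 1, W')
        feats = feats.squeeze(2).transpose(1, 2)  # (B, W', 256)
        out, _ = self.rnn(feats)                  # (B, W', 2*hidden)
        return self.proj(out)                     # (B, W', num_classes)

# CTC training step (blank index 0 by convention here).
model = CRNN(num_classes=37)
ctc = nn.CTCLoss(blank=0, zero_infinity=True)
images = torch.randn(4, 1, 32, 128)
targets = torch.randint(1, 37, (4, 10))              # label indices, blank excluded
logits = model(images)                               # (B, T, C)
log_probs = logits.log_softmax(-1).transpose(0, 1)   # CTCLoss expects (T, B, C)
input_lengths = torch.full((4,), log_probs.size(0), dtype=torch.long)
target_lengths = torch.full((4,), 10, dtype=torch.long)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```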
My next plan is to add an attention module and a transformer module to the image captioning task, based on fairseq's official implementation modules.
Top GitHub Comments
@adrelino in fairseq-image-captioning you can already use pre-computed features extracted with a Faster R-CNN to train transformer-based captioning models, via the command line option --features obj. These features are pre-computed for the MS-COCO dataset and split into the Karpathy train, validation and test sets.
At the moment, I only use these pre-computed features. A later version of fairseq-image-captioning will use a Faster R-CNN directly; the implementations from torchvision or detectron2 are good candidates. This will also allow fine-tuning the object detector together with the image captioning model (which will probably require a larger dataset than MS-COCO). Happy to collaborate on that or accept pull requests.
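To make the pre-computed-feature setup concrete, below is a hypothetical sketch of a dataset that pairs per-image Faster R-CNN region features with tokenized captions. It is not the fairseq-image-captioning data pipeline; the one-.npy-file-per-image layout, the 2048-dimensional features and the fixed region count are assumptions for illustration.

```python
import numpy as np
import torch
from torch.utils.data import Dataset

class ObjectFeatureCaptionDataset(Dataset):
    """Pairs pre-extracted Faster R-CNN region features with tokenized captions.
    Assumes one .npy file per image holding a (num_regions, 2048) array,
    loosely mirroring the '--features obj' setup described above."""
    def __init__(self, feature_paths, captions):
        self.feature_paths = feature_paths   # list of .npy paths, in Karpathy-split order
        self.captions = captions             # list of lists of token ids

    def __len__(self):
        return len(self.feature_paths)

    def __getitem__(self, idx):
        feats = torch.from_numpy(np.load(self.feature_paths[idx])).float()  # (R, 2048)
        caption = torch.tensor(self.captions[idx], dtype=torch.long)
        return feats, caption

def collate(batch, pad_idx=0):
    """Pad captions to the batch maximum; the region count R is assumed fixed per image."""
    feats = torch.stack([f for f, _ in batch])                    # (B, R, 2048)
    max_len = max(c.size(0) for _, c in batch)
    captions = torch.full((len(batch), max_len), pad_idx, dtype=torch.long)
    for i, (_, c) in enumerate(batch):
        captions[i, : c.size(0)] = c
    return feats, captions
```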
At the moment, I’m implementing Self-critical Sequence Training for Image Captioning and already have promising results. It took me a while to implement, as I had to re-write the sequence generator so that it can also be used during training, i.e. it supports back-propagation (which the default sequence generator in fairseq does not). It should be on GitHub soon. Update Feb 25, 2020: self-critical sequence training is now implemented.
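The core of self-critical sequence training is small enough to show as a sketch: sample a caption, greedily decode a baseline, and weight the sampled caption's log-probability by the reward difference. This is not the repository's implementation, and the helper names in the usage comments (sample_captions, greedy_decode, reward_fn) are hypothetical placeholders.

```python
import torch

def self_critical_loss(sample_logprobs, sample_reward, greedy_reward, pad_mask=None):
    """REINFORCE with a greedy-decoding baseline (SCST, Rennie et al., 2017).

    sample_logprobs: (B, T) log-probabilities of the sampled caption tokens
    sample_reward:   (B,)   e.g. CIDEr of the sampled captions
    greedy_reward:   (B,)   CIDEr of the greedily decoded captions (the baseline)
    pad_mask:        (B, T) optional bool mask, True at real (non-padding) tokens
    """
    advantage = (sample_reward - greedy_reward).unsqueeze(1)   # (B, 1)
    per_token = advantage.detach() * sample_logprobs           # broadcast over T
    if pad_mask is not None:
        per_token = per_token * pad_mask
    # Maximizing expected reward == minimizing the negative reward-weighted log-likelihood.
    return -per_token.sum(1).mean()

# Hypothetical usage: only the sampling pass needs gradients, the baseline does not.
# sampled_tokens, sample_logprobs = sample_captions(model, features)   # differentiable
# with torch.no_grad():
#     greedy_tokens = greedy_decode(model, features)                   # baseline
# loss = self_critical_loss(
#     sample_logprobs,
#     reward_fn(sampled_tokens, references),   # e.g. CIDEr scores, shape (B,)
#     reward_fn(greedy_tokens, references),
# )
# loss.backward()
```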
Afterwards, I initially planned to implement M2: Meshed-Memory Transformer for Image Captioning, which requires some extensions to the transformer implementation in fairseq, but I’m also open to giving a Faster R-CNN integration higher priority if you are interested in a collaboration.
Regarding the attention and transformer modules: fairseq-image-captioning also supports feeding the extracted features into a transformer encoder for self-attention over visual “tokens” and then feeding the encoder output into a transformer decoder. Using a transformer encoder can be enabled with the --arch default-captioning-arch command line option.
Together with @cstub, I started to work on an image captioning extension for fairseq. It is still early-access, but you can already train transformer-based image captioning models. There is also a simple demo and a pre-trained model available. More details are in the project’s README.
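For readers unfamiliar with that layout, here is a minimal sketch of the encoder-decoder flow described above, built from stock torch.nn transformer modules rather than the actual default-captioning-arch definition; the dimensions, layer counts and the omission of positional encodings and padding masks are simplifications.

```python
import torch
import torch.nn as nn

class CaptioningTransformer(nn.Module):
    """Region features -> transformer encoder (self-attention over visual 'tokens')
    -> transformer decoder over caption tokens. Positional encodings omitted for brevity."""
    def __init__(self, vocab_size, feat_dim=2048, d_model=512, nhead=8, num_layers=3):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, d_model)       # project regions to model dim
        self.embed = nn.Embedding(vocab_size, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, region_feats, caption_tokens):
        # region_feats: (B, R, feat_dim), caption_tokens: (B, T)
        memory = self.encoder(self.feat_proj(region_feats))              # (B, R, d_model)
        tgt = self.embed(caption_tokens)                                  # (B, T, d_model)
        T = tgt.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        dec = self.decoder(tgt, memory, tgt_mask=causal)                  # (B, T, d_model)
        return self.out(dec)                                              # (B, T, vocab)

# Smoke test with random region features and caption tokens.
model = CaptioningTransformer(vocab_size=10000)
logits = model(torch.randn(2, 36, 2048), torch.randint(0, 10000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 10000])
```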