What is the function of caption sampling and captions per image?
See original GitHub issue
I tried to understand the code, but I don't know what the purpose of captions_per_image and the caption sampling is, and why we need them.
utils.py
# Sample captions for each image, save images to HDF5 file, and captions and their lengths to JSON files
seed(123)
for impaths, imcaps, split in [(train_image_paths, train_image_captions, 'TRAIN'),
                               (val_image_paths, val_image_captions, 'VAL'),
                               (test_image_paths, test_image_captions, 'TEST')]:

    with h5py.File(os.path.join(output_folder, split + '_IMAGES_' + base_filename + '.hdf5'), 'a') as h:
        # Make a note of the number of captions we are sampling per image
        h.attrs['captions_per_image'] = captions_per_image

        # Create dataset inside HDF5 file to store images
        images = h.create_dataset('images', (len(impaths), 3, 256, 256), dtype='uint8')

        print("\nReading %s images and captions, storing to file...\n" % split)

        enc_captions = []
        caplens = []

        for i, path in enumerate(tqdm(impaths)):

            # Sample captions
            if len(imcaps[i]) < captions_per_image:
                captions = imcaps[i] + [choice(imcaps[i]) for _ in range(captions_per_image - len(imcaps[i]))]
            else:
                captions = sample(imcaps[i], k=captions_per_image)

            # Sanity check
            assert len(captions) == captions_per_image

            # Read images
            img = imread(impaths[i])
            if len(img.shape) == 2:
                img = img[:, :, np.newaxis]
                img = np.concatenate([img, img, img], axis=2)
            img = imresize(img, (256, 256))
            img = img.transpose(2, 0, 1)
            assert img.shape == (3, 256, 256)
            assert np.max(img) <= 255

            # Save image to HDF5 file
            images[i] = img

            for j, c in enumerate(captions):
                # Encode captions
                enc_c = [word_map['<start>']] + [word_map.get(word, word_map['<unk>']) for word in c] + [
                    word_map['<end>']] + [word_map['<pad>']] * (max_len - len(c))

                # Find caption lengths
                c_len = len(c) + 2

                enc_captions.append(enc_c)
                caplens.append(c_len)

        # Sanity check
        assert images.shape[0] * captions_per_image == len(enc_captions) == len(caplens)

        # Save encoded captions and their lengths to JSON files
        with open(os.path.join(output_folder, split + '_CAPTIONS_' + base_filename + '.json'), 'w') as j:
            json.dump(enc_captions, j)

        with open(os.path.join(output_folder, split + '_CAPLENS_' + base_filename + '.json'), 'w') as j:
            json.dump(caplens, j)
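To make the sampling and encoding concrete, here is a minimal standalone sketch (the toy captions, toy word map, and max_len below are made up for illustration, not taken from the repo): if an image has fewer than captions_per_image captions, existing ones are duplicated at random; if it has more, exactly captions_per_image are drawn without replacement. Each sampled caption is then wrapped in <start>/<end> tokens and padded to max_len, and its true length counts the two special tokens but not the padding.

from random import seed, choice, sample

captions_per_image = 5

# Toy data: one image with too few captions, one with too many (made up for illustration)
imcaps = [
    [['a', 'dog', 'runs'], ['dog', 'running'], ['a', 'running', 'dog']],                             # 3 captions
    [['a', 'cat'], ['cat', 'sits'], ['a', 'cat', 'sits'], ['cat'], ['the', 'cat'], ['one', 'cat']],  # 6 captions
]

seed(123)
for caps in imcaps:
    if len(caps) < captions_per_image:
        # Too few: keep them all and duplicate random ones until there are exactly 5
        sampled = caps + [choice(caps) for _ in range(captions_per_image - len(caps))]
    else:
        # Too many: draw exactly 5 without replacement
        sampled = sample(caps, k=captions_per_image)
    assert len(sampled) == captions_per_image

# Encoding one tokenized caption with a toy word map
word_map = {'<pad>': 0, '<start>': 1, '<end>': 2, '<unk>': 3, 'a': 4, 'dog': 5, 'runs': 6}
max_len = 6
c = ['a', 'dog', 'runs']
enc_c = ([word_map['<start>']] + [word_map.get(w, word_map['<unk>']) for w in c]
         + [word_map['<end>']] + [word_map['<pad>']] * (max_len - len(c)))
c_len = len(c) + 2                      # <start> + 3 words + <end>; padding is not counted
# enc_c == [1, 4, 5, 6, 2, 0, 0, 0], c_len == 5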
datasets.py
# cpi is captions per image
def __getitem__(self, i):
    # Remember, the Nth caption corresponds to the (N // captions_per_image)th image
    img = torch.FloatTensor(self.imgs[i // self.cpi] / 255.)
    if self.transform is not None:
        img = self.transform(img)

    caption = torch.LongTensor(self.captions[i])
    caplen = torch.LongTensor([self.caplens[i]])

    if self.split == 'TRAIN':
        return img, caption, caplen
    else:
        # For validation or testing, also return all 'captions_per_image' captions to find BLEU-4 score
        all_captions = torch.LongTensor(
            self.captions[((i // self.cpi) * self.cpi):(((i // self.cpi) * self.cpi) + self.cpi)])
        return img, caption, caplen, all_captions
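For context, here is a rough sketch (with made-up numbers, not code from the repo) of the index arithmetic that __getitem__ relies on: the dataset is indexed by caption, so its length is the number of images times cpi, and every block of cpi consecutive indices maps back to the same image.

# Made-up numbers for illustration: 100 images, 5 captions per image
num_images, cpi = 100, 5
dataset_len = num_images * cpi                 # __len__ would be 500: one entry per caption

for i in (0, 4, 5, 143):
    img_idx = i // cpi                         # which image this caption belongs to
    start = img_idx * cpi                      # first caption index of that image's block
    block = list(range(start, start + cpi))    # the cpi captions returned as all_captions for VAL/TEST
    print(i, img_idx, block)
# 0   -> image 0,  block [0, 1, 2, 3, 4]
# 4   -> image 0,  block [0, 1, 2, 3, 4]
# 5   -> image 1,  block [5, 6, 7, 8, 9]
# 143 -> image 28, block [140, 141, 142, 143, 144]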
After reading your explanation, I opened the Karpathy JSON file and beautified the JSON, and I realized that one image can have 5 captions.
My mistake was misunderstanding the word "captions" (English is not my native tongue). I assumed a "caption" was a "token", so for an image with this caption:
A boy is sitting on a brightly colored chair next to a child 's book playing a guitar
I thought captions per image would limit it to:
A boy is sitting on
karpathy_json_file.json
Thank you so much @sgrvinod for your explanation, quick reply, and this awesome tutorial.
The resulting files do contain the sampled captions, but they are not grouped in 5s - they are flattened lists.
So, if you have 100 images, and you sample 5 captions per image, there are a total of 500 captions, right?
Then,
TRAIN_CAPLENS_coco_5_cap_per_img_5_min_word_freq.json contains the true word lengths of the 500 captions.
TRAIN_CAPTIONS_coco_5_cap_per_img_5_min_word_freq.json contains the 500 captions, where each caption is a list of encoded numbers.
So both these lists have 500 elements. The first 5 elements in both files correspond to Image 1, the second 5 elements correspond to Image 2, etc.
The images are stored in the HDF5 file, and there are 100 of them. They're in the same order as the captions. For example, the 143rd element from the first two files will correspond to the 143 // 5 = 28th image from this file.
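Putting that together, here is a minimal sketch of reading the resulting files back. The JSON file names are the ones quoted above; the HDF5 name is inferred from the split + '_IMAGES_' + base_filename pattern in the code, so adjust the paths to your own output folder.

import json
import h5py

cpi = 5  # captions per image used when the files were created

with open('TRAIN_CAPTIONS_coco_5_cap_per_img_5_min_word_freq.json') as f:
    enc_captions = json.load(f)         # flat list: 5 encoded captions per image, in image order
with open('TRAIN_CAPLENS_coco_5_cap_per_img_5_min_word_freq.json') as f:
    caplens = json.load(f)              # same length and order as enc_captions

with h5py.File('TRAIN_IMAGES_coco_5_cap_per_img_5_min_word_freq.hdf5', 'r') as h:
    assert h.attrs['captions_per_image'] == cpi
    assert len(enc_captions) == len(caplens) == h['images'].shape[0] * cpi

    i = 143                             # a caption index (0-based)
    caption, length = enc_captions[i], caplens[i]
    img = h['images'][i // cpi]         # the image that caption i describes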
Thanks! It’s great to hear you play too. He’s my favorite hero.