What is the function of caption sampling and captions per image?
See original GitHub issue
I tried to understand the code, but I don't know what the purpose of captions_per_image and the caption sampling is, and why we need them.
utils.py
# Sample captions for each image, save images to HDF5 file, and captions and their lengths to JSON files
seed(123)
for impaths, imcaps, split in [(train_image_paths, train_image_captions, 'TRAIN'),
                               (val_image_paths, val_image_captions, 'VAL'),
                               (test_image_paths, test_image_captions, 'TEST')]:

    with h5py.File(os.path.join(output_folder, split + '_IMAGES_' + base_filename + '.hdf5'), 'a') as h:
        # Make a note of the number of captions we are sampling per image
        h.attrs['captions_per_image'] = captions_per_image

        # Create dataset inside HDF5 file to store images
        images = h.create_dataset('images', (len(impaths), 3, 256, 256), dtype='uint8')

        print("\nReading %s images and captions, storing to file...\n" % split)

        enc_captions = []
        caplens = []

        for i, path in enumerate(tqdm(impaths)):

            # Sample captions
            if len(imcaps[i]) < captions_per_image:
                captions = imcaps[i] + [choice(imcaps[i]) for _ in range(captions_per_image - len(imcaps[i]))]
            else:
                captions = sample(imcaps[i], k=captions_per_image)

            # Sanity check
            assert len(captions) == captions_per_image

            # Read images
            img = imread(impaths[i])
            if len(img.shape) == 2:
                img = img[:, :, np.newaxis]
                img = np.concatenate([img, img, img], axis=2)
            img = imresize(img, (256, 256))
            img = img.transpose(2, 0, 1)
            assert img.shape == (3, 256, 256)
            assert np.max(img) <= 255

            # Save image to HDF5 file
            images[i] = img

            for j, c in enumerate(captions):
                # Encode captions
                enc_c = [word_map['<start>']] + [word_map.get(word, word_map['<unk>']) for word in c] + [
                    word_map['<end>']] + [word_map['<pad>']] * (max_len - len(c))

                # Find caption lengths
                c_len = len(c) + 2

                enc_captions.append(enc_c)
                caplens.append(c_len)

        # Sanity check
        assert images.shape[0] * captions_per_image == len(enc_captions) == len(caplens)

        # Save encoded captions and their lengths to JSON files
        with open(os.path.join(output_folder, split + '_CAPTIONS_' + base_filename + '.json'), 'w') as j:
            json.dump(enc_captions, j)

        with open(os.path.join(output_folder, split + '_CAPLENS_' + base_filename + '.json'), 'w') as j:
            json.dump(caplens, j)
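To make the sampling and encoding concrete, here is a minimal standalone sketch (the toy captions, toy word map, and max_len below are made up for illustration, not taken from the repo): if an image has fewer than captions_per_image captions, existing ones are duplicated at random; if it has more, exactly captions_per_image are drawn without replacement. Each sampled caption is then wrapped in <start>/<end> tokens and padded to max_len, and its true length counts the two special tokens but not the padding.

from random import seed, choice, sample

captions_per_image = 5

# Toy data: one image with too few captions, one with too many (made up for illustration)
imcaps = [
    [['a', 'dog', 'runs'], ['dog', 'running'], ['a', 'running', 'dog']],                             # 3 captions
    [['a', 'cat'], ['cat', 'sits'], ['a', 'cat', 'sits'], ['cat'], ['the', 'cat'], ['one', 'cat']],  # 6 captions
]

seed(123)
for caps in imcaps:
    if len(caps) < captions_per_image:
        # Too few: keep them all and duplicate random ones until there are exactly 5
        sampled = caps + [choice(caps) for _ in range(captions_per_image - len(caps))]
    else:
        # Too many: draw exactly 5 without replacement
        sampled = sample(caps, k=captions_per_image)
    assert len(sampled) == captions_per_image

# Encoding one tokenized caption with a toy word map
word_map = {'<pad>': 0, '<start>': 1, '<end>': 2, '<unk>': 3, 'a': 4, 'dog': 5, 'runs': 6}
max_len = 6
c = ['a', 'dog', 'runs']
enc_c = ([word_map['<start>']] + [word_map.get(w, word_map['<unk>']) for w in c]
         + [word_map['<end>']] + [word_map['<pad>']] * (max_len - len(c)))
c_len = len(c) + 2                      # <start> + 3 words + <end>; padding is not counted
# enc_c == [1, 4, 5, 6, 2, 0, 0, 0], c_len == 5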
datasets.py
# cpi is captions per image
def __getitem__(self, i):
    # Remember, the Nth caption corresponds to the (N // captions_per_image)th image
    img = torch.FloatTensor(self.imgs[i // self.cpi] / 255.)
    if self.transform is not None:
        img = self.transform(img)

    caption = torch.LongTensor(self.captions[i])
    caplen = torch.LongTensor([self.caplens[i]])

    if self.split == 'TRAIN':
        return img, caption, caplen
    else:
        # For validation or testing, also return all 'captions_per_image' captions to find BLEU-4 score
        all_captions = torch.LongTensor(
            self.captions[((i // self.cpi) * self.cpi):(((i // self.cpi) * self.cpi) + self.cpi)])
        return img, caption, caplen, all_captions
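For context, here is a rough sketch (with made-up numbers, not code from the repo) of the index arithmetic that __getitem__ relies on: the dataset is indexed by caption, so its length is the number of images times cpi, and every block of cpi consecutive indices maps back to the same image.

# Made-up numbers for illustration: 100 images, 5 captions per image
num_images, cpi = 100, 5
dataset_len = num_images * cpi                 # __len__ would be 500: one entry per caption

for i in (0, 4, 5, 143):
    img_idx = i // cpi                         # which image this caption belongs to
    start = img_idx * cpi                      # first caption index of that image's block
    block = list(range(start, start + cpi))    # the cpi captions returned as all_captions for VAL/TEST
    print(i, img_idx, block)
# 0   -> image 0,  block [0, 1, 2, 3, 4]
# 4   -> image 0,  block [0, 1, 2, 3, 4]
# 5   -> image 1,  block [5, 6, 7, 8, 9]
# 143 -> image 28, block [140, 141, 142, 143, 144]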
After reading your explanation, I opened the Karpathy JSON file and beautified the JSON, and I realized that one image can have 5 captions.
My mistake was misunderstanding the word "captions" (English is not my native tongue). I assumed a "caption" was a "token", so for an image with this caption:
A boy is sitting on a brightly colored chair next to a child 's book playing a guitar
I thought captions per image would limit it to:
A boy is sitting on
karpathy_json_file.json
Thank you so much @sgrvinod for your explanation, quick reply, and this awesome tutorial.
The resulting files do contain the sampled captions, but they are not grouped in 5s - they are flattened lists.
So, if you have 100 images, and you sample 5 captions per image, there are a total of 500 captions, right?
Then,
TRAIN_CAPLENS_coco_5_cap_per_img_5_min_word_freq.json contains the true word lengths of the 500 captions.
TRAIN_CAPTIONS_coco_5_cap_per_img_5_min_word_freq.json contains the 500 captions, where each caption is a list of encoded numbers.
So both these lists have 500 elements. The first 5 elements in both files correspond to Image 1, the second 5 elements correspond to Image 2, etc.
The images are stored in the HDF5 file, and there are 100 of them. They're in the same order as the captions. For example, the 143rd element from the first two files will correspond to the 143 // 5 = 28th image from this file.
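Putting that together, here is a minimal sketch of reading the resulting files back. The JSON file names are the ones quoted above; the HDF5 name is inferred from the split + '_IMAGES_' + base_filename pattern in the code, so adjust the paths to your own output folder.

import json
import h5py

cpi = 5  # captions per image used when the files were created

with open('TRAIN_CAPTIONS_coco_5_cap_per_img_5_min_word_freq.json') as f:
    enc_captions = json.load(f)         # flat list: 5 encoded captions per image, in image order
with open('TRAIN_CAPLENS_coco_5_cap_per_img_5_min_word_freq.json') as f:
    caplens = json.load(f)              # same length and order as enc_captions

with h5py.File('TRAIN_IMAGES_coco_5_cap_per_img_5_min_word_freq.hdf5', 'r') as h:
    assert h.attrs['captions_per_image'] == cpi
    assert len(enc_captions) == len(caplens) == h['images'].shape[0] * cpi

    i = 143                             # a caption index (0-based)
    caption, length = enc_captions[i], caplens[i]
    img = h['images'][i // cpi]         # the image that caption i describes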
Thanks! It’s great to hear you play too. He’s my favorite hero.