Strange behavior, potentially a bug.
See original GitHub issue
I am trying to adapt this repository to an OCR task and am facing the same dilemma.
While training, you have three different sizes encoded for each image in the dataset:
- the actual tensor size
- the field 'size' - which means what, exactly?
- the field 'orig_size' - which I believe means the original size of the image in the dataset
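For context, here is how I understand these fields get populated in DETR's data pipeline (a simplified sketch based on datasets/coco.py and util/misc.py; the concrete sizes below are made up for illustration):

import torch

h0, w0 = 480, 640  # raw image size on disk
target = {
    'orig_size': torch.tensor([h0, w0]),  # set once when the image is loaded, never changed
    'size': torch.tensor([h0, w0]),       # updated by every resize/crop transform
}

# After a RandomResize augmentation to, say, 800x1066:
target['size'] = torch.tensor([800, 1066])

# The collate step then pads every image in the batch to the batch-wide maximum,
# so the tensor pulled out of the batch can be, e.g., (3, 832, 1100),
# matching neither 'size' nor 'orig_size'.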
So if you try to print the boxes for a dataset element with a batch size bigger than 1 (I checked it with 5), you will see that the same picture gets different box coordinates depending on the random batch sampling.
Look below. This function will draw the boxes on the image correctly, but only when batch_size=1, or when all pictures in your dataset are the same size, or when you take W and H for scaling from target['size'], which seems wrong to me.
# img: a (3, H, W) tensor taken from the batch, so H and W are the padded batch size
# target: the labels dict for this particular image
def showImageFromBatch(img, target):
    from PIL import ImageDraw, ImageFont
    import torchvision.transforms.functional as TF
    from util.box_ops import box_cxcywh_to_xyxy

    img = TF.to_pil_image(img)  # convert the CHW tensor to a PIL image for drawing
    draw = ImageDraw.Draw(img)
    boxes = box_cxcywh_to_xyxy(target['boxes'])  # normalized (cx, cy, w, h) -> (x1, y1, x2, y2)
    cl = target['labels']
    print('Image:', (img.height, img.width), target['size'], target['orig_size'])
    H, W = target['size']            # <<< works well only with this
    # W, H = img.width, img.height   # <<< but it must work with this!!!
    boxes[:, 0::2] *= W              # scale normalized x coordinates to pixels
    boxes[:, 1::2] *= H              # scale normalized y coordinates to pixels
    font = ImageFont.truetype("DejaVuSansMono.ttf", 20)
    for i in range(len(boxes)):
        x1, y1, x2, y2 = boxes[i].tolist()
        color = (0, 255, 0) if cl[i] >= 0 else (0, 0, 0)
        draw.rectangle((x1, y1, x2, y2), outline=color, width=3)
        draw.text((x1, y1), str(cl[i].item()), color, font=font)
    img.show()
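For what it is worth, the two scalings only agree if the padding is removed first. A minimal workaround sketch (the helper name is mine; it assumes DETR pads images at the bottom and right, as util/misc.py suggests, and that target['size'] holds (H, W)):

def showImageFromBatchCropped(img, target):
    # Crop the zero padding away so the PIL image size and target['size'] coincide.
    H, W = target['size'].tolist()
    showImageFromBatch(img[:, :H, :W], target)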
Please clarify this situation. Thank you in advance.
Issue Analytics
- Created: 3 years ago
- Comments: 13 (5 by maintainers)
Top GitHub Comments
@fmassa Thank you, that is exactly what I was asking for! So you provide the information about the real image via the mask.
Hi again,
So, I think that the main thing we need to take into account here is that a Transformer encoder is permutation-equivariant. This means that we can shuffle all the pixels in the feature map, and the output of the encoder will be shuffled accordingly. In the same vein, the Transformer decoder is permutation-invariant with respect to the feature maps that we feed it, which means that the order in which we feed the input pixels doesn't matter.
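A tiny self-contained check of that equivariance claim (an illustrative sketch using plain PyTorch, not DETR code; dropout must be disabled so the two passes are deterministic):

import torch
import torch.nn as nn

enc = nn.TransformerEncoder(nn.TransformerEncoderLayer(d_model=32, nhead=4), num_layers=1)
enc.eval()  # disable dropout so both forward passes are deterministic

x = torch.randn(10, 1, 32)   # (sequence, batch, features), no positional encoding added
perm = torch.randperm(10)
with torch.no_grad():
    shuffled_after = enc(x)[perm]    # encode first, then shuffle the outputs
    shuffled_before = enc(x[perm])   # shuffle the inputs, then encode
print(torch.allclose(shuffled_after, shuffled_before, atol=1e-5))  # prints True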
With that in mind, the only way the transformer can predict relative coordinates is through the positional encoding that we feed it. I described in my previous post how the positional encoding takes care of the objects inside the image. But I didn't describe what the mask is or how it is calculated, which I'm doing now.
If you look at https://github.com/facebookresearch/detr/blob/be9d447ea3208e91069510643f75dadb7e9d163d/util/misc.py#L283-L300, which is used in the collate_fn that we pass to the DataLoader, you'll see that images of different sizes are padded so that they all have the same size. But we also keep another tensor (named mask) around, which holds the information about which pixels belong to the image and which ones are just padding and should not be taken into account. This mask is used in a few places in the model. So while the feature maps from the CNN will indeed be different for different image sizes, the transformer will only look at the regions which do not correspond to padding. And for predicting the 0-1 relative coordinates of the boxes, it will also only look at the features inside the image.
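For readers who want the mechanics, here is a condensed sketch of what that padding/mask step does, and how the sine positional encoding consumes the mask (simplified from util/misc.py and models/position_encoding.py; the real code handles more cases):

import torch

def nested_tensor_from_list(images):  # images: list of (3, Hi, Wi) tensors
    H = max(img.shape[1] for img in images)
    W = max(img.shape[2] for img in images)
    batch = torch.zeros(len(images), 3, H, W)               # zero padding at bottom/right
    mask = torch.ones(len(images), H, W, dtype=torch.bool)  # True marks padded pixels
    for img, pad_img, m in zip(images, batch, mask):
        c, h, w = img.shape
        pad_img[:, :h, :w].copy_(img)
        m[:h, :w] = False                                   # False marks real image pixels
    return batch, mask

# The sine positional encoding normalizes pixel coordinates within the real image
# region only, which is what lets the model predict 0-1 relative box coordinates
# regardless of how much padding the batch added:
def relative_coordinates(mask):
    not_mask = ~mask
    y_embed = not_mask.cumsum(1, dtype=torch.float32)  # row index over real pixels only
    x_embed = not_mask.cumsum(2, dtype=torch.float32)  # column index, likewise
    y_embed = y_embed / (y_embed[:, -1:, :] + 1e-6)    # normalize to (0, 1] inside the image
    x_embed = x_embed / (x_embed[:, :, -1:] + 1e-6)
    return x_embed, y_embed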
This can be a bit hard to explain, but please let us know if this isn’t clear and we will try to explain it better.