Strange behavior, potentially a bug

I am trying to adapt this repository to an OCR task and am facing the same dilemma.

During training you have three different image sizes encoded in the dataset:

  1. The actual tensor size
  2. The field ‘size’ - which means what?
  3. The field ‘orig_size’ - which I believe means the original size of the image in the dataset
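
For reference, here is a minimal sketch of how I inspect those three sizes for a single dataset element (inspect_sizes is just an illustrative helper, not code from the repo; the field names follow datasets/coco.py):

def inspect_sizes(dataset, index=0):
    img, target = dataset[index]                                  # img: (3, H, W) tensor after transforms
    print('actual tensor size :', tuple(img.shape[-2:]))          # H, W after the resize augmentation
    print("target['size']     :", target['size'].tolist())        # appears to also be H, W after the transforms
    print("target['orig_size']:", target['orig_size'].tolist())   # H, W of the raw image on disk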

So if you try to print the boxes of a dataset element with a batch size bigger than 1 (I checked with 5), you will see that, because of random batch sampling, the same picture ends up with different box coordinates.

Look below. This function will draw the boxes on the image correctly, but only when batch_size=1, or when all pictures in your dataset are the same size, or when you take W and H for scaling from target[‘size’], which is wrong.

# img: a (3, H, W) image tensor taken from a padded batch (same H and W across the batch)
# target: labels for this particular image
def showImageFromBatch(img, target):
    from PIL import Image, ImageDraw, ImageFont
    from torchvision.transforms.functional import to_pil_image
    from util.box_ops import box_cxcywh_to_xyxy

    if not isinstance(img, Image.Image):
        # ImageDraw needs a PIL image; this assumes the tensor values are in [0, 1]
        # (i.e. the normalization transform has been undone)
        img = to_pil_image(img)
    draw = ImageDraw.Draw(img)
    boxes = target['boxes']
    cl = target['labels']

    if 1:  # boxes.max() <= 1:
        boxes = box_cxcywh_to_xyxy(boxes)

        print('Image:', (img.height, img.width), target['size'], target['orig_size'])

        H, W = target['size']            # <<< works well only with this
        W, H = img.width, img.height     # <<< but it must work with this!!!

        boxes[:, 0::2] *= W
        boxes[:, 1::2] *= H

    for i in range(len(boxes)):
        x1, y1, x2, y2 = boxes[i].tolist()
        draw.rectangle((x1, y1, x2, y2), outline=(0, 255, 0) if cl[i] >= 0 else (0, 0, 0), width=3)
        draw.text((x1, y1), str(cl[i].item()), (0, 255, 0) if cl[i] >= 0 else (0, 0, 0),
                  font=ImageFont.truetype("DejaVuSansMono.ttf", 20))
    img.show()
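
For completeness, this is roughly how I pull single elements out of a batch to call the function above (a sketch assuming DETR's util.misc.collate_fn, which returns a NestedTensor; the loop itself is only illustrative):

from torch.utils.data import DataLoader
from util.misc import collate_fn

loader = DataLoader(dataset, batch_size=5, shuffle=True, collate_fn=collate_fn)
samples, targets = next(iter(loader))    # samples is a NestedTensor (tensors + mask)
for i in range(len(targets)):
    # samples.tensors[i] is padded to the largest H, W in the batch, which is why
    # scaling boxes by img.width / img.height goes wrong for batch_size > 1
    showImageFromBatch(samples.tensors[i], targets[i])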

Please clarify this situation. Thank you in advance.

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 13 (5 by maintainers)

Top GitHub Comments

3 reactions
MaratZakirov commented, Jun 9, 2020

@fmassa Thank you, that is exactly what I was asking for! So you provide the information about the real image via the mask.

2 reactions
fmassa commented, Jun 9, 2020

Hi again,

So, I think that the main thing that we need to take into account here is that a Transformer encoder is permutation-equivariant. This means that we can shuffle all the pixels in the feature map, and the output of the encoder will be shuffled accordingly. In the same vein, the Transformer decoder is permutation-invariant with respect to the feature maps that we feed, which means that the order in which we feed the input pixels doesn’t matter.

With that in mind, the only way the transformer can predict relative coordinates is through the positional encoding that we feed in. I’ve described in my previous post that the positional encoding takes care of the objects inside the image. But I didn’t describe what the mask is or how it is calculated, which I’m doing now.

If you look at https://github.com/facebookresearch/detr/blob/be9d447ea3208e91069510643f75dadb7e9d163d/util/misc.py#L283-L300, which is used in the collate_fn that we pass to the DataLoader, you’ll see that images of different sizes are padded so that they have the same size. But we also keep another tensor (named mask) around, which holds the information about which pixels belong to the image and which ones are just padding and should not be taken into account.
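
In spirit, the padding and mask construction do something like the following (a simplified sketch, not the actual code from util/misc.py):

import torch

def pad_and_mask(tensor_list):
    # pad every image in the batch to the largest H and W
    max_h = max(img.shape[1] for img in tensor_list)
    max_w = max(img.shape[2] for img in tensor_list)
    batch = torch.zeros(len(tensor_list), 3, max_h, max_w)
    # mask is True on padded pixels and False on real image pixels
    mask = torch.ones(len(tensor_list), max_h, max_w, dtype=torch.bool)
    for i, img in enumerate(tensor_list):
        _, h, w = img.shape
        batch[i, :, :h, :w].copy_(img)
        mask[i, :h, :w] = False
    return batch, mask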

This mask is used at a few places:

  • the positional encoding, in order to properly compute the coordinates inside the image without taking padding into account (sketched after this list)
  • the transformer, which does not take the padded regions into account
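
As a rough sketch of the first point (simplified from PositionEmbeddingSine in models/position_encoding.py; the actual sine/cosine embedding and scale factor are omitted):

import torch

def normalized_coords(mask):
    # mask: (B, H, W) bool tensor, True on padded pixels
    not_mask = ~mask
    y_embed = not_mask.cumsum(1, dtype=torch.float32)   # row position counted inside the real image
    x_embed = not_mask.cumsum(2, dtype=torch.float32)   # column position counted inside the real image
    # normalize by the last valid row / column, so coordinates span 0..1 over the
    # real image content no matter how much padding the batch added
    y_embed = y_embed / (y_embed[:, -1:, :] + 1e-6)
    x_embed = x_embed / (x_embed[:, :, -1:] + 1e-6)
    return x_embed, y_embed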

So while the feature maps from the CNN will indeed be different for different image sizes, the transformer will only look at the regions which do not correspond to padding. And for predicting the 0-1 relative coordinates of the boxes, it will also only look at the features inside the image.
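
Concretely, for the drawing question above: the un-padded size can be recovered from the mask, and it should match target['size'], which is the size the normalized box coordinates refer to (an illustrative sketch, assuming samples is the NestedTensor returned by the collate_fn):

img = samples.tensors[i]               # (3, H_pad, W_pad), includes padding
mask = samples.mask[i]                 # (H_pad, W_pad), True on padded pixels
valid_h = int((~mask[:, 0]).sum())     # number of real rows (padding sits at the bottom / right)
valid_w = int((~mask[0, :]).sum())     # number of real columns
# valid_h, valid_w should equal targets[i]['size']; the 0-1 box coordinates are
# relative to this un-padded size, not to the padded tensor size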

This can be a bit hard to explain, but please let us know if this isn’t clear and we will try to explain it better.
