OWL-ViT outputs are offset for non-square images

See original GitHub issue

System Info

  • transformers version: 4.21.1
  • Platform: Linux-5.10.43.3-microsoft-standard-WSL2-x86_64-with-glibc2.29
  • Python version: 3.8.10
  • Huggingface_hub version: 0.8.1
  • PyTorch version (GPU?): 1.12.1+cu102 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: NA

Who can help?

@alaradirik @sgugger @NielsRogge

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

Running the example code snippet for OWL-ViT on a large, non-square Unsplash image (https://images.unsplash.com/photo-1517448922956-1efc1c6cc09c) gives an incorrect result: the bounding boxes are offset. When the image is cropped to a square first, the result is actually correct.

import requests
from PIL import Image
import torch

from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

url = "https://images.unsplash.com/photo-1517448922956-1efc1c6cc09c"
image = Image.open(requests.get(url, stream=True).raw)
texts = [["flag", "car", "person", "sidewalk", "bicycle"]]
inputs = processor(text=texts, images=image, return_tensors="pt")
outputs = model(**inputs)

# Target image sizes (height, width) to rescale box predictions [batch_size, 2]
target_sizes = torch.Tensor([image.size[::-1]])
# Convert outputs (bounding boxes and class logits) to COCO API
results = processor.post_process(outputs=outputs, target_sizes=target_sizes)

i = 0  # Retrieve predictions for the first image for the corresponding text queries
text = texts[i]
boxes, scores, labels = results[i]["boxes"], results[i]["scores"], results[i]["labels"]

This is the result: the yellow flag is detected, but its bounding box is offset. [screenshot: detections with the offset flag box]
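For reference, here is a minimal sketch of the cropping workaround mentioned in the Reproduction note: center-cropping the image to a square before preprocessing. The crop logic is illustrative and not part of the original report.

import requests
import torch
from PIL import Image

from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

url = "https://images.unsplash.com/photo-1517448922956-1efc1c6cc09c"
image = Image.open(requests.get(url, stream=True).raw)

# Center-crop to the largest square that fits in the image
side = min(image.size)
left = (image.width - side) // 2
top = (image.height - side) // 2
image = image.crop((left, top, left + side, top + side))

texts = [["flag", "car", "person", "sidewalk", "bicycle"]]
inputs = processor(text=texts, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Rescale the predicted boxes to the cropped (square) image size
target_sizes = torch.Tensor([image.size[::-1]])
results = processor.post_process(outputs=outputs, target_sizes=target_sizes)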

Expected behavior

The post_process() method should correctly rescale the bounding boxes to the original image size. See the Spaces demo (which uses cropping), where the flag is detected at the correct position. [screenshot: the Spaces demo result]
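For context, the rescaling that post_process() is expected to perform amounts to converting the model's normalized (cx, cy, w, h) predictions to corner coordinates and multiplying by the target height and width. A minimal sketch of that conversion (not the library's implementation):

import torch

def rescale_boxes(pred_boxes, target_size):
    """Convert normalized (cx, cy, w, h) boxes to absolute (x0, y0, x1, y1).

    pred_boxes: tensor of shape [num_queries, 4] with values in [0, 1]
    target_size: (height, width) of the original image
    """
    cx, cy, w, h = pred_boxes.unbind(-1)
    corners = torch.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], dim=-1)
    img_h, img_w = target_size
    return corners * torch.tensor([img_w, img_h, img_w, img_h])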

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

2 reactions
alaradirik commented on Aug 11, 2022

@segments-tobias @cceyda thank you both for your input! The issue was due to defining the size as a single value instead of a tuple (768 instead of (768, 768)) in OwlViTFeatureExtractor. This led to the image(s) getting resized along only one dimension and then cropped along the other dimension later in the preprocessing pipeline.

The configuration files are updated and the OwlViTProcessor can correctly resize the input images now. I’ll open another PR to update the default values in OwlViTFeatureExtractor but I’m closing this issue as it is fixed.
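To illustrate the difference described above (a standalone sketch with a dummy image, not the OwlViTFeatureExtractor code): a single int typically means "resize the shorter edge and keep the aspect ratio", so a later square center crop discards part of a non-square image, whereas a (height, width) tuple resizes both dimensions and nothing is cropped.

from PIL import Image

image = Image.new("RGB", (1920, 1080))  # dummy non-square image

# size = 768 (int): keep the aspect ratio, then center-crop to 768x768,
# which throws away the left/right edges of a wide image
scale = 768 / min(image.size)
resized = image.resize((round(image.width * scale), round(image.height * scale)))
left = (resized.width - 768) // 2
cropped = resized.crop((left, 0, left + 768, 768))  # content is lost here

# size = (768, 768) (tuple): the whole image is resized, nothing is cropped
resized_full = image.resize((768, 768))

print(cropped.size, resized_full.size)  # both (768, 768), but different content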

1 reaction
alaradirik commented on Aug 10, 2022

Hi @segments-tobias, thanks for opening the PR! @cceyda's PR fixed the demo and I confirmed that the post_process() method works fine. The following code draws the bounding boxes correctly:

import cv2
import numpy as np
import torch

from urllib.request import urlopen
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

# Download image
url = "https://images.unsplash.com/photo-1517448922956-1efc1c6cc09c"
array = np.asarray(bytearray(urlopen(url).read()), dtype=np.uint8)
image = cv2.cvtColor(cv2.imdecode(array, -1), cv2.COLOR_BGR2RGB)

# Text queries
texts = [["flag", "car", "person", "sidewalk", "bicycle"]]

# Target image sizes (height, width) to rescale box predictions [batch_size, 2]
target_sizes = torch.Tensor([image.shape[:2]])
# Manually resize to a square before preprocessing (workaround; see the note below)
img_input = cv2.resize(image, (768, 768), interpolation=cv2.INTER_AREA)
inputs = processor(text=texts, images=img_input, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Convert outputs (bounding boxes and class logits) to COCO API
results = processor.post_process(outputs=outputs, target_sizes=target_sizes)

i = 0  # Retrieve predictions for the first image for the corresponding text queries
text = texts[i]
boxes, scores, labels = results[i]["boxes"], results[i]["scores"], results[i]["labels"]

font = cv2.FONT_HERSHEY_SIMPLEX
score_threshold = 0.05

for box, score, label in zip(boxes, scores, labels):
    box = [int(coord) for coord in box.tolist()]

    if score >= score_threshold:
        # Draw the box on the original-size image
        image = cv2.rectangle(image, box[:2], box[2:], (255, 0, 0), 5)

        # Place the label below the box unless it would run past the bottom of the image
        if box[3] + 25 > image.shape[0]:
            y = box[3] - 10
        else:
            y = box[3] + 25

        image = cv2.putText(
            image, text[label], (box[0], y), font, 1, (255, 0, 0), 2, cv2.LINE_AA
        )

I think there is an issue in OwlViTFeatureExtractor as omitting the manual resizing line causes unexpected outputs. I’ll double check this and open a fix PR shortly.
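As a quick check of the behaviour mentioned above, one can inspect the shape of the pixel_values returned by the processor for a non-square input; after the configuration fix it should come out as a square 768x768 tensor without any manual resizing. A small sketch with a dummy image (the query text is arbitrary):

from PIL import Image
from transformers import OwlViTProcessor

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
dummy = Image.new("RGB", (1920, 1080))  # any non-square image
inputs = processor(text=[["flag"]], images=dummy, return_tensors="pt")
print(inputs["pixel_values"].shape)  # expected: torch.Size([1, 3, 768, 768])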
