OWL-ViT outputs are offset for non-square images

See original GitHub issue

System Info

  • transformers version: 4.21.1
  • Platform: Linux-5.10.43.3-microsoft-standard-WSL2-x86_64-with-glibc2.29
  • Python version: 3.8.10
  • Huggingface_hub version: 0.8.1
  • PyTorch version (GPU?): 1.12.1+cu102 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: NA

Who can help?

@alaradirik @sgugger @NielsRogge

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

Running the example code snippet for OWL-ViT on a large, non-square Unsplash image (https://images.unsplash.com/photo-1517448922956-1efc1c6cc09c) gives an incorrect result: the bounding boxes are offset. When the image is cropped to a square first, the result is actually correct.

import requests
from PIL import Image
import torch

from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

url = "https://images.unsplash.com/photo-1517448922956-1efc1c6cc09c"
image = Image.open(requests.get(url, stream=True).raw)
texts = [["flag", "car", "person", "sidewalk", "bicycle"]]
inputs = processor(text=texts, images=image, return_tensors="pt")
outputs = model(**inputs)

# Target image sizes (height, width) to rescale box predictions [batch_size, 2]
target_sizes = torch.Tensor([image.size[::-1]])
# Convert outputs (bounding boxes and class logits) to COCO API
results = processor.post_process(outputs=outputs, target_sizes=target_sizes)

i = 0  # Retrieve predictions for the first image for the corresponding text queries
text = texts[i]
boxes, scores, labels = results[i]["boxes"], results[i]["scores"], results[i]["labels"]

This is the result: the yellow flag is detected, but its bounding box is offset. [screenshot: detections with the offset flag box]
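For reference, here is a minimal sketch of the cropping workaround mentioned in the Reproduction note: center-cropping the image to a square before preprocessing. The crop logic is illustrative and not part of the original report.

import requests
import torch
from PIL import Image

from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

url = "https://images.unsplash.com/photo-1517448922956-1efc1c6cc09c"
image = Image.open(requests.get(url, stream=True).raw)

# Center-crop to the largest square that fits in the image
side = min(image.size)
left = (image.width - side) // 2
top = (image.height - side) // 2
image = image.crop((left, top, left + side, top + side))

texts = [["flag", "car", "person", "sidewalk", "bicycle"]]
inputs = processor(text=texts, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Rescale the predicted boxes to the cropped (square) image size
target_sizes = torch.Tensor([image.size[::-1]])
results = processor.post_process(outputs=outputs, target_sizes=target_sizes)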

Expected behavior

The post_process() method should correctly rescale the bounding boxes to the original image size. See the Spaces demo (which uses cropping), where the flag is detected at the correct position. [screenshot: the Spaces demo result]
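For context, the rescaling that post_process() is expected to perform amounts to converting the model's normalized (cx, cy, w, h) predictions to corner coordinates and multiplying by the target height and width. A minimal sketch of that conversion (not the library's implementation):

import torch

def rescale_boxes(pred_boxes, target_size):
    """Convert normalized (cx, cy, w, h) boxes to absolute (x0, y0, x1, y1).

    pred_boxes: tensor of shape [num_queries, 4] with values in [0, 1]
    target_size: (height, width) of the original image
    """
    cx, cy, w, h = pred_boxes.unbind(-1)
    corners = torch.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], dim=-1)
    img_h, img_w = target_size
    return corners * torch.tensor([img_w, img_h, img_w, img_h])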

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

2 reactions
alaradirik commented on Aug 11, 2022

@segments-tobias @cceyda thank you both for your input! The issue was due to defining the size as a single value instead of a tuple (768 instead of (768, 768)) in OwlViTFeatureExtractor. This led to the image(s) getting resized along only one dimension and then cropped along the other dimension later in the preprocessing pipeline.

The configuration files are updated and the OwlViTProcessor can correctly resize the input images now. I’ll open another PR to update the default values in OwlViTFeatureExtractor but I’m closing this issue as it is fixed.
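To illustrate the difference described above (a standalone sketch with a dummy image, not the OwlViTFeatureExtractor code): a single int typically means "resize the shorter edge and keep the aspect ratio", so a later square center crop discards part of a non-square image, whereas a (height, width) tuple resizes both dimensions and nothing is cropped.

from PIL import Image

image = Image.new("RGB", (1920, 1080))  # dummy non-square image

# size = 768 (int): keep the aspect ratio, then center-crop to 768x768,
# which throws away the left/right edges of a wide image
scale = 768 / min(image.size)
resized = image.resize((round(image.width * scale), round(image.height * scale)))
left = (resized.width - 768) // 2
cropped = resized.crop((left, 0, left + 768, 768))  # content is lost here

# size = (768, 768) (tuple): the whole image is resized, nothing is cropped
resized_full = image.resize((768, 768))

print(cropped.size, resized_full.size)  # both (768, 768), but different content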

1 reaction
alaradirik commented on Aug 10, 2022

Hi @segments-tobias, thanks for opening the PR! @cceyda's PR fixed the demo and I confirmed that the post_process() method works fine. The following code draws the bounding boxes correctly:

import cv2
import numpy as np
import torch

from urllib.request import urlopen
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

# Download image
url = "https://images.unsplash.com/photo-1517448922956-1efc1c6cc09c"
array = np.asarray(bytearray(urlopen(url).read()), dtype=np.uint8)
image = cv2.cvtColor(cv2.imdecode(array, -1), cv2.COLOR_BGR2RGB)

# Text queries
texts = [["flag", "car", "person", "sidewalk", "bicycle"]]

# Target image sizes (height, width) to rescale box predictions [batch_size, 2]
target_sizes = torch.Tensor([image.shape[:2]])
# Manually resize to a square before preprocessing (workaround; see the note below)
img_input = cv2.resize(image, (768, 768), interpolation=cv2.INTER_AREA)
inputs = processor(text=texts, images=img_input, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Convert outputs (bounding boxes and class logits) to COCO API
results = processor.post_process(outputs=outputs, target_sizes=target_sizes)

i = 0  # Retrieve predictions for the first image for the corresponding text queries
text = texts[i]
boxes, scores, labels = results[i]["boxes"], results[i]["scores"], results[i]["labels"]

font = cv2.FONT_HERSHEY_SIMPLEX
score_threshold = 0.05

for box, score, label in zip(boxes, scores, labels):
    box = [int(coord) for coord in box.tolist()]

    if score >= score_threshold:
        # Draw the box on the original-size image
        image = cv2.rectangle(image, box[:2], box[2:], (255, 0, 0), 5)

        # Place the label below the box unless it would run past the bottom of the image
        if box[3] + 25 > image.shape[0]:
            y = box[3] - 10
        else:
            y = box[3] + 25

        image = cv2.putText(
            image, text[label], (box[0], y), font, 1, (255, 0, 0), 2, cv2.LINE_AA
        )

I think there is an issue in OwlViTFeatureExtractor as omitting the manual resizing line causes unexpected outputs. I’ll double check this and open a fix PR shortly.
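As a quick check of the behaviour mentioned above, one can inspect the shape of the pixel_values returned by the processor for a non-square input; after the configuration fix it should come out as a square 768x768 tensor without any manual resizing. A small sketch with a dummy image (the query text is arbitrary):

from PIL import Image
from transformers import OwlViTProcessor

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
dummy = Image.new("RGB", (1920, 1080))  # any non-square image
inputs = processor(text=[["flag"]], images=dummy, return_tensors="pt")
print(inputs["pixel_values"].shape)  # expected: torch.Size([1, 3, 768, 768])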
