OWL-ViT outputs are offset for non-square images
System Info
- transformers version: 4.21.1
- Platform: Linux-5.10.43.3-microsoft-standard-WSL2-x86_64-with-glibc2.29
- Python version: 3.8.10
- Huggingface_hub version: 0.8.1
- PyTorch version (GPU?): 1.12.1+cu102 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: NA
Who can help?
@alaradirik @sgugger @NielsRogge
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
Using the code snippet for OWL-ViT on a large Unsplash image (https://images.unsplash.com/photo-1517448922956-1efc1c6cc09c) gives an incorrect result. The bounding boxes seem offset. When cropping the image, the result is actually correct.
import requests
from PIL import Image
import torch
from transformers import OwlViTProcessor, OwlViTForObjectDetection
processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")
url = "https://images.unsplash.com/photo-1517448922956-1efc1c6cc09c"
image = Image.open(requests.get(url, stream=True).raw)
texts = [["flag", "car", "person", "sidewalk", "bicycle"]]
inputs = processor(text=texts, images=image, return_tensors="pt")
outputs = model(**inputs)
# Target image sizes (height, width) to rescale box predictions [batch_size, 2]
target_sizes = torch.Tensor([image.size[::-1]])
# Convert outputs (bounding boxes and class logits) to COCO API
results = processor.post_process(outputs=outputs, target_sizes=target_sizes)
i = 0 # Retrieve predictions for the first image for the corresponding text queries
text = texts[i]
boxes, scores, labels = results[i]["boxes"], results[i]["scores"], results[i]["labels"]
This is the result: note that the yellow flag is detected, but the bounding box is offset.
Expected behavior
The post_process() method should correctly rescale the bounding boxes to the original image size.
See the Spaces demo (which uses cropping), which shows the flag detection at the right position.
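To make the expected rescaling concrete, here is a minimal sketch in plain Python. It assumes the model emits boxes as normalized (center_x, center_y, width, height) values in [0, 1], as DETR-style detectors do. The helper name rescale_boxes is hypothetical and this is not the library's implementation, only an illustration of the arithmetic.

```python
def rescale_boxes(boxes_cxcywh, target_size):
    """Convert normalized (cx, cy, w, h) boxes to absolute (x0, y0, x1, y1).

    target_size is (height, width) of the original image, matching the
    (height, width) order used for target_sizes in the snippet above.
    Hypothetical helper, not part of transformers.
    """
    height, width = target_size
    rescaled = []
    for cx, cy, w, h in boxes_cxcywh:
        # Center format -> corner format, still normalized to [0, 1].
        x0, y0 = cx - w / 2, cy - h / 2
        x1, y1 = cx + w / 2, cy + h / 2
        # Scale x coordinates by width and y coordinates by height.
        rescaled.append((x0 * width, y0 * height, x1 * width, y1 * height))
    return rescaled


if __name__ == "__main__":
    # A box centered in a 1500x1000 (width x height) image.
    print(rescale_boxes([(0.5, 0.5, 0.2, 0.4)], target_size=(1000, 1500)))
```

This mapping is only correct if the normalized coordinates refer to the whole original image. If preprocessing crops the image first, the same arithmetic places the boxes at an offset.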
Issue Analytics
- State:
- Created a year ago
- Comments:6 (3 by maintainers)
Top GitHub Comments
@segments-tobias @cceyda thank you both for your input! The issue was due to defining the size as a single value instead of a tuple (768 instead of (768, 768)) in OwlViTFeatureExtractor. This led to the image(s) getting resized along only one dimension and cropped along the other dimension later in the preprocessing pipeline.
The configuration files are updated and the OwlViTProcessor can correctly resize the input images now. I'll open another PR to update the default values in OwlViTFeatureExtractor, but I'm closing this issue as it is fixed.

Hi @segments-tobias, thanks for opening the PR! @cceyda's PR fixed the demo and I confirmed that the post_process() method works fine. The following code prints the bounding boxes correctly:
I think there is an issue in OwlViTFeatureExtractor, as omitting the manual resizing line causes unexpected outputs. I'll double check this and open a fix PR shortly.
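
The cause described in these comments can be illustrated with a small geometry sketch. It assumes that a single integer size is treated as "resize the shorter side to that value, then center-crop a square", which matches the description above but is not copied from the library. All function names and the 1500x1000 example image are hypothetical.

```python
def shorter_side_resize(width, height, size):
    """Scale so the shorter side equals `size`, keeping the aspect ratio."""
    scale = size / min(width, height)
    return round(width * scale), round(height * scale)


def center_crop_offset(width, height, size):
    """Top-left corner (left, top) of a centered size x size crop."""
    return (width - size) // 2, (height - size) // 2


def naive_rescale(box, orig_w, orig_h):
    """Scale a normalized (x0, y0, x1, y1) box by the original image size,
    as if the model had seen the whole image."""
    x0, y0, x1, y1 = box
    return (x0 * orig_w, y0 * orig_h, x1 * orig_w, y1 * orig_h)


def crop_aware_rescale(box, orig_w, orig_h, size):
    """Map a box normalized to the cropped square back to the original image."""
    resized_w, resized_h = shorter_side_resize(orig_w, orig_h, size)
    left, top = center_crop_offset(resized_w, resized_h, size)
    back = min(orig_w, orig_h) / size  # undo the shorter-side resize
    x0, y0, x1, y1 = box
    return ((x0 * size + left) * back, (y0 * size + top) * back,
            (x1 * size + left) * back, (y1 * size + top) * back)


if __name__ == "__main__":
    box = (0.25, 0.25, 0.5, 0.5)  # normalized to the 768x768 crop
    print("naive:     ", naive_rescale(box, 1500, 1000))
    print("crop-aware:", crop_aware_rescale(box, 1500, 1000, 768))
```

For a landscape image the two results differ along the x axis, which is the kind of horizontal offset reported in this issue. Passing a (768, 768) tuple avoids the crop, so scaling by the original width and height becomes valid again.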