Add image-guided object detection support to OWL-ViT
Hi,
The OWL-ViT model is an open-vocabulary model that can be used for both zero-shot text-guided (supported) and one-shot image-guided (not supported) object detection.
It'd be great to add support for one-shot object detection to OwlViTForObjectDetection, so that users can query a target image with an image of the target object instead of a text query - e.g. using an image of a butterfly to find all butterfly instances in the target image.
To do this, we would just need to compute the OwlViTModel (an alias of CLIP) embeddings of the query images and use them in place of the text query embeddings within OwlViTForObjectDetection.forward(), which would take the target image plus either text queries or image queries as input. Similarly, OwlViTProcessor would be updated to preprocess both (image, text) and (image, query_image) pairs.
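To illustrate why this is a small change: once a query embedding lives in the shared CLIP embedding space, the downstream scoring of predicted boxes is identical whether the embedding came from a text query or a query image. Below is a minimal NumPy sketch of that scoring step, using random vectors as stand-ins for CLIP embeddings; the function names (`cosine_sim`, `score_boxes`) and the threshold value are illustrative, not part of the actual OWL-ViT implementation.

```python
import numpy as np

def cosine_sim(query, candidates):
    # Cosine similarity between one query vector and a batch of vectors.
    query = query / np.linalg.norm(query)
    candidates = candidates / np.linalg.norm(candidates, axis=-1, keepdims=True)
    return candidates @ query

def score_boxes(box_embeds, query_embed, threshold=0.9):
    """Score each predicted-box embedding against a single query embedding.

    The query embedding may come from either a text query or a query image;
    the scoring logic is the same in both cases, which is why supporting
    image queries mainly means swapping which embedding is fed in here.
    """
    scores = cosine_sim(query_embed, box_embeds)
    keep = scores >= threshold
    return scores, keep

# Toy example: 5 box embeddings, one of which matches the query exactly
# (standing in for a butterfly detected in the target image).
rng = np.random.default_rng(0)
dim = 8
query = rng.normal(size=dim)          # e.g. embedding of a query butterfly image
boxes = rng.normal(size=(5, dim))     # embeddings of predicted boxes
boxes[2] = query                      # plant one matching box
scores, keep = score_boxes(boxes, query, threshold=0.99)
```

With a high threshold, only the planted matching box survives; in practice OWL-ViT scores boxes with a learned classification head rather than raw cosine similarity, so this is only a sketch of the shared-embedding idea.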
@sgugger @NielsRogge @amyeroberts @LysandreJik what do you think about this? Would this be something we would like to support?
Issue Analytics
- Created: a year ago
- Comments: 10 (10 by maintainers)
Top GitHub Comments
sure, will do, thanks for informing!
Hi @amyeroberts @alaradirik, I’m happy to take this up!