Add image-guided object detection support to OWL-ViT
Hi,
The OWL-ViT model is an open-vocabulary model that can be used for both zero-shot text-guided (supported) and one-shot image-guided (not supported) object detection.
It'd be great to add support for one-shot object detection to OwlViTForObjectDetection, so that users can query a target image with an image of the target object instead of a text query - e.g. using an image of a butterfly to find all butterfly instances in the target image.
To do this, we would just need to compute the OwlViTModel (an alias of CLIP) embeddings of the query images and use them in place of the text query embeddings within OwlViTForObjectDetection.forward(), which would take the target image plus either text queries or image queries as input. Similarly, OwlViTProcessor would be updated to preprocess both (image, text) and (image, query_image) pairs.
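To illustrate why this is a small change: once a query embedding lives in the shared CLIP embedding space, the downstream scoring of predicted boxes is identical whether the embedding came from a text query or a query image. Below is a minimal NumPy sketch of that scoring step, using random vectors as stand-ins for CLIP embeddings; the function names (`cosine_sim`, `score_boxes`) and the threshold value are illustrative, not part of the actual OWL-ViT implementation.

```python
import numpy as np

def cosine_sim(query, candidates):
    # Cosine similarity between one query vector and a batch of vectors.
    query = query / np.linalg.norm(query)
    candidates = candidates / np.linalg.norm(candidates, axis=-1, keepdims=True)
    return candidates @ query

def score_boxes(box_embeds, query_embed, threshold=0.9):
    """Score each predicted-box embedding against a single query embedding.

    The query embedding may come from either a text query or a query image;
    the scoring logic is the same in both cases, which is why supporting
    image queries mainly means swapping which embedding is fed in here.
    """
    scores = cosine_sim(query_embed, box_embeds)
    keep = scores >= threshold
    return scores, keep

# Toy example: 5 box embeddings, one of which matches the query exactly
# (standing in for a butterfly detected in the target image).
rng = np.random.default_rng(0)
dim = 8
query = rng.normal(size=dim)          # e.g. embedding of a query butterfly image
boxes = rng.normal(size=(5, dim))     # embeddings of predicted boxes
boxes[2] = query                      # plant one matching box
scores, keep = score_boxes(boxes, query, threshold=0.99)
```

With a high threshold, only the planted matching box survives; in practice OWL-ViT scores boxes with a learned classification head rather than raw cosine similarity, so this is only a sketch of the shared-embedding idea.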
@sgugger @NielsRogge @amyeroberts @LysandreJik what do you think about this? Would this be something we would like to support?
Issue Analytics
- Created: a year ago
- Comments: 10 (10 by maintainers)
Top GitHub Comments
sure, will do, thanks for informing!
Hi @amyeroberts @alaradirik, I’m happy to take this up!