[Discussion] How do we want to handle `torchvision.prototype.features.Feature`'s?
This issue should spark a discussion about how we want to handle `Feature`'s in the future. There are a lot of open questions I'm trying to summarize, and I'll give my opinion on each of them. You can find the current implementation under `torchvision.prototype.features`.
## What are `Feature`'s?

`Feature`'s are subclasses of `torch.Tensor`, and their purpose is threefold:
- With their type, e.g. `Image`, they convey information about the data they carry. The prototype transformations (`torchvision.prototype.transforms`) use this information to automatically dispatch an input to the correct kernel.
- They can optionally carry additional meta data that might be needed for transforming the feature. For example, most geometric transformations can only be performed on bounding boxes if the size of the corresponding image is known.
- They provide a convenient interface for feature-specific functionality, for example transforming the format of a bounding box.
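The three points above can be sketched with a minimal, hypothetical tensor subclass. This is a simplified illustration, not the actual prototype implementation: the `Feature` constructor, the `_meta` dict, and the `image_size` property here are all assumptions made for the example.

```python
import torch

# Minimal sketch of the Feature idea (hypothetical, simplified API):
# a tensor subclass whose type tags the kind of data it holds and which
# can carry extra metadata needed by transforms.
class Feature(torch.Tensor):
    def __new__(cls, data, **meta):
        self = torch.as_tensor(data).as_subclass(cls)
        self._meta = meta  # metadata bag; illustrative only
        return self

class BoundingBox(Feature):
    @property
    def image_size(self):
        # metadata needed by geometric transforms
        return self._meta["image_size"]

box = BoundingBox([10, 20, 50, 60], image_size=(256, 256))
print(isinstance(box, torch.Tensor))  # True: it is still a regular tensor
print(box.image_size)                 # (256, 256)
```

Because the result is still a `torch.Tensor`, existing tensor code keeps working, while a transform can dispatch on `type(box)` and read the attached metadata.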
There are currently three `Feature`'s implemented (`Image`, `BoundingBox`, and `Label`), but in the future we should add at least three more: `SemanticSegmentationMask`, `InstanceSegmentationMask`, and `Video`.
## What is the policy for adding new `Feature`'s?
We could allow subclassing of `Feature`'s. On the one hand, this would make it easier for datasets to conveniently bundle meta data. For example, the COCO dataset could return a `CocoLabel`, which in addition to the default `Label.category` could also have a `super_category` field. On the other hand, this would also mean that the transforms need to handle subclasses of features well, for example treating a `CocoLabel` the same as a `Label`.
I see two downsides with that:

- What if a transform needs the additional meta data carried by a feature subclass? Imagine I've added a special transformation that needs `CocoLabel.super_category`. Although on the surface it now supports plain `Label`'s, it will fail at runtime.
- Documenting custom features is more complicated than documenting a separate field in the sample dictionary of a dataset.

Thus, I'm leaning towards only having a few base classes.
## From what data should a `Feature` be instantiable?

Some of the features like `Image` or `Video` have non-tensor objects that carry the data. Should these features know how to handle them? For example, should something like `Image(PIL.Image.open(...))` work?

My vote is for yes. IMO this is very convenient and also not an unexpected semantic compared to passing the data directly, e.g. `Image(torch.rand(3, 256, 256))`.
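A possible sketch of that convenience is below. It is hypothetical: the duck-typed PIL check and the HWC-to-CHW conversion are illustrative assumptions (real code would test `isinstance(data, PIL.Image.Image)`), and the example runs without Pillow installed.

```python
import torch

class Image(torch.Tensor):
    def __new__(cls, data):
        # Duck-typed stand-in for isinstance(data, PIL.Image.Image),
        # so this sketch does not require Pillow at import time.
        if hasattr(data, "getdata") and hasattr(data, "mode"):
            import numpy as np
            data = torch.as_tensor(np.array(data)).permute(2, 0, 1)  # HWC -> CHW
        return torch.as_tensor(data).as_subclass(cls)

img = Image(torch.rand(3, 256, 256))  # plain tensor input still works
print(type(img).__name__, tuple(img.shape))  # Image (3, 256, 256)
```

The tensor path stays the fast path; the PIL branch only adds a one-time conversion at construction.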
## Should `Feature`'s have a fixed shape?

Consider the following table:

| Feature | `.shape` |
|---|---|
| `Image` | `(*, C, H, W)` |
| `Label` | `(*)` |
| `BoundingBox` | `(*, 4)` |
| `SemanticSegmentationMask` | `(*, H, W)` or `(*, C, H, W)` |
| `InstanceSegmentationMask` | `(*, N, H, W)` |
| `Video` | `(*, T, C, H, W)` |

(For `SemanticSegmentationMask` I'm not sure about the shape yet. Having an extra channel dimension makes the tensor unnecessarily large, but it aligns well with segmentation image files, which are usually stored as RGB.)
Should we fix each feature to a single shape, i.e. remove the `*` from the table above, or should we only require the trailing dimensions to be correct?

My vote is for a flexible shape, since otherwise batching is not possible. For example, if we fix bounding boxes to shape `(4,)`, a transformation would need to transform `N` bounding boxes individually, while for shape `(N, 4)` it could make use of parallelism.
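To illustrate the parallelism argument, here is a sketch of a horizontal flip for XYXY boxes. The helper name and signature are assumptions for this example (torchvision's actual kernels differ), but it shows how one vectorized call handles a single `(4,)` box and an `(N, 4)` batch alike.

```python
import torch

def hflip_boxes(boxes, image_width):
    # boxes: (..., 4) in (x1, y1, x2, y2) format; works for a single box
    # of shape (4,) and for any batch shape (*, 4) alike.
    x1, y1, x2, y2 = boxes.unbind(-1)
    return torch.stack([image_width - x2, y1, image_width - x1, y2], dim=-1)

boxes = torch.tensor([[10., 20., 50., 60.],
                      [0.,  0., 100., 100.]])  # shape (N, 4)
print(hflip_boxes(boxes, image_width=100))
```

With the fixed `(4,)` shape, the same operation would require a Python loop over the `N` boxes, losing the vectorization.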
On the same note, if we go for the flexible shape, do we keep the singular name of the feature? For example, do we still regard a batch of images with shape `(B, C, H, W)` as an `Image`, or should we go for the plural `Images` in general? My vote is for always keeping the singular, since I've often seen something like:

```python
for image, target in data_loader(dataset, batch_size=4):
    ...
```
## Should `Feature`'s have a fixed dtype?

This makes sense for `InstanceSegmentationMask`, which should always be `torch.bool`. For all the other features I'm unsure. My gut says to use a default dtype, but also allow other dtypes.
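A "default dtype, but overridable" policy could look like the following minimal sketch (a hypothetical constructor, not the prototype's actual API):

```python
import torch

class Label(torch.Tensor):
    def __new__(cls, data, dtype=torch.int64):
        # Convert to the default dtype unless the caller overrides it.
        return torch.as_tensor(data, dtype=dtype).as_subclass(cls)

print(Label([1, 2, 3]).dtype)                     # torch.int64 by default
print(Label([1, 2, 3], dtype=torch.int32).dtype)  # but other dtypes allowed
```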
## What meta data should `Feature`'s carry?

IMO, this really depends on the decision above about fixed vs. flexible shapes. If we go for fixed shapes, a feature can basically carry any information. If we go for flexible shapes instead, we should only have meta data that stays the same across a batch of features. For example, `BoundingBox.image_size` is fine, but `Label.category` is not.
## What methods should `Feature`'s provide?

For now I've only included typical conversion methods, but of course this is not exhaustive.

| Feature | method(s) |
|---|---|
| `Image` | `.to_dtype()`, `.to_colorspace()` |
| `Label` | `.to_str()` |
| `BoundingBox` | `.to_format()` |
| `InstanceSegmentationMask` | `.to_semantic()` |
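As an illustration of what a conversion method like `BoundingBox.to_format()` might do under the hood, here is a sketch of an XYXY-to-XYWH conversion, written as a free function for simplicity (the function name and format choice are assumptions for the example):

```python
import torch

def xyxy_to_xywh(boxes):
    # (x1, y1, x2, y2) -> (x, y, w, h); works on any (*, 4) shape.
    x1, y1, x2, y2 = boxes.unbind(-1)
    return torch.stack([x1, y1, x2 - x1, y2 - y1], dim=-1)

box = torch.tensor([10., 20., 50., 60.])
print(xyxy_to_xywh(box))  # tensor([10., 20., 40., 40.])
```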
cc @bjuncek
Issue analytics: created 2 years ago; 19 comments (8 by maintainers).
## Top GitHub Comments
@vadimkantorov I think we already have what you want. Citing @datumbox from https://github.com/pytorch/vision/issues/5045#issuecomment-1034814339:

> The low-level functions will work with vanilla tensors, so you don't have to use the new abstractions if you don't want. Tentative plan is to expose them from `torchvision.transforms.kernels`. Have a look at #5323 or the underlying branch https://github.com/pmeier/vision/tree/transforms/dispatch/torchvision/prototype/transforms/kernels.

I think your comment gets at the motivation behind some of the questions I ask. For example, I think it's significant that we don't force users to extend some kind of `TorchvisionModule` or `LightningModule` (though of course you can use torchvision models in Lightning). Many vision frameworks already extend and build on top of torchvision, and we wouldn't want to inhibit that, so we need to be careful about our abstractions.

That also has to be balanced with what functionality we want to offer, though. For example, it would be more convenient to be able to support non-RGB images (YUV from videos) without extra parameters on every image operator.