Data dependent/unknown shapes
Some libraries, particularly those with a graph-based computational model (e.g., Dask and TensorFlow), support “unknown” or “data dependent” shapes, e.g., due to boolean indexing such as x[y > 0] (https://github.com/data-apis/array-api/issues/84). Other libraries (e.g., JAX and Dask in some cases) do not support some operations because they would produce such data dependent shapes.
We should consider a standard way to represent these shapes in shape attributes, ideally some extension of the “tuple of integers” format used for fully known shapes. For example, TensorFlow and Dask currently use different representations:
- TensorFlow uses a custom TensorShape object (which acts very similarly to tuple), where some values may be None
- Dask uses tuples, where some values may be nan instead of integers
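As a rough illustration of the two conventions, the sketch below uses plain tuples as stand-ins for the shapes each library would report (it does not import TensorFlow or Dask). The helper normalize_shape is hypothetical and only shows how either convention could be mapped onto “int or None” entries.

```python
import math


def normalize_shape(shape):
    """Return a tuple where every unknown dimension is None.

    Hypothetical helper for illustration; not part of any library.
    """
    normalized = []
    for dim in shape:
        # Treat both None (TensorFlow-style) and nan (Dask-style) as unknown.
        if dim is None or (isinstance(dim, float) and math.isnan(dim)):
            normalized.append(None)
        else:
            normalized.append(int(dim))
    return tuple(normalized)


# Stand-ins for a 2-D result whose first dimension is data dependent.
tensorflow_style = (None, 3)
dask_style = (float("nan"), 3)

print(normalize_shape(tensorflow_style))
print(normalize_shape(dask_style))
```

Both stand-ins normalize to the same tuple, which is the kind of common representation the issue is asking about.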
Issue Analytics
- State:
- Created 3 years ago
- Comments:18 (14 by maintainers)
Top GitHub Comments
To summarize the above discussion and the discussions during the consortium meetings (e.g., 7 October 2021), the path forward seems to be as follows:
- When the number of dimensions is unknown, ndim should be None and shape should be None.
- When the number of dimensions is known but some dimensions are unknown, ndim should be an int and shape should be a tuple where known shape dimensions should be ints and unknown shape dimensions should be None.
- When the number of dimensions and all dimensions are known, ndim should be an int and shape should be a tuple whose dimensions should be ints.
- shape should be a tuple. For those use cases where a custom object is needed, the custom object must act like a tuple.
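The cases listed above can be sketched as a small check. This is only an illustration, assuming ndim and shape are passed in as plain Python values; the function name classify_shape and its return strings are hypothetical and not part of the specification.

```python
def classify_shape(ndim, shape):
    """Classify an (ndim, shape) pair according to the cases listed above."""
    if ndim is None:
        # Unknown number of dimensions: shape must be None as well.
        if shape is not None:
            raise ValueError("shape must be None when ndim is None")
        return "unknown ndim"

    # Known number of dimensions: shape must be a tuple (or a tuple subclass).
    if not isinstance(shape, tuple):
        raise TypeError("shape must be a tuple when ndim is known")
    if len(shape) != ndim:
        raise ValueError("len(shape) must equal ndim")
    for dim in shape:
        if dim is not None and not isinstance(dim, int):
            raise TypeError("each dimension must be an int or None")

    if any(dim is None for dim in shape):
        return "known ndim, unknown dimensions"
    return "fully known"


print(classify_shape(None, None))
print(classify_shape(3, (3, None, 100)))
print(classify_shape(2, (4, 5)))
```

Because the check uses isinstance(shape, tuple), a custom object that subclasses tuple would pass it unchanged.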
We can consider adding a functional shape API that supports returning the (dynamic) shape of an array as an array. This would be similar to TensorFlow’s tf.shape and MXNet’s shape_array APIs. This API would allow returning the shape of an array as the result of delayed computation. We can push this decision to the 2022 revision of the specification.
The above would not satisfy @jakirkham’s desire for behavior which poisons subsequent operations; however, as discussed here and elsewhere, NaN does not seem like the right abstraction (e.g., value equality, introducing floating-point semantics, etc). While shape arithmetic is a valid use case, we should consider ways to ensure that this is not a userland concern. For implementations, shape arithmetic can be necessary for allocation (e.g., concat, tiling, etc), but it would seem doable to mimic NaN poison behavior with None and straightforward helper functions.
A custom object as suggested by @shoyer would be doable; however, this is an abstraction which does not currently exist in, and is not used by, array libraries (although it does exist in dataframe libraries, such as pandas) and would be specific to each implementing array library. The advantage of None and NaN is that they exist independently of any one array library.
Accordingly, I think the lowest common denominator is requiring that shape return a tuple (or something tuple-like) and using None for unknown dimensions. While not perfect, this seems to me the most straightforward path atm.
~I don’t think we included anything in the standard that could cause this.~ ~Indexing may make the size of one or more dimensions unknown, but ndim should always be known because for data-dependent operations there’s no squeezing:~
After writing this: we did include squeeze(), so a combination of indexing and explicit squeezing could indeed cause this.
Leaving any property (like .shape) undefined seems unhealthy. I’d say in this case, use ndim = None and shape is None. If dimensionality is known but exact shape isn’t, use None for the unknown dimensions (e.g., shape = (3, None, 100)).
Maybe we should say “shape is tuple” and add a note that if, for backwards compat reasons, an implementation is using a custom object, then it should make sure that it is a subtype of tuple so that it works when users annotate code using .shape with Tuple.
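A minimal sketch of what such a tuple subtype could look like. The class name Shape, its is_fully_known method, and the count_unknown function are hypothetical; the point is only that a subclass of tuple keeps working with isinstance checks, comparison against plain tuples, and user code annotated with Tuple.

```python
from typing import Optional, Tuple


class Shape(tuple):
    """Hypothetical custom shape object that is still a tuple."""

    def is_fully_known(self) -> bool:
        # A dimension of None means its size is not known.
        return all(dim is not None for dim in self)


def count_unknown(shape: Tuple[Optional[int], ...]) -> int:
    """User code annotated with Tuple; a Shape instance is accepted as-is."""
    return sum(1 for dim in shape if dim is None)


shape = Shape((3, None, 100))
print(isinstance(shape, tuple))
print(shape == (3, None, 100))
print(shape.is_fully_known())
print(count_unknown(shape))
```

An object that merely imitates the tuple interface without subclassing it would fail the isinstance check here, which is one reason the comment above prefers a subtype.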