Data dependent/unknown shapes

See original GitHub issue

Some libraries, particularly those with a graph-based computational model (e.g., Dask and TensorFlow), have support for “unknown” or “data dependent” shapes, e.g., due to boolean indexing such as x[y > 0] (https://github.com/data-apis/array-api/issues/84). Other libraries (e.g., JAX and Dask in some cases) do not support some operations because they would produce such data dependent shapes.

We should consider a standard way to represent these shapes in shape attributes, ideally some extension of the “tuple of integer” format used for fully known shapes. For example, TensorFlow and Dask currently use different representations:

TensorFlow uses a custom TensorShape object (which acts very similarly to tuple), where some values may be None
Dask uses tuples, where some values may be nan instead of integers
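To make the difference concrete, here is a minimal sketch of how the two conventions could be mapped onto one representation. The helper name `normalize_shape` is hypothetical and not part of Dask, TensorFlow, or the array API standard; the input tuples are hand-written stand-ins for what those libraries report, not values produced by calling them.

```python
import math
from typing import Optional, Sequence, Tuple


def normalize_shape(shape: Sequence[object]) -> Tuple[Optional[int], ...]:
    """Hypothetical helper: map NaN or None entries to None, everything else to int."""
    normalized = []
    for dim in shape:
        if dim is None or (isinstance(dim, float) and math.isnan(dim)):
            # Unknown dimension, regardless of which convention marked it.
            normalized.append(None)
        else:
            normalized.append(int(dim))
    return tuple(normalized)


dask_style = (float("nan"), 3)  # Dask-like: NaN marks an unknown dimension
tf_style = (None, 3)            # TensorFlow-like: None marks an unknown dimension

print(normalize_shape(dask_style))  # (None, 3)
print(normalize_shape(tf_style))    # (None, 3)
```

Both inputs end up as the same tuple, which is the kind of common format the issue is asking for.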

Issue Analytics

State:
Created 3 years ago
Comments:18 (14 by maintainers)

Top GitHub Comments

kgryte commented, Oct 21, 2021

To summarize the above discussion and the discussions during the consortium meetings (e.g., 7 October 2021), the path forward seems to be as follows:

if rank is unknown, ndim should be None and shape should be None.
if rank is known but dimensions are unknown, ndim should be an int and shape should be a tuple where known shape dimensions should be ints and unknown shape dimensions should be None.
if rank is known and dimensions are known, ndim should be an int and shape should be a tuple whose dimensions should be ints.
in most cases, shape should be a tuple. For those use cases where a custom object is needed, the custom object must act like a tuple.
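The three cases above can be sketched as a small consistency check. This is only an illustration of the proposed rules: the function `describe` and the `Shape` alias are hypothetical names, and nothing here is taken from the specification text itself.

```python
from typing import Optional, Tuple

# Hypothetical alias: None for unknown rank, otherwise a tuple of int-or-None.
Shape = Optional[Tuple[Optional[int], ...]]


def describe(ndim: Optional[int], shape: Shape) -> str:
    """Classify an (ndim, shape) pair according to the rules listed above."""
    if ndim is None:
        # Unknown rank: shape must be None as well.
        if shape is not None:
            raise ValueError("shape must be None when ndim is None")
        return "unknown rank"
    if shape is None or len(shape) != ndim:
        raise ValueError("shape must be a tuple of length ndim")
    if any(dim is None for dim in shape):
        return "known rank, unknown dimensions"
    return "fully known"


print(describe(None, None))          # unknown rank
print(describe(3, (3, None, 100)))   # known rank, unknown dimensions
print(describe(2, (3, 2)))           # fully known
```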

We can consider adding a functional shape API that supports returning the (dynamic) shape of an array as an array. This would be similar to TensorFlow’s tf.shape and MXNet’s shape_array APIs. This API would allow returning the shape of an array as the result of delayed computation. We can push this decision to the 2022 revision of the specification.

The above would not satisfy @jakirkham’s desire for behavior which poisons subsequent operations; however, as discussed here and elsewhere, NaN does not seem like the right abstraction (e.g., value equality, introducing floating-point semantics, etc). While shape arithmetic is a valid use case, we should consider ways to ensure that this is not a userland concern. For implementations, shape arithmetic can be necessary for allocation (e.g., concat, tiling, etc), but it would seem doable to mimic NaN poison behavior with None and straightforward helper functions.
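As a rough sketch of what such helper functions might look like, the following treats None as "unknown" and lets it propagate through shape arithmetic, much like NaN would. All names (`add_dims`, `concat_shape`, `size`) are hypothetical, and the sketch assumes the non-concatenated dimensions are simply taken from the first shape without further compatibility checks.

```python
import math
from typing import Optional, Sequence, Tuple

Dim = Optional[int]


def add_dims(a: Dim, b: Dim) -> Dim:
    """Add two dimensions; the result is unknown if either input is unknown."""
    return None if a is None or b is None else a + b


def concat_shape(shapes: Sequence[Tuple[Dim, ...]], axis: int) -> Tuple[Dim, ...]:
    """Shape of a concatenation along `axis`, with None poisoning the sum."""
    first = shapes[0]
    result = list(first)
    for shape in shapes[1:]:
        if len(shape) != len(first):
            raise ValueError("all shapes must have the same rank")
        result[axis] = add_dims(result[axis], shape[axis])
    return tuple(result)


def size(shape: Tuple[Dim, ...]) -> Dim:
    """Total element count, or None if any dimension is unknown."""
    if any(dim is None for dim in shape):
        return None
    return math.prod(shape)


print(concat_shape([(2, 3), (4, 3)], axis=0))     # (6, 3)
print(concat_shape([(2, 3), (None, 3)], axis=0))  # (None, 3)
print(size((None, 3)))                            # None
```

With helpers like these, the poisoning behavior lives inside the implementation rather than in user code.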

A custom object as suggested by @shoyer would be doable; however, this is an abstraction which does not currently exist in, and is not used by, array libraries (although it does exist in dataframe libraries, such as pandas) and would be specific to each implementing array library. The advantage of None and NaN is that they exist independently of any one array library.

Accordingly, I think the lowest common denominator is requiring shape return a tuple (or something tuple-like) and using None for unknown dimensions. While not perfect, this seems to me the most straightforward path atm.

rgommers commented, Dec 16, 2020

Alternatively users could check for ndarray.ndim is None leaving ndarray.shape undefined for unknown rank cases.

~I don’t think we included anything in the standard that could cause this.~ ~Indexing may make the size of one or more dimensions unknown, but ndim should always be known because for data-dependent operations there’s no squeezing:~

>>> x = np.ones((3,2))
>>> x[:0, np.zeros(2, dtype=bool)]
array([], shape=(0, 0), dtype=float64)

After writing this: we did include squeeze() so a combination of indexing and explicit squeezing could indeed cause this.

Leaving any property (like .shape) undefined seems unhealthy. I’d say in this case, use ndim = None and shape is None. If dimensionality is known but exact shape isn’t, use None for the unknown dimensions (e.g., shape = (3, None, 100)).

and just has its TensorShape be a tuple of either integers or None to indicate shapes which we do not know at tracing time. So I think saying shapes are tuples here is probably a good idea.

Maybe we should say “shape is tuple” and add a note that if for backwards compat reasons an implementation is using a custom object, then it should make sure that it is a subtype of tuple so that it works when users annotate code using .shape with Tuple.
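A minimal sketch of that idea, assuming a made-up class name `LegacyShape` (it is not TensorFlow's TensorShape or any real library's type): because the custom object subclasses tuple, it compares equal to plain tuples and can be passed to code annotated with Tuple.

```python
from typing import Optional, Tuple


class LegacyShape(tuple):
    """Hypothetical custom shape object that is still a tuple."""

    @property
    def is_fully_known(self) -> bool:
        # True only when no dimension is marked unknown (None).
        return all(dim is not None for dim in self)


def rank(shape: Tuple[Optional[int], ...]) -> int:
    """User code annotated with Tuple; works for plain tuples and subclasses."""
    return len(shape)


s = LegacyShape((3, None, 100))
print(isinstance(s, tuple))   # True
print(s == (3, None, 100))    # True
print(rank(s))                # 3
print(s.is_fully_known)       # False
```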
