Determining the resultant metadata of a function

When wrapping other Array libraries (as happens in Dask or XArray), there is a need to determine what the result of an operation may look like in terms of its metadata. This typically happens before any real computation has begun.

For example, take a.sum(axis=0): we would like to determine the data type, shape, etc. of the resulting array without computing it. Currently this is done by carrying around an a._meta attribute holding a sample array that has similar characteristics but is much smaller and easier to operate on. This a._meta object is then passed to operations (like a._meta.sum(axis=0)), and the result is inspected to ascertain what a.sum(axis=0) would likely produce. This isn’t perfect, and some cases with UDFs can get tricky (like apply_along_axis). However, it still works reasonably well for common use cases.
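
A minimal sketch of the sample-array idea described above, using plain NumPy. The helper name make_meta is hypothetical and this is not Dask's actual implementation; it only assumes that a zero-size array with the same number of dimensions and dtype can stand in for the real one.

```python
import numpy as np

def make_meta(shape, dtype):
    """Hypothetical helper: a zero-size stand-in with the same ndim and dtype."""
    return np.empty((0,) * len(shape), dtype=dtype)

# Metadata of a large array that we do not want to allocate or compute on.
real_shape, real_dtype = (10_000, 20_000), np.float32

meta = make_meta(real_shape, real_dtype)

# Run the operation on the tiny stand-in and inspect the result's metadata.
result_meta = meta.sum(axis=0)
print(result_meta.dtype, result_meta.ndim)
```

Note that the stand-in only recovers the dtype and the number of dimensions; the concrete output shape still has to be tracked separately.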

That said, it would be nice to have an API solution that does not rely on these sample computations. Admittedly there may not be an easy answer to this use case, but I wanted to raise it for discussion, given that this could be quite helpful when reasoning about applying operations to large arrays.

Note: While this comes up with Arrays, there is similar logic for DataFrames as well.

Issue Analytics

Created a year ago
Comments: 5 (5 by maintainers)

Top GitHub Comments

rgommers commented, Jun 8, 2022

This would be quite interesting, but I think also complex to implement? It reminds me of the meta backend in PyTorch. NumPy has some functions for parts of this, like broadcast_shapes and result_type. But doing the whole thing for all functions the API supports isn’t possible with NumPy primitives AFAIK.
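
For reference, the two NumPy helpers mentioned in this comment can be called on shapes and dtypes alone, without creating any arrays. The input shapes and dtypes below are arbitrary illustration values.

```python
import numpy as np

# Shape of the result of broadcasting two operands together.
out_shape = np.broadcast_shapes((3, 1), (1, 4))

# Dtype that NumPy's promotion rules give for these two input dtypes.
out_dtype = np.result_type(np.int32, np.float64)

print(out_shape, out_dtype)
```

These cover element-wise binary operations; as the comment notes, they do not by themselves describe reductions or other shape-changing functions.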

> Currently this is done by carrying around an a._meta attribute with a sample array that has similar characteristics, but is much smaller and easier to operate on.

That sounds like a decent implementation choice for Dask, although to deal with corner cases like 0-D arrays you probably need a bunch of logic (?). It probably doesn’t make sense for other libraries; we’d really need a classification of operations by shape behavior (“element-wise”, “reduction”, etc., plus one-offs) as well as casting rules (maybe as ufunc-like signatures, ii -> f?), and then from-first-principles calculations, I’d think.
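
One way such a classification might look is sketched below. Everything here is hypothetical: the OP_RULES table, the rule names, and infer_meta are illustration only and are not part of NumPy, Dask, or the array API standard. The reduction branch is deliberately minimal and handles only a single non-negative axis.

```python
import numpy as np

# Hypothetical classification table: op name -> (shape behavior, dtype rule).
OP_RULES = {
    "add": ("elementwise", "promote"),
    "equal": ("elementwise", "bool"),
    "max": ("reduction", "promote"),
}

def infer_meta(op, shapes, dtypes, axis=None):
    """Compute (shape, dtype) of the result from input metadata only."""
    kind, dtype_rule = OP_RULES[op]
    if kind == "elementwise":
        shape = np.broadcast_shapes(*shapes)
    else:
        # Reduction of one input: drop every axis, or one non-negative axis.
        (shape,) = shapes
        shape = () if axis is None else shape[:axis] + shape[axis + 1:]
    dtype = np.dtype(bool) if dtype_rule == "bool" else np.result_type(*dtypes)
    return shape, dtype

print(infer_meta("add", [(3, 1), (1, 4)], [np.int32, np.float64]))
print(infer_meta("max", [(3, 4)], [np.int16], axis=0))
```

The table-driven design keeps the per-function knowledge in data rather than code, which is roughly what a "classification of operations" would have to provide; one-off functions would still need dedicated rules.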

It’d be very nice to see an implementation if anyone has something like this floating around somewhere.

asmeurer commented, Sep 5, 2022

For data type, most functions do type promotion but there are a few exceptions, like equal which always returns bool. These categories could be spelled out in the signatures package https://github.com/data-apis/array-api/issues/411.
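
The difference between the two dtype categories can be seen with NumPy, used here only as a convenient example library; the input dtypes are arbitrary.

```python
import numpy as np

a = np.ones(3, dtype=np.int32)
b = np.ones(3, dtype=np.float32)

# Most functions: the result dtype follows type promotion of the inputs.
promoted = np.add(a, b).dtype

# Exception: comparisons such as equal return bool regardless of input dtypes.
compared = np.equal(a, b).dtype

print(promoted == np.result_type(a.dtype, b.dtype), compared)
```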

One challenge is that the spec only specifies a minimal set of required dtypes. It doesn’t disallow libraries from implementing additional dtypes on functions.

Shape I think is harder because the result shape depends on things like axis keyword arguments, so you’d really need a function to determine it given a specific function and input keyword arguments.
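
As a sketch of what such a shape function could look like for one family of functions, the hypothetical helper below computes the output shape of a reduction from the input shape and the axis and keepdims keyword arguments. The name reduced_shape and its signature are assumptions for illustration, not an existing API.

```python
def reduced_shape(shape, axis=None, keepdims=False):
    """Hypothetical helper: output shape of a reduction, from metadata only."""
    ndim = len(shape)
    if axis is None:
        axes = set(range(ndim))  # reduce over every axis
    else:
        axes_in = (axis,) if isinstance(axis, int) else tuple(axis)
        axes = set()
        for ax in axes_in:
            if not -ndim <= ax < ndim:
                raise ValueError(f"axis {ax} is out of bounds for {ndim} dimensions")
            axes.add(ax % ndim)  # normalize negative axes
    if keepdims:
        return tuple(1 if i in axes else n for i, n in enumerate(shape))
    return tuple(n for i, n in enumerate(shape) if i not in axes)

print(reduced_shape((3, 4, 5), axis=0))
print(reduced_shape((3, 4, 5), axis=(0, -1), keepdims=True))
```

Every function whose shape depends on keyword arguments would need its own rule of this kind, which is part of why the shape side is harder than the dtype side.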

I think ideally all this stuff would be encoded in the type annotations somehow.
