Nested data type support
A lot of big data has nested data types, including structs, arrays, and maps. These are specific to big data / Spark and are not found in pandas, so we would need to design the APIs ourselves. It's best to write a short design doc on how to access and operate on these three types.
Some initial thoughts after discussing with @thunterdb …
We could reuse Spark's accessors, e.g. `a.b` means struct `a`'s sub-field `b`, and `a[0]` means the first element of array `a`. `df['struct_field']` should return a DataFrame (so a DataFrame is internally backed by a query plan along with a column expression).
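For reference, Spark already exposes these accessors on its own DataFrame/Column API; a quick self-contained illustration (the column names are made up for the example):

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Toy data covering the three nested types under discussion.
df = spark.createDataFrame(
    [((1, "x"), [10, 20], {"k": "v"})],
    "a struct<b:int, c:string>, arr array<int>, m map<string,string>",
)

df.select(
    F.col("a.b"),     # struct a's sub-field b
    F.col("arr")[0],  # first element of array arr
    F.col("m")["k"],  # map lookup by key
).show()
```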
Note that the precedence should be: if there is a column named `a.b`, then `df['a.b']` should return a Series representing that column. But if there is no column named `a.b`, then `df['a.b']` should return a Series representing struct column `a`'s sub-field `b`.
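A minimal sketch of that resolution rule, assuming a thin wrapper over a Spark DataFrame; `NestedAccessDataFrame` and its `_sdf` attribute are hypothetical names, and a real implementation would wrap the resulting column expression in a Series rather than returning it directly:

```python
import pyspark.sql.functions as F


class NestedAccessDataFrame:
    """Hypothetical wrapper illustrating the proposed __getitem__ precedence."""

    def __init__(self, sdf):
        self._sdf = sdf  # backing pyspark.sql.DataFrame

    def __getitem__(self, key):
        # 1. An exact top-level column named, e.g., "a.b" takes precedence;
        #    backticks make Spark treat the dotted name literally.
        if key in self._sdf.columns:
            return F.col(f"`{key}`")
        # 2. Otherwise interpret dots as struct sub-field access (a.b, a.b.c, ...).
        if "." in key:
            top, *fields = key.split(".")
            if top in self._sdf.columns:
                col = self._sdf[top]
                for field in fields:
                    col = col.getField(field)
                return col
        raise KeyError(key)
```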
Top GitHub Comments
Are you folks aware of http://pandas.pydata.org/pandas-docs/stable/development/extending.html#extension-types? These provide in-pandas support for arbitrary data types, defined internally or externally, and were in fact developed to allow for nested data types. Using a backing store of `pyarrow`, nested types shouldn't be that hard to support. pyarrow is soon going to be able to fully replicate a round-trip of a DataFrame with extension types (and serialize to Parquet); see https://issues.apache.org/jira/browse/ARROW-2428, https://issues.apache.org/jira/browse/ARROW-5271, and https://issues.apache.org/jira/browse/ARROW-3829.
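A minimal sketch of that combination with current tooling, assuming pandas >= 2.0 (for `pd.ArrowDtype` and the `dtype_backend` keyword), which lets a nested `list<int64>` column survive a Parquet round-trip:

```python
import pandas as pd
import pyarrow as pa

# An Arrow-backed pandas Series holding a nested list<int64> type.
s = pd.Series(
    [[1, 2], [3], None],
    dtype=pd.ArrowDtype(pa.list_(pa.int64())),
)
print(s.dtype)  # list<item: int64>[pyarrow]

# Round-trip through Parquet, keeping the Arrow-backed dtype on read.
s.to_frame("a").to_parquet("nested.parquet")
df = pd.read_parquet("nested.parquet", dtype_backend="pyarrow")
print(df.dtypes)
```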