Nested data type support
A lot of big data has nested data types, including structs, arrays, and maps. These are specific to big data / Spark and are not found in pandas, so we would need to design the APIs ourselves. It's best to write a short design doc on how to access and operate on these three types.
Some initial thoughts after discussing with @thunterdb …
We could reuse Spark's accessors, e.g. `a.b` means struct `a`'s sub-field `b`, and `a[0]` means the first element of array `a`. `df['struct_field']` should return a DataFrame (so a DataFrame is internally backed by a query plan along with a column expression).
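For reference, Spark already exposes these accessors on its own DataFrame/Column API; a quick self-contained illustration (the column names are made up for the example):

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Toy data covering the three nested types under discussion.
df = spark.createDataFrame(
    [((1, "x"), [10, 20], {"k": "v"})],
    "a struct<b:int, c:string>, arr array<int>, m map<string,string>",
)

df.select(
    F.col("a.b"),     # struct a's sub-field b
    F.col("arr")[0],  # first element of array arr
    F.col("m")["k"],  # map lookup by key
).show()
```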
Note that the precedence should be: if there is a column named `a.b`, then `df['a.b']` should return a Series representing that column. But if there is no column named `a.b`, then `df['a.b']` should return a Series representing struct column `a`'s sub-field `b`.
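A minimal sketch of that resolution rule, assuming a thin wrapper over a Spark DataFrame; `NestedAccessDataFrame` and its `_sdf` attribute are hypothetical names, and a real implementation would wrap the resulting column expression in a Series rather than returning it directly:

```python
import pyspark.sql.functions as F


class NestedAccessDataFrame:
    """Hypothetical wrapper illustrating the proposed __getitem__ precedence."""

    def __init__(self, sdf):
        self._sdf = sdf  # backing pyspark.sql.DataFrame

    def __getitem__(self, key):
        # 1. An exact top-level column named, e.g., "a.b" takes precedence;
        #    backticks make Spark treat the dotted name literally.
        if key in self._sdf.columns:
            return F.col(f"`{key}`")
        # 2. Otherwise interpret dots as struct sub-field access (a.b, a.b.c, ...).
        if "." in key:
            top, *fields = key.split(".")
            if top in self._sdf.columns:
                col = self._sdf[top]
                for field in fields:
                    col = col.getField(field)
                return col
        raise KeyError(key)
```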
Top GitHub Comments
Are you folks aware of http://pandas.pydata.org/pandas-docs/stable/development/extending.html#extension-types? These provide in-pandas support for arbitrary data types, defined internally or externally, and were in fact developed to allow for nested data types. Using a backing store of `pyarrow`, nested types shouldn't be that hard to support. pyarrow is soon going to be able to fully replicate a round-trip of a DataFrame with extension types (and serialize to Parquet); see https://issues.apache.org/jira/browse/ARROW-2428, https://issues.apache.org/jira/browse/ARROW-5271, and https://issues.apache.org/jira/browse/ARROW-3829.
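A minimal sketch of that combination with current tooling, assuming pandas >= 2.0 (for `pd.ArrowDtype` and the `dtype_backend` keyword), which lets a nested `list<int64>` column survive a Parquet round-trip:

```python
import pandas as pd
import pyarrow as pa

# An Arrow-backed pandas Series holding a nested list<int64> type.
s = pd.Series(
    [[1, 2], [3], None],
    dtype=pd.ArrowDtype(pa.list_(pa.int64())),
)
print(s.dtype)  # list<item: int64>[pyarrow]

# Round-trip through Parquet, keeping the Arrow-backed dtype on read.
s.to_frame("a").to_parquet("nested.parquet")
df = pd.read_parquet("nested.parquet", dtype_backend="pyarrow")
print(df.dtypes)
```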