Nested data type support

Many big data sets use nested data types: structs, arrays, and maps. These types come from Spark and other big data systems and have no direct counterpart in pandas, so we would need to design the APIs ourselves. It’s best to start with a short design doc on how to access and operate on these three types.

Some initial thoughts after discussing with @thunterdb

We could reuse Spark’s accessors: a.b means sub-field b of struct a, and a[0] means the first element of array a. df['struct_field'] should return a DataFrame (so a DataFrame is internally backed by a query plan along with a column expression).
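To make the accessor semantics concrete, here is a minimal sketch in plain Python. It is not Spark or Koalas code: `get_path` is a hypothetical helper that evaluates an accessor string against a single record held as nested dicts and lists, assuming field names are simple identifiers and array indices are non-negative integers.

```python
import re

# Hypothetical helper, not part of Spark or Koalas: it only mimics the
# accessor semantics described above on plain Python dicts and lists.
_TOKEN = re.compile(r"\.?([A-Za-z_]\w*)|\[(\d+)\]")


def get_path(row, path):
    """Evaluate an accessor such as 'a.b' or 'tags[0]' against one record."""
    value = row
    pos = 0
    while pos < len(path):
        match = _TOKEN.match(path, pos)
        if match is None:
            raise ValueError(f"cannot parse accessor {path!r} at offset {pos}")
        field, index = match.groups()
        if field is not None:
            value = value[field]       # struct sub-field, e.g. a.b
        else:
            value = value[int(index)]  # array element, e.g. a[0]
        pos = match.end()
    return value


# One record with a struct column "a" and an array column "tags".
row = {"a": {"b": 1, "c": "x"}, "tags": ["red", "green"]}
sub_field = get_path(row, "a.b")      # sub-field b of struct a
first_tag = get_path(row, "tags[0]")  # first element of array tags
```

A real implementation would build a column expression on the query plan instead of walking Python objects, but the mapping from accessor syntax to struct and array access would be the same.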

Note on precedence: if there is a column named a.b, then df['a.b'] should return a Series representing that column. Only if there is no column named a.b should df['a.b'] return a Series representing sub-field b of struct column a.
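The precedence rule can be sketched as a small lookup function. This is an illustration only: `resolve`, `columns`, and `struct_fields` are hypothetical names standing in for a real schema, and the sketch handles a single level of nesting, splitting the key at the first dot.

```python
def resolve(columns, struct_fields, key):
    """Decide what df[key] refers to under the proposed precedence.

    columns: set of top-level column names (hypothetical schema stand-in).
    struct_fields: dict mapping a struct column name to its sub-field names.
    Returns ("column", name) or ("struct_field", parent, child).
    """
    # 1. A column whose literal name matches the key always wins.
    if key in columns:
        return ("column", key)
    # 2. Otherwise, try to read the key as <struct column>.<sub-field>.
    parent, dot, child = key.partition(".")
    if dot and parent in columns and child in struct_fields.get(parent, set()):
        return ("struct_field", parent, child)
    raise KeyError(key)


struct_fields = {"a": {"b"}}

# A literal column "a.b" exists, so it shadows the struct sub-field.
with_literal = resolve({"a", "a.b"}, struct_fields, "a.b")

# No literal column "a.b": fall back to sub-field b of struct column a.
without_literal = resolve({"a"}, struct_fields, "a.b")
```

Checking the literal name first keeps existing frames with dotted column names working unchanged; the struct interpretation only applies when the literal lookup fails.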

Issue Analytics

  • State: open
  • Created 4 years ago
  • Comments: 9 (6 by maintainers)

Top GitHub Comments

2 reactions
jreback commented, Jun 3, 2019

are you folks aware of: http://pandas.pydata.org/pandas-docs/stable/development/extending.html#extension-types

This is pandas’ built-in support for arbitrary data types, defined internally or externally. It was in fact developed to allow for nested data types. Supporting a backing store of pyarrow nested types shouldn’t be that hard.

1 reaction
jreback commented, Jun 5, 2019

pyarrow is soon going to be able to fully replicate a round-trip of a DataFrame with extension types (and serialize to parquet), see: https://issues.apache.org/jira/browse/ARROW-2428, https://issues.apache.org/jira/browse/ARROW-5271, https://issues.apache.org/jira/browse/ARROW-3829
