Pyarrow schema registration in Glue
This issue is similar to #29, but for Pyarrow.
Pyarrow supports richer types than pandas, in our case ListArray, which translates to array<int> in Glue.
The current implementation requires going through pandas, which stores the list in an object column, which then gets added to the schema as string.
Looking at the code, it looks like we reconstruct the Pyarrow schema anyway, so it might be as simple as exposing this entry point alongside the pandas one.
Issue Analytics
- State:
- Created 4 years ago
- Reactions: 2
- Comments: 7 (3 by maintainers)
PR #33 was updated to allow casting of data types as arguments… Unfortunately, Pyarrow has no type aliases for nested types.
Ref: https://github.com/apache/arrow/blob/apache-arrow-0.14.1/python/pyarrow/types.pxi#L1684
Maybe in the future we could prioritize a change to accept the data type objects themselves instead of the aliases. But for now, I think this is enough to close the issue.
Thanks!
To support ListType in Pyarrow schemas, it might be as simple as replacing lines 286 to 293 in awswrangler.glue with the following snippet.

By the way, it looks like it does not support DataType(null) either (which can happen if the whole column is empty). We decided to cast it to string, but that choice is arbitrary.