Support for automatic index and timezone recovery from Parquet files.
When Parquet files are created by Wrangler/Pandas, some metadata is stored in the file with hints about how to reconstruct indexes correctly or how to localize a datetime column to a specific time zone.
- When `wr.s3.to_parquet(index=True)` is used, we want to preserve the current behaviour of materialising the index(es) as Parquet columns. This is important because non-materialised index(es) will not be detected by engines such as Spark, PrestoDB, Redshift Spectrum, Athena, Hive, etc.
- Support for index(es) recovery following the Pandas metadata injected in the files.
- Support for MultiIndex
- Support for RangeIndex
- Support for datetime timezone recovery.
Related commits:
- Add support for timezone and index for wr.s3.read_parquet() https://github.com/awslabs/aws-data-wrangler/commit/0434ec72c8096bc7112d1cda10b7c551080a1f6b
- Improve index recovery https://github.com/awslabs/aws-data-wrangler/commit/50b80aee13c5c53db644faf284d89aedcc6ab6e7
- Add support for multiindex recovery https://github.com/awslabs/aws-data-wrangler/commit/c64e7e0ccefeaa4814872d31a5ad8f7a7b3f3e69
Original discussion: https://github.com/awslabs/aws-data-wrangler/pull/339
Issue Analytics
- State:
- Created 3 years ago
- Comments: 5 (3 by maintainers)
Hey, I think this is awesome! I tested it out a bunch with different index types and partitions, and I think the only round-trip issues I found are known pyarrow issues, mostly relating to categoricals. Specific to what Wrangler is responsible for, I can't find any issues.
I did not test anything with the timezones.
Tested it, and everything seems to work OK for us. Thanks again!