
Support for automatic index and timezone recovery from Parquet files.


When Parquet files are created by Wrangler/Pandas, some metadata is stored in the file with hints about how to reconstruct indexes correctly or how to localize a datetime column to a specific time zone.
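For reference, those hints live under the `pandas` key of the Parquet file's key-value metadata as a JSON document. The blob below is an illustrative sketch following the pandas metadata convention documented by Apache Arrow; the concrete values are made up:

```python
import json

# Illustrative example of the JSON document that pandas/pyarrow store under
# the b"pandas" key of a Parquet file's key-value metadata.
pandas_meta = json.loads("""
{
  "index_columns": [{"kind": "range", "name": null, "start": 0, "stop": 3, "step": 1}],
  "columns": [
    {"name": "ts", "field_name": "ts",
     "pandas_type": "datetimetz", "numpy_type": "datetime64[ns]",
     "metadata": {"timezone": "America/New_York"}}
  ],
  "pandas_version": "1.1.0"
}
""")

# A RangeIndex is described by a descriptor dict rather than a
# materialized column, so it takes up no space in the data itself.
print(pandas_meta["index_columns"][0]["kind"])            # range
# A tz-aware datetime column records its zone in per-column metadata.
print(pandas_meta["columns"][0]["metadata"]["timezone"])  # America/New_York
```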

  • When wr.s3.to_parquet(index=True) is called, we want to preserve the current behaviour of materializing the index(es) as Parquet columns. This is important because non-materialized indexes will not be detected by engines such as Spark, PrestoDB, Redshift Spectrum, Athena, Hive, etc.
  • Support for index recovery based on the Pandas metadata injected into the files.
  • Support for MultiIndex
  • Support for RangeIndex
  • Support for datetime timezone recovery.
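The recovery steps listed above can be sketched in pure Python. `recovery_plan` is a hypothetical helper name (not the actual Wrangler implementation) that reads the pandas metadata convention and decides which columns to turn back into index levels and which datetime columns to localize:

```python
def recovery_plan(pandas_meta: dict) -> dict:
    """Hypothetical helper: derive index/timezone recovery steps from the
    JSON metadata that pandas/pyarrow embed in a Parquet file."""
    plan = {"set_index": [], "range_index": None, "localize": {}}

    # Index recovery: each entry is either a column name (a materialized
    # index level — several entries means a MultiIndex) or a RangeIndex
    # descriptor dict, which is reconstructed without any data column.
    for entry in pandas_meta.get("index_columns", []):
        if isinstance(entry, dict) and entry.get("kind") == "range":
            plan["range_index"] = (entry["start"], entry["stop"], entry["step"])
        else:
            plan["set_index"].append(entry)

    # Timezone recovery: datetimetz columns carry their zone in the
    # per-column metadata, so the reader knows how to re-localize them.
    for col in pandas_meta.get("columns", []):
        md = col.get("metadata") or {}
        if col.get("pandas_type") == "datetimetz" and "timezone" in md:
            plan["localize"][col["name"]] = md["timezone"]
    return plan
```

A reader would then apply the plan to the loaded DataFrame, e.g. `set_index` on the listed columns and `tz_convert` on each entry in `localize`.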

Related commits:

Original discussion: https://github.com/awslabs/aws-data-wrangler/pull/339

cc: @alexifm @Digma

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

2 reactions
alexifm commented, Aug 28, 2020

Hey, I think this is awesome! I tested it out a bunch with different index types and partitions and I think the only round trip issues I found are known pyarrow issues, mostly relating to categoricals. Specific to what the wrangler is responsible for, I can’t find any issues.

I did not test anything with the timezones.

1 reaction
Digma commented, Sep 4, 2020

Tested it and everything seems to work ok for us. Thanks again!


