
Support for automatic index and timezone recovery from Parquet files.


When Parquet files are created by Wrangler/Pandas, some metadata is stored in the file with hints about how to reconstruct indexes correctly or how to localize a datetime column to a specific time zone.
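For reference, those hints live under the `pandas` key of the Parquet file's key-value metadata as a JSON document. The blob below is an illustrative sketch following the pandas metadata convention documented by Apache Arrow; the concrete values are made up:

```python
import json

# Illustrative example of the JSON document that pandas/pyarrow store under
# the b"pandas" key of a Parquet file's key-value metadata.
pandas_meta = json.loads("""
{
  "index_columns": [{"kind": "range", "name": null, "start": 0, "stop": 3, "step": 1}],
  "columns": [
    {"name": "ts", "field_name": "ts",
     "pandas_type": "datetimetz", "numpy_type": "datetime64[ns]",
     "metadata": {"timezone": "America/New_York"}}
  ],
  "pandas_version": "1.1.0"
}
""")

# A RangeIndex is described by a descriptor dict rather than a
# materialized column, so it takes up no space in the data itself.
print(pandas_meta["index_columns"][0]["kind"])            # range
# A tz-aware datetime column records its zone in per-column metadata.
print(pandas_meta["columns"][0]["metadata"]["timezone"])  # America/New_York
```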

  • When wr.s3.to_parquet(index=True) is called, we want to preserve the current behaviour of materializing the index(es) as Parquet columns. This is important because non-materialized indexes will not be detected by engines such as Spark, PrestoDB, Redshift Spectrum, Athena, Hive, etc.
  • Support for index recovery based on the Pandas metadata injected into the files.
  • Support for MultiIndex
  • Support for RangeIndex
  • Support for datetime timezone recovery.
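The recovery steps listed above can be sketched in pure Python. `recovery_plan` is a hypothetical helper name (not the actual Wrangler implementation) that reads the pandas metadata convention and decides which columns to turn back into index levels and which datetime columns to localize:

```python
def recovery_plan(pandas_meta: dict) -> dict:
    """Hypothetical helper: derive index/timezone recovery steps from the
    JSON metadata that pandas/pyarrow embed in a Parquet file."""
    plan = {"set_index": [], "range_index": None, "localize": {}}

    # Index recovery: each entry is either a column name (a materialized
    # index level — several entries means a MultiIndex) or a RangeIndex
    # descriptor dict, which is reconstructed without any data column.
    for entry in pandas_meta.get("index_columns", []):
        if isinstance(entry, dict) and entry.get("kind") == "range":
            plan["range_index"] = (entry["start"], entry["stop"], entry["step"])
        else:
            plan["set_index"].append(entry)

    # Timezone recovery: datetimetz columns carry their zone in the
    # per-column metadata, so the reader knows how to re-localize them.
    for col in pandas_meta.get("columns", []):
        md = col.get("metadata") or {}
        if col.get("pandas_type") == "datetimetz" and "timezone" in md:
            plan["localize"][col["name"]] = md["timezone"]
    return plan
```

A reader would then apply the plan to the loaded DataFrame, e.g. `set_index` on the listed columns and `tz_convert` on each entry in `localize`.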

Related commits:

Original discussion: https://github.com/awslabs/aws-data-wrangler/pull/339

cc: @alexifm @Digma

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

2 reactions
alexifm commented, Aug 28, 2020

Hey, I think this is awesome! I tested it out a bunch with different index types and partitions and I think the only round trip issues I found are known pyarrow issues, mostly relating to categoricals. Specific to what the wrangler is responsible for, I can’t find any issues.

I did not test anything with the timezones.

1 reaction
Digma commented, Sep 4, 2020

Tested it and everything seems to work ok for us. Thanks again!


