Add ability to output Hive/Impala compatible timestamps
See original GitHub issue

Hey,
I am doing some work with Amazon's Athena (Presto under the hood) and using fastparquet to convert JSON files to Parquet format. However, when I output datetime[ns] fields and read them in Athena, I get incorrect results:
Actual Dates:
"2016-08-08 23:08:22"
"2016-08-08 23:08:22"
"2016-08-08 23:08:22"
"2016-08-08 23:08:22"
"2016-08-08 23:08:22"
"2016-08-08 23:08:22"
"2016-08-08 23:08:23"
"2016-08-08 23:08:22"
"2016-08-08 23:08:22"
"2016-08-08 23:08:22"
Returned Dates:
"+48575-01-04 19:26:40.000"
"+48575-01-04 19:26:40.000"
"+48575-01-04 19:26:40.000"
"+48575-01-04 19:26:40.000"
"+48575-01-04 19:26:40.000"
"+48575-01-04 19:26:40.000"
"+48575-01-04 19:43:20.000"
"+48575-01-04 19:26:40.000"
"+48575-01-04 19:26:40.000"
"+48575-01-04 19:26:40.000"
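(A quick check on that output, my own analysis rather than part of the issue: the two distinct returned times differ by exactly 1000x the one-second gap between the originals, so the stored values are being read back at the wrong scale, e.g. a nanosecond count taken as microseconds.)

```python
# The originals differ by 1 second (23:08:22 vs 23:08:23); the returned
# times differ by 19:43:20 - 19:26:40. Comparing the gaps exposes the
# scale factor of the misinterpretation.
gap_original = 1                               # seconds
t1 = 19 * 3600 + 26 * 60 + 40                  # 19:26:40 as seconds of day
t2 = 19 * 3600 + 43 * 60 + 20                  # 19:43:20 as seconds of day
gap_returned = t2 - t1
print(gap_returned // gap_original)            # -> 1000
```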
This is because Hive (and Impala) use a different timestamp format (int96) than the Parquet default. Check out these posts for more details:
- http://stackoverflow.com/questions/28292876/hives-timestamp-is-same-as-parquets-timestamp
- https://community.mapr.com/thread/18883-getting-weird-output-for-date-timestamp-data-type-columns-while-selecting-data-from-parquet-file-in-drill
- https://github.com/Parquet/parquet-mr/issues/218
It would be helpful if fastparquet used the compatible int96 TIMESTAMP format when writing with file_scheme='hive'.
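For background, the int96 timestamp that Hive/Impala expect packs the time into 12 bytes: 8 little-endian bytes of nanoseconds within the day, followed by 4 bytes of Julian day number. A minimal sketch of that encoding in plain Python (not fastparquet's actual code, and assuming the datetime is UTC):

```python
import struct
from datetime import datetime, timezone

JULIAN_EPOCH_DAY = 2440588  # Julian day number of 1970-01-01

def to_int96(ts):
    """Pack a UTC datetime into the 12-byte int96 layout used by
    Hive/Impala: 8 little-endian bytes of nanoseconds within the day,
    then 4 bytes of Julian day number."""
    days, secs = divmod(int(ts.timestamp()), 86400)
    nanos = secs * 10**9 + ts.microsecond * 1000
    return struct.pack("<qi", nanos, days + JULIAN_EPOCH_DAY)

packed = to_int96(datetime(2016, 8, 8, 23, 8, 22, tzinfo=timezone.utc))
print(len(packed))  # -> 12
```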
Issue Analytics
- State:
- Created 7 years ago
- Reactions: 4
- Comments: 21 (11 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Excellent. PR #66 is almost ready for py2 support, and I expect both of these to be merged soon.
where BIGINT and time_from_unix are my guesses of the appropriate Athena terms. HiveQL seems to need the integer in seconds, and the data has it in microseconds, so you would need

from_unixtime(date_added / 1000000)

Have you tried the new output with MR-times in #83?
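That microseconds-to-seconds arithmetic can be sanity-checked in plain Python (my own check; date_added is just the example column name from the comment above):

```python
from datetime import datetime, timezone

# 2016-08-08 23:08:22 UTC stored as microseconds since the epoch
date_added = 1470697702 * 10**6

# The same division from_unixtime(date_added / 1000000) performs:
recovered = datetime.fromtimestamp(date_added / 1000000, tz=timezone.utc)
print(recovered)  # -> 2016-08-08 23:08:22+00:00
```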