Add ability to output Hive/Impala compatible timestamps

Hey,

I am working with Amazon’s Athena (Presto under the hood) and using fastparquet to convert JSON files to Parquet format. However, when I write datetime64[ns] fields and read them back in Athena, I get incorrect results:

Actual Dates:

"2016-08-08 23:08:22"
"2016-08-08 23:08:22"
"2016-08-08 23:08:22"
"2016-08-08 23:08:22"
"2016-08-08 23:08:22"
"2016-08-08 23:08:22"
"2016-08-08 23:08:23"
"2016-08-08 23:08:22"
"2016-08-08 23:08:22"
"2016-08-08 23:08:22"

Returned Dates:

"+48575-01-04 19:26:40.000"
"+48575-01-04 19:26:40.000"
"+48575-01-04 19:26:40.000"
"+48575-01-04 19:26:40.000"
"+48575-01-04 19:26:40.000"
"+48575-01-04 19:26:40.000"
"+48575-01-04 19:43:20.000"
"+48575-01-04 19:26:40.000"
"+48575-01-04 19:26:40.000"
"+48575-01-04 19:26:40.000"

This is because Hive (and Impala) use a different timestamp format (int96) than the Parquet default. Check out these posts for more details:

  1. http://stackoverflow.com/questions/28292876/hives-timestamp-is-same-as-parquets-timestamp
  2. https://community.mapr.com/thread/18883-getting-weird-output-for-date-timestamp-data-type-columns-while-selecting-data-from-parquet-file-in-drill
  3. https://github.com/Parquet/parquet-mr/issues/218

It would be helpful if it used the compatible TIMESTAMP format when writing with file_scheme='hive'.
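
For reference, the int96 timestamp layout that Impala and Hive read is commonly described as 12 bytes: an 8-byte little-endian count of nanoseconds within the day, followed by a 4-byte little-endian Julian day number. The sketch below packs and unpacks that layout with the standard library only. The helper names `to_int96` and `from_int96` are hypothetical and are not part of fastparquet; this illustrates the byte layout under the assumption that timestamps are UTC, and is not how any particular writer implements it.

```python
import struct
from datetime import datetime, timedelta, timezone

# Julian day number of 1970-01-01, the Unix epoch.
JULIAN_DAY_OF_UNIX_EPOCH = 2440588
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_int96(ts: datetime) -> bytes:
    """Pack a timezone-aware datetime into the 12-byte int96 layout (hypothetical helper)."""
    delta = ts - EPOCH
    julian_day = JULIAN_DAY_OF_UNIX_EPOCH + delta.days
    # timedelta keeps seconds and microseconds non-negative, so this is time within the day.
    nanos_in_day = (delta.seconds * 1_000_000 + delta.microseconds) * 1_000
    # "<qi": little-endian, 8-byte nanoseconds followed by 4-byte Julian day, no padding.
    return struct.pack("<qi", nanos_in_day, julian_day)


def from_int96(raw: bytes) -> datetime:
    """Unpack the 12-byte int96 layout back into a UTC datetime (hypothetical helper)."""
    nanos_in_day, julian_day = struct.unpack("<qi", raw)
    days = julian_day - JULIAN_DAY_OF_UNIX_EPOCH
    # Python datetimes only resolve microseconds, so sub-microsecond digits are dropped.
    return EPOCH + timedelta(days=days, microseconds=nanos_in_day // 1_000)


ts = datetime(2016, 8, 8, 23, 8, 22, tzinfo=timezone.utc)
raw = to_int96(ts)
print(len(raw), from_int96(raw))
```

The point of the sketch is that the value is not a single epoch offset, so a reader expecting int96 and a writer emitting a 64-bit integer disagree about what the bytes mean.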

Issue Analytics

  • State: closed
  • Created 7 years ago
  • Reactions: 4
  • Comments: 21 (11 by maintainers)

Top GitHub Comments

1 reaction
martindurant commented, Feb 9, 2017

Excellent. PR #66 is almost ready for py2 support, and I expect both of these to be merged soon.

1 reaction
martindurant commented, Feb 9, 2017
  1. I have pandas 0.19, so that’s probably the difference.
  2. datetime64[ns] and M8[ns] are roughly equivalent (the former is the pandas string version of the latter). I meant that you should write the Parquet file with fastparquet as before, but then do something like:
CREATE EXTERNAL TABLE fastparquet_test (
    id STRING,
    date_added BIGINT
) STORED AS PARQUET
LOCATION 's3://yipit-test/test_fastparquet';

SELECT id, time_from_unix(date_added) FROM fastparquet_test;

where BIGINT and time_from_unix are my guesses at the appropriate Athena terms. HiveQL seems to need the integer in seconds, and the data has it in microseconds (µs), so you would need from_unixtime(date_added / 1000000).
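
The same unit conversion can be checked outside the query engine. This is a minimal Python sketch, assuming the stored integers are microseconds since the Unix epoch in UTC; `from_epoch_micros` is a hypothetical helper name, and the sample value is simply the first date from the issue re-encoded under that assumption.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
MICROS_PER_SECOND = 1_000_000


def from_epoch_micros(micros: int) -> datetime:
    """Convert microseconds since the Unix epoch to a UTC datetime (hypothetical helper)."""
    return EPOCH + timedelta(microseconds=micros)


# 2016-08-08 23:08:22 UTC expressed as microseconds since the epoch (assumed encoding).
date_added = 1_470_697_702_000_000

# Integer seconds, the unit a from_unixtime-style function expects.
seconds = date_added // MICROS_PER_SECOND

print(seconds)                        # 1470697702
print(from_epoch_micros(date_added))  # 2016-08-08 23:08:22+00:00
```

Dividing by 1,000,000 before handing the value to a seconds-based function is the step the comment above describes; skipping it leaves the number a million times too large for that function.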

Have you tried the new output with MR-times in #83?
