Add ability to output Hive/Impala compatible timestamps

Hey,

I am doing some work with Amazon’s Athena (Presto under the hood) and using fastparquet to convert JSON files to Parquet format. However, when I write datetime64[ns] fields and read them in Athena, I get incorrect results:

Actual Dates:

"2016-08-08 23:08:22"
"2016-08-08 23:08:22"
"2016-08-08 23:08:22"
"2016-08-08 23:08:22"
"2016-08-08 23:08:22"
"2016-08-08 23:08:22"
"2016-08-08 23:08:23"
"2016-08-08 23:08:22"
"2016-08-08 23:08:22"
"2016-08-08 23:08:22"

Returned Dates:

"+48575-01-04 19:26:40.000"
"+48575-01-04 19:26:40.000"
"+48575-01-04 19:26:40.000"
"+48575-01-04 19:26:40.000"
"+48575-01-04 19:26:40.000"
"+48575-01-04 19:26:40.000"
"+48575-01-04 19:43:20.000"
"+48575-01-04 19:26:40.000"
"+48575-01-04 19:26:40.000"
"+48575-01-04 19:26:40.000"

This is because Hive (and Impala) use a different timestamp format (int96) than the Parquet default. Check out these posts for more details:

It would be helpful if fastparquet wrote the compatible TIMESTAMP format when writing with file_scheme='hive'.
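For readers unfamiliar with the int96 layout mentioned above, here is a minimal sketch of how such a value is commonly packed: 8 little-endian bytes of nanoseconds within the day, followed by 4 little-endian bytes holding the Julian day number. The helper names `to_int96` and `from_int96` are hypothetical and exist only for this illustration; this is not fastparquet's code.

```python
import struct
from datetime import datetime, timedelta, timezone

# Julian day number of 1970-01-01, the Unix epoch.
JULIAN_DAY_OF_UNIX_EPOCH = 2440588
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_int96(ts: datetime) -> bytes:
    """Pack a timezone-aware datetime into 12 bytes (hypothetical helper)."""
    delta = ts - EPOCH
    julian_day = JULIAN_DAY_OF_UNIX_EPOCH + delta.days
    # Nanoseconds elapsed since midnight of that day.
    nanos_in_day = (delta.seconds * 1_000_000 + delta.microseconds) * 1_000
    # "<" = little-endian with no padding: 8-byte int, then 4-byte int.
    return struct.pack("<qi", nanos_in_day, julian_day)


def from_int96(raw: bytes) -> datetime:
    """Inverse of to_int96 (hypothetical helper)."""
    nanos_in_day, julian_day = struct.unpack("<qi", raw)
    days = julian_day - JULIAN_DAY_OF_UNIX_EPOCH
    return EPOCH + timedelta(days=days, microseconds=nanos_in_day // 1_000)


sample = datetime(2016, 8, 8, 23, 8, 22, tzinfo=timezone.utc)
packed = to_int96(sample)
restored = from_int96(packed)
```

The round trip through `from_int96` is only a sanity check on the packing; Python's `datetime` drops any sub-microsecond part, so this sketch does not preserve full nanosecond precision.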


Top GitHub Comments

martindurant commented, Feb 9, 2017

Excellent. PR #66 is almost ready for py2 support, and I expect both of these to be merged soon.

martindurant commented, Feb 9, 2017

I have pandas 0.19, so that’s probably the difference.
datetime64[ns] and M8[ns] are roughly equivalent (the former is the pandas string version of the latter). I meant that you should output the Parquet file using fastparquet as before, but do something like:

CREATE EXTERNAL TABLE fastparquet_test (
    id STRING,
    date_added BIGINT
) STORED AS PARQUET
LOCATION 's3://yipit-test/test_fastparquet';

SELECT id, time_from_unix(date_added) FROM fastparquet_test;

where BIGINT and time_from_unix are my guesses at the appropriate Athena terms. HiveQL seems to need the integer in seconds, and the data has it in microseconds, so you would need from_unixtime(date_added / 1000000).
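The unit conversion can be checked outside Athena. The sketch below assumes the "Actual Dates" from the issue are in UTC and are stored as integer microseconds since the Unix epoch; the function names are hypothetical. It also shows that reading the same integers as milliseconds gives minute and second fields of 26:40 and 43:20, which match the "Returned Dates" in the issue. That is suggestive of a unit mismatch, not a confirmed diagnosis.

```python
from datetime import datetime, timezone

# Assumption: the two distinct "Actual Dates" in the issue are UTC.
actual = [
    datetime(2016, 8, 8, 23, 8, 22, tzinfo=timezone.utc),
    datetime(2016, 8, 8, 23, 8, 23, tzinfo=timezone.utc),
]


def as_micros(ts: datetime) -> int:
    """Integer microseconds since the Unix epoch (whole-second inputs)."""
    return int(ts.timestamp()) * 1_000_000


def seconds_from_micros(micros: int) -> int:
    """What from_unixtime(date_added / 1000000) would be given."""
    return micros // 1_000_000


def minute_second_if_read_as_millis(micros: int) -> tuple:
    """Minute and second fields if the integer is misread as milliseconds."""
    seconds = micros // 1_000
    return (seconds % 3600) // 60, seconds % 60


stored = [as_micros(ts) for ts in actual]

# Dividing by 1,000,000 recovers the original timestamps.
recovered = [
    datetime.fromtimestamp(seconds_from_micros(m), tz=timezone.utc)
    for m in stored
]

# Misreading as milliseconds reproduces the minute:second fields seen in Athena.
misread = [minute_second_if_read_as_millis(m) for m in stored]
```

Only the minute and second fields are compared, because they do not depend on whole-hour time zone offsets or on how the reader's calendar handles far-future years.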

Have you tried the new output with MR-times in #83?
