Timestamp not parsed correctly on Athena

The timestamp columns come out wrong in Athena in spite of being correct in the DataFrame. The schema of the DataFrame is:

root
 |-- branch_id: long (nullable = true)
 |-- comment: string (nullable = true)
 |-- created_at: timestamp (nullable = true)
...

The output of the DataFrame show():

+---------+--------------------+-------------------+
|branch_id|             comment|         created_at|
+---------+--------------------+-------------------+
|    13501|                    |2017-05-09 08:21:35|
|    14081|                    |2017-05-09 08:53:29|
...
+---------+--------------------+-------------------+

The output in Athena after storing the DataFrame in Hudi format:

+---------+--------------------+-------------------------+
|branch_id|             comment|               created_at|
+---------+--------------------+-------------------------+
|    13501|                    |+49134-01-07 05:30:00.000|
|    14081|                    |+49153-08-06 07:20:00.000|
...
+---------+--------------------+-------------------------+

Code to write the DataFrame “main_df” in Hudi format:

# `table`, `partition_by`, `main_df` and `desturl` are defined elsewhere in the job.
hudi_options = {
    'hoodie.table.name': table.name,
    'hoodie.datasource.write.recordkey.field': table.primary_key,
    'hoodie.datasource.write.partitionpath.field': partition_by,
    'hoodie.datasource.write.table.name': table.name,
    'hoodie.datasource.write.operation': "upsert",
    'hoodie.datasource.write.precombine.field': "ts_ms",
    'hoodie.upsert.shuffle.parallelism': 2,
    'hoodie.insert.shuffle.parallelism': 2
}

# Append (upsert) the DataFrame to the Hudi table at `desturl`.
main_df.write.format("hudi"). \
    options(**hudi_options). \
    mode("append"). \
    save(desturl)

The issue is that Athena recognises int96 as timestamps, but not the int64 that Hudi writes. What is the fix for this?

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Reactions: 3
  • Comments: 23 (10 by maintainers)

Top GitHub Comments

1 reaction
jackye1995 commented, Apr 22, 2021

Just FYI, a similar fix has been done after that ticket was created, and this issue should no longer exist in Athena. You can now use timestamp as the column type and there is no need to perform a conversion using something like from_unixtime(createdt/1000000).
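
For readers still on an affected setup, the arithmetic behind from_unixtime(createdt/1000000) can be sketched in plain Python. This is only an illustration: it assumes the column holds microseconds since the Unix epoch in UTC, and the helper name micros_to_datetime is made up for this example, not part of Hudi or Athena.

```python
from datetime import datetime, timedelta, timezone

# Unix epoch as a timezone-aware datetime (UTC is an assumption here).
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def micros_to_datetime(micros: int) -> datetime:
    """Convert microseconds since the epoch to a UTC datetime.

    Hypothetical helper: it mirrors dividing by 1,000,000 before
    calling from_unixtime, but keeps integer precision.
    """
    return EPOCH + timedelta(microseconds=micros)


# 1_494_318_095_000_000 is 2017-05-09 08:21:35 UTC expressed in microseconds.
print(micros_to_datetime(1_494_318_095_000_000).isoformat())
```

Using timedelta with integer microseconds avoids the floating-point rounding that a plain division by 1,000,000 could introduce for sub-second values.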

1 reaction
satishkotha commented, Sep 30, 2020

This is a bit complicated. Hudi uses Spark converters to convert the DataFrame types into Parquet types. Spark's SchemaConverters converts timestamp to int64 with the logical type ‘TIMESTAMP_MICROS’.

This is because int96 is no longer supported in Parquet, especially in the parquet-avro module. In general, int96 is discouraged going forward.
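
A rough sketch of why an unaware reader can show far-future years: the int64 holds microseconds since the epoch, so a reader that treats the same number as a coarser unit lands tens of thousands of years ahead. The code below assumes the value is misread as milliseconds and that created_at is in UTC; both are assumptions for illustration, not a confirmed description of what Athena did, and the function names are hypothetical.

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
SECONDS_PER_YEAR = 365.2425 * 86400  # mean Gregorian year, for a rough estimate


def to_timestamp_micros(ts: datetime) -> int:
    """Encode a datetime as int64 microseconds since the epoch."""
    delta = ts - EPOCH
    return (delta.days * 86400 + delta.seconds) * 1_000_000 + delta.microseconds


def approx_year_if_read_as_millis(micros: int) -> float:
    """Estimate the calendar year if the stored value is treated as milliseconds."""
    seconds = micros / 1_000
    return 1970 + seconds / SECONDS_PER_YEAR


created_at = datetime(2017, 5, 9, 8, 21, 35, tzinfo=timezone.utc)
stored = to_timestamp_micros(created_at)
misread_year = approx_year_if_read_as_millis(stored)

print(stored)               # the int64 value written to Parquet
print(round(misread_year))  # a year in the 49000s instead of 2017
```

Python's own datetime cannot represent years past 9999, which is why the misread value is only estimated as a year number here.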

To make timestamps work, we had to:

  1. Change query engines to support reading the Parquet logical type. Example for Presto. We made a similar change for Hive. You probably need a similar change in Athena.
  2. Change DLASync/HiveSync to convert the logical type TIMESTAMP_MICROS to the Hive type ‘timestamp’. PR here

Unfortunately, there is no clean workaround. As I mentioned, this is a bit complicated. Please don’t hesitate to ping me if you have any questions.
