Timestamp not parsed correctly on Athena
The data types of the timestamp columns are wrong in Athena even though they are correct in the DataFrame. The schema of the DataFrame is:
root
|-- branch_id: long (nullable = true)
|-- comment: string (nullable = true)
|-- created_at: timestamp (nullable = true)
...
The output of the DataFrame show():
+---------+-------+-------------------+
|branch_id|comment|         created_at|
+---------+-------+-------------------+
|    13501|       |2017-05-09 08:21:35|
|    14081|       |2017-05-09 08:53:29|
...
+---------+-------+-------------------+
The output in Athena after storing the DataFrame in Hudi format:
+---------+-------+-------------------------+
|branch_id|comment|               created_at|
+---------+-------+-------------------------+
|    13501|       |+49134-01-07 05:30:00.000|
|    14081|       |+49153-08-06 07:20:00.000|
...
+---------+-------+-------------------------+
Code to write the DataFrame “main_df” in Hudi format:
hudi_options = {
    'hoodie.table.name': table.name,
    'hoodie.datasource.write.recordkey.field': table.primary_key,
    'hoodie.datasource.write.partitionpath.field': partition_by,
    'hoodie.datasource.write.table.name': table.name,
    'hoodie.datasource.write.operation': "upsert",
    'hoodie.datasource.write.precombine.field': "ts_ms",
    'hoodie.upsert.shuffle.parallelism': 2,
    'hoodie.insert.shuffle.parallelism': 2
}
main_df.write.format("hudi"). \
    options(**hudi_options). \
    mode("append"). \
    save(desturl)
The issue is that Athena recognises int96 as timestamps, but not the int64 that Hudi writes. What is the fix for this?
Issue Analytics
- State:
- Created 3 years ago
- Reactions:3
- Comments:23 (10 by maintainers)
Top GitHub Comments
Just FYI, a similar fix has been done after that ticket was created, and this issue should no longer exist in Athena. You can now use timestamp as the column type, and there is no need to perform a conversion using something like from_unixtime(createdt/1000000).

This is a bit complicated. Hudi uses Spark converters to convert the DataFrame types into Parquet types. Spark's SchemaConverters converts timestamp to int64 with the logical type 'TIMESTAMP_MICROS'.
This is because int96 is no longer supported in Parquet, especially in the parquet-avro module. In general, int96 is discouraged going forward.
To make timestamp work, we had to
Unfortunately, there is no clean workaround. As I mentioned, this is a bit complicated. Please don’t hesitate to ping me if you have any questions.
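For anyone trying to make sense of the far-future dates: the symptom, and the from_unixtime(createdt/1000000) conversion mentioned above, are both consistent with an int64 value holding microseconds since the epoch being read as if it held milliseconds. The sketch below is plain Python, not Hudi or Athena code. The helper names are hypothetical, and it assumes the sample row's created_at is in UTC, which the thread does not state, so the estimated year will not necessarily match the Athena output shown in the question.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
SECONDS_PER_YEAR = 31_556_952  # average Gregorian year, used only for a rough estimate


def to_epoch_micros(dt):
    """Encode a datetime as microseconds since the epoch (the TIMESTAMP_MICROS convention)."""
    return (dt - EPOCH) // timedelta(microseconds=1)


def read_as_micros(value):
    """Decode the stored integer correctly, i.e. as microseconds."""
    return EPOCH + timedelta(microseconds=value)


def approx_year_if_read_as_millis(value):
    """Estimate the year a reader would see if it treated the integer as milliseconds.

    datetime cannot represent years beyond 9999, so this is done arithmetically.
    """
    seconds = value / 1_000
    return 1970 + int(seconds // SECONDS_PER_YEAR)


# Sample value from the question; UTC is an assumption made for illustration.
created_at = datetime(2017, 5, 9, 8, 21, 35, tzinfo=timezone.utc)
stored = to_epoch_micros(created_at)

print(read_as_micros(stored))                 # the original timestamp
print(approx_year_if_read_as_millis(stored))  # a year in the tens of thousands
```

Dividing the stored value by 1,000,000 before handing it to a seconds-based function, as the from_unixtime workaround does, corresponds to read_as_micros here.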