
[SUPPORT]failed to read timestamp column in version 0.7.0 even when HIVE_SUPPORT_TIMESTAMP is enabled

See original GitHub issue

Failed to read the timestamp column after Hive sync is enabled.

Here is the list of versions used for testing:

hive = 3.1.2
hadoop = 3.2.2
spark = 3.0.1
hudi = 0.7.0

Here is the test application code snippet

import org.apache.spark.sql._
import org.apache.spark.sql.SaveMode._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming._
import scala.collection.JavaConversions._
import org.apache.hudi.QuickstartUtils._
import org.apache.hudi.DataSourceReadOptions
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.hudi.config.HoodieWriteConfig._
import org.apache.hudi.hive.MultiPartKeysValueExtractor
import org.apache.hudi.keygen._

case class Person(firstname:String, age:Int, gender:Int)
val personDF = List(Person("tom",45,1), Person("iris",44,0)).toDF.withColumn("ts",unix_timestamp).withColumn("insert_time",current_timestamp)
//val personDF2 = List(Person("peng",56,1), Person("iris",51,0),Person("jacky",25,1)).toDF.withColumn("ts",unix_timestamp).withColumn("insert_time",current_timestamp)

//personDF.write.mode(SaveMode.Overwrite).format("hudi").saveAsTable("employee")

val tableName = "employee"
val hudiCommonOptions = Map(
  "hoodie.compact.inline" -> "true",
  "hoodie.compact.inline.max.delta.commits" ->"5",
  "hoodie.base.path" -> s"/tmp/$tableName",
  "hoodie.table.name" -> tableName,
  "hoodie.datasource.write.table.type"->"MERGE_ON_READ",
  "hoodie.datasource.write.operation" -> "upsert",
  "hoodie.clean.async" -> "true"
)

val hudiHiveOptions = Map(
    DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY -> "true",
    DataSourceWriteOptions.HIVE_URL_OPT_KEY -> "jdbc:hive2://localhost:10000",
    DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY -> "gender",
    DataSourceWriteOptions.HIVE_STYLE_PARTITIONING_OPT_KEY -> "true",
    "hoodie.datasource.hive_sync.support_timestamp"->"true",
    DataSourceWriteOptions.HIVE_TABLE_OPT_KEY -> tableName,
    DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY -> classOf[MultiPartKeysValueExtractor].getName
)

val basePath = s"/tmp/$tableName"
personDF.write.format("hudi").
  option(PRECOMBINE_FIELD_OPT_KEY, "ts").
  option(RECORDKEY_FIELD_OPT_KEY, "firstname").
  option(PARTITIONPATH_FIELD_OPT_KEY, "gender").
  options(hudiCommonOptions).
  options(hudiHiveOptions).
  mode(SaveMode.Overwrite).
  save(basePath)

sql("select * from employee_rt").show(false)

The final query failed, and the following is the error message:

174262 [Executor task launch worker for task 12017] ERROR org.apache.spark.executor.Executor  - Exception in task 0.0 in stage 31.0 (TID 12017)
java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.hive.serde2.io.TimestampWritable
	at org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableTimestampObjectInspector.getPrimitiveJavaObject(WritableTimestampObjectInspector.java:39)
	at org.apache.spark.sql.hive.HadoopTableReader$.$anonfun$fillObject$14(TableReader.scala:468)
	at org.apache.spark.sql.hive.HadoopTableReader$.$anonfun$fillObject$14$adapted(TableReader.scala:467)
	at org.apache.spark.sql.hive.HadoopTableReader$.$anonfun$fillObject$18(TableReader.scala:493)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:872)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:872)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:127)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
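
The exception indicates the Hive reader expects a timestamp value where the underlying Parquet data comes back as a long. One way to narrow down which column is affected (a small diagnostic sketch, assuming the same spark-shell session and table names as above) is to compare the schema Hive sync created with the schema Spark infers from the Hudi files:

// Diagnostic sketch: compare the Hive-synced schema with the datasource schema
// to see which timestamp column ends up mismatched.
spark.sql("describe employee_rt").show(false)
spark.read.format("hudi").load("/tmp/employee").printSchema()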

If the Hudi files are read directly, the result is correct as expected.

val employeeDF = spark.read.format("hudi").load("/tmp/employee")
employeeDF.show(false)

The result looks like this:

+-------------------+--------------------+------------------+----------------------+------------------------------------------------------------------------+---------+---+------+----------+-----------------------+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name                                                       |firstname|age|gender|ts        |insert_time            |
+-------------------+--------------------+------------------+----------------------+------------------------------------------------------------------------+---------+---+------+----------+-----------------------+
|20210206160718     |20210206160718_0_1  |iris              |gender=0              |4fd7d48f-7828-4e77-97a5-a5202e32ad08-0_0-21-12008_20210206160718.parquet|iris     |44 |0     |1612598839|2021-02-06 16:07:19.251|
|20210206160718     |20210206160718_1_2  |tom               |gender=1              |c0014e5c-66d2-49fa-8af2-9a0b3df9bcf7-0_1-21-12009_20210206160718.parquet|tom      |45 |1     |1612598839|2021-02-06 16:07:19.251|
+-------------------+--------------------+------------------+----------------------+------------------------------------------------------------------------+---------+---+------+----------+-----------------------+
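
Since the direct read returns correct results, one possible read-side workaround (a sketch only, building on the snippet above rather than an official fix) is to skip the Hive-synced table and query the datasource DataFrame through a temporary view:

// Workaround sketch: register the datasource read as a temp view and query that
// instead of the Hive-synced employee_rt table.
val employeeSnapshotDF = spark.read.format("hudi").load("/tmp/employee")
employeeSnapshotDF.createOrReplaceTempView("employee_snapshot")
spark.sql("select firstname, age, gender, ts, insert_time from employee_snapshot").show(false)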

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 18 (14 by maintainers)

Top GitHub Comments

1 reaction
BenjMaq commented, May 19, 2021

Hey everyone, I’m also facing this issue. I see some of you guys already worked on some type of fix/workaround. How would you advise dealing with this?

I tried to add "hoodie.datasource.hive_sync.support_timestamp", true as an option but it does not look like it’s working.
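
For reference, this is roughly how the flag gets passed alongside the other Hive sync write options (a minimal sketch; the DataFrame, table name, and field names are placeholders borrowed from the snippet above, since the exact write call was not shared):

// Sketch only: support_timestamp is just one more write option next to the usual
// Hive sync settings. Keys are spelled out literally; adjust names and paths.
val hiveSyncOptions = Map(
  "hoodie.datasource.hive_sync.enable"            -> "true",
  "hoodie.datasource.hive_sync.support_timestamp" -> "true",
  "hoodie.datasource.hive_sync.jdbcurl"           -> "jdbc:hive2://localhost:10000",
  "hoodie.datasource.hive_sync.table"             -> "employee",
  "hoodie.datasource.hive_sync.partition_fields"  -> "gender"
)

personDF.write.format("hudi").
  option("hoodie.table.name", "employee").
  option("hoodie.datasource.write.recordkey.field", "firstname").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.datasource.write.partitionpath.field", "gender").
  options(hiveSyncOptions).
  mode("append").
  save("/tmp/employee")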

Also, I’ve seen multiple GitHub issues raised about this, and in most of them there’s a link to @li36909’s comment above as a workaround. Unfortunately, as @cdmikechen mentioned, “TimestampWritableV2 is a hive3 class”, and we are also relying on Hive2.

Is there a workaround for Hive2? I’ll be happy to help with anything to move this forward (given my relatively low familiarity with Hudi…). Thanks a mil!

0 reactions
Gatsby-Lee commented, Dec 2, 2021

@codope

Can you tell me where I can find the commit for this fix? And do you know if there is any downside to setting this config: “hoodie.datasource.hive_sync.support_timestamp” = “true”?

Read more comments on GitHub >

