
[SUPPORT] Hudi Spark DataSource saves TimestampType as bigInt

See original GitHub issue

Describe the problem you faced

It looks like an org.apache.spark.sql.types.TimestampType column gets converted to bigint when it is saved to a Hudi table.

To Reproduce

create dataframe with TimestampType

var seq = Seq((1, "2020-01-01 11:22:30", 2, 2))
var df = seq.toDF("pk", "time_string", "partition", "sort_key")
df = df.withColumn("timestamp", col("time_string").cast(TimestampType))

preview dataframe

df.show
+---+-------------------+---------+--------+-------------------+
| pk|        time_string|partition|sort_key|          timestamp|
+---+-------------------+---------+--------+-------------------+
|  1|2020-01-01 11:22:30|        2|       2|2020-01-01 11:22:30|
+---+-------------------+---------+--------+-------------------+

save dataframe to hudi table

df.write.format("org.apache.hudi").options(hudiOptions).mode(SaveMode.Append).save("s3://location")

view hudi table

spark.sql("select * from testTable2").show

Result: the timestamp column is returned as bigint

+-------------------+--------------------+------------------+----------------------+--------------------+---+-------------------+--------+----------------+---------+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name| pk|        time_string|sort_key|       timestamp|partition|
+-------------------+--------------------+------------------+----------------------+--------------------+---+-------------------+--------+----------------+---------+
|     20210201004527|  20210201004527_0_1|              pk:1|                     2|2972ef96-279b-438...|  1|2020-01-01 11:22:30|       2|1577877750000000|        2|
+-------------------+--------------------+------------------+----------------------+--------------------+---+-------------------+--------+----------------+---------+
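
The stored bigint looks like microseconds since the Unix epoch. A quick sanity check in the spark-shell (assuming the data was written with a UTC session timezone) confirms that 1577877750000000 matches the original value:

// Interpret the stored bigint as microseconds since the Unix epoch.
// 1577877750000000 µs / 1,000,000 = 1577877750 s = 2020-01-01T11:22:30Z.
val micros = 1577877750000000L
val instant = java.time.Instant.ofEpochSecond(micros / 1000000L)
println(instant) // prints 2020-01-01T11:22:30Z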

view schema

spark.sql("describe testTable2").show

result

+--------------------+---------+-------+
|            col_name|data_type|comment|
+--------------------+---------+-------+
| _hoodie_commit_time|   string|   null|
|_hoodie_commit_seqno|   string|   null|
|  _hoodie_record_key|   string|   null|
|_hoodie_partition...|   string|   null|
|   _hoodie_file_name|   string|   null|
|                  pk|      int|   null|
|         time_string|   string|   null|
|            sort_key|      int|   null|
|           timestamp|   bigint|   null|
|           partition|      int|   null|
|# Partition Infor...|         |       |
|          # col_name|data_type|comment|
|           partition|      int|   null|
+--------------------+---------+-------+
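
To see what is physically written, the data files can also be read directly with Spark's parquet reader. The sketch below rests on assumptions: the partition path s3://location/2/ and the *.parquet glob are guesses based on the partition value above, and Spark maps the parquet INT64/TIMESTAMP_MICROS logical type back to TimestampType, so the column should appear as timestamp here even though the Hive-synced table reports bigint:

// Read the underlying parquet files directly (the path is a guess based on
// the partition value "2" used above).
val raw = spark.read.parquet("s3://location/2/*.parquet")
raw.printSchema()                         // timestamp should show as timestamp, not bigint
raw.select("pk", "timestamp").show(false)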

Environment Description

  • Hudi version : 0.7.0

  • Spark version :

  • Hive version :

  • Hadoop version :

  • Storage (HDFS/S3/GCS…) : S3

  • Running on Docker? (yes/no) : no

Additional context

full code snippet

    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.types._
    import org.apache.hudi.hive.MultiPartKeysValueExtractor
    import org.apache.hudi.QuickstartUtils._
    import scala.collection.JavaConversions._
    import org.apache.spark.sql.SaveMode
    import org.apache.hudi.DataSourceReadOptions._
    import org.apache.hudi.DataSourceWriteOptions._
    import org.apache.hudi.DataSourceWriteOptions
    import org.apache.hudi.config.HoodieWriteConfig._
    import org.apache.hudi.config.HoodieWriteConfig
    import org.apache.hudi.keygen.ComplexKeyGenerator
    import org.apache.hudi.common.model.DefaultHoodieRecordPayload
    import org.apache.hadoop.hive.conf.HiveConf
    val hiveConf = new HiveConf()
    val hiveMetastoreURI = hiveConf.get("hive.metastore.uris").replaceAll("thrift://", "")
    val hiveServer2URI = hiveMetastoreURI.substring(0, hiveMetastoreURI.lastIndexOf(":"))
    var hudiOptions = Map[String, String](
      HoodieWriteConfig.TABLE_NAME -> "testTable2",
      "hoodie.consistency.check.enabled" -> "true",
      DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY -> "COPY_ON_WRITE",
      DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "pk",
      DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY -> classOf[ComplexKeyGenerator].getName,
      DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY -> "partition",
      DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "sort_key",
      DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY -> "true",
      DataSourceWriteOptions.HIVE_TABLE_OPT_KEY -> "testTable2",
      DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY -> "partition",
      DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY -> classOf[MultiPartKeysValueExtractor].getName,
      DataSourceWriteOptions.HIVE_URL_OPT_KEY -> s"jdbc:hive2://$hiveServer2URI:10000",
      "hoodie.payload.ordering.field" -> "sort_key",
      DataSourceWriteOptions.PAYLOAD_CLASS_OPT_KEY -> classOf[DefaultHoodieRecordPayload].getName
    )

    //spark.sql("drop table if exists testTable1")
    var seq = Seq((1, "2020-01-01 11:22:30", 2, 2))
    var df = seq.toDF("pk", "time_string", "partition", "sort_key")
    df = df.withColumn("timestamp", col("time_string").cast(TimestampType))
    df.show
    df.write.format("org.apache.hudi").options(hudiOptions).mode(SaveMode.Append).save("s3://location")
    spark.sql("select * from testTable2").show

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Reactions: 2
  • Comments: 25 (14 by maintainers)

Top GitHub Comments

3 reactions
satishkotha commented, Feb 3, 2021

@rubenssoto AFAIK, Athena is built on top of Presto, so you could ask them to apply the above Presto change. You can say this is needed for interpreting Parquet INT64 timestamps correctly.

3 reactions
satishkotha commented, Feb 2, 2021

Hi

If you set the support_timestamp property mentioned here, Hudi will convert the field to a timestamp type in Hive.
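
For reference, a minimal sketch of how that might look with the options map from the snippet above. The exact key name (hoodie.datasource.hive_sync.support_timestamp) is taken from the Hudi configuration docs and should be verified against the Hudi version in use (0.7.0 here):

// Hedged sketch: ask Hive sync to register the column as TIMESTAMP instead of BIGINT.
// The config key below is an assumption to confirm against your Hudi version.
val hudiOptionsWithTs = hudiOptions + ("hoodie.datasource.hive_sync.support_timestamp" -> "true")

df.write.format("org.apache.hudi")
  .options(hudiOptionsWithTs)
  .mode(SaveMode.Append)
  .save("s3://location")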

Note that you need to verify compatibility of this with the Hive/Presto/Athena versions you are using. We made some changes to interpret the field correctly as a timestamp; refer to this change in Presto, for example. We made similar changes in our internal Hive deployment.

Some more background: Hudi uses the parquet-avro module, which converts timestamps to INT64 with the logical type TIMESTAMP_MICROS. Hive and other query engines expect timestamps to be in INT96 format, but INT96 is no longer supported. The recommended path forward is to deprecate INT96 and change query engines to work with the INT64 type; https://issues.apache.org/jira/browse/PARQUET-1883 has additional details.
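
Until the query engine in use understands INT64/TIMESTAMP_MICROS, one read-side workaround is to convert the microsecond value back to a timestamp in the query. A minimal sketch in Spark (the division by 1,000,000 assumes the TIMESTAMP_MICROS encoding described above):

// Read-side workaround sketch: turn epoch microseconds back into a timestamp.
// Assumes the column holds TIMESTAMP_MICROS values, per the background above.
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.TimestampType

val fixed = spark.sql("select * from testTable2")
  .withColumn("timestamp_ts", (col("timestamp") / 1000000L).cast(TimestampType))

fixed.select("timestamp", "timestamp_ts").show(false)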

Read more comments on GitHub >

Top Results From Across the Web

[GitHub] [hudi] nsivabalan commented on issue #2509: [SUPPORT ...
[GitHub] [hudi] nsivabalan commented on issue #2509: [SUPPORT] Hudi Spark DataSource saves TimestampType as bigInt · GitBox Sun, 19 Dec 2021 19:36:32 -0800....
Spark Guide - Apache Hudi
This guide provides a quick peek at Hudi's capabilities using spark-shell. Using Spark datasources, we will walk through.
Work with a Hudi dataset - Amazon EMR - AWS Documentation
Hudi supports inserting, updating, and deleting data in Hudi datasets through Spark. For more information, see Writing Hudi tables in Apache Hudi ......
Building Streaming Data Lakes with Hudi and MinIO
Hudi's promise of providing optimizations that make analytic workloads faster for Apache Spark, Flink, Presto, Trino, and others dovetails ...
Apache Spark job fails with Parquet column cannot be ...
spark.sql.execution.datasources.SchemaColumnConvertNotSupportedException. Cause. The vectorized Parquet reader is decoding the decimal type ...
