
[SUPPORT] Hudi Spark DataSource saves TimestampType as bigInt

See original GitHub issue

Describe the problem you faced

It looks like an org.apache.spark.sql.types.TimestampType column gets converted to bigint when it is saved to a Hudi table.

To Reproduce

create dataframe with TimestampType

var seq = Seq((1, "2020-01-01 11:22:30", 2, 2))
var df = seq.toDF("pk", "time_string", "partition", "sort_key")
df = df.withColumn("timestamp", col("time_string").cast(TimestampType))

preview dataframe

df.show
+---+-------------------+---------+--------+-------------------+
| pk|        time_string|partition|sort_key|          timestamp|
+---+-------------------+---------+--------+-------------------+
|  1|2020-01-01 11:22:30|        2|       2|2020-01-01 11:22:30|
+---+-------------------+---------+--------+-------------------+

save dataframe to hudi table

df.write.format("org.apache.hudi").options(hudiOptions).mode(SaveMode.Append).save("s3://location")

view hudi table

spark.sql("select * from testTable2").show

Result: the timestamp column is returned as bigint

+-------------------+--------------------+------------------+----------------------+--------------------+---+-------------------+--------+----------------+---------+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name| pk|        time_string|sort_key|       timestamp|partition|
+-------------------+--------------------+------------------+----------------------+--------------------+---+-------------------+--------+----------------+---------+
|     20210201004527|  20210201004527_0_1|              pk:1|                     2|2972ef96-279b-438...|  1|2020-01-01 11:22:30|       2|1577877750000000|        2|
+-------------------+--------------------+------------------+----------------------+--------------------+---+-------------------+--------+----------------+---------+
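
The stored bigint looks like microseconds since the Unix epoch. A quick sanity check in the spark-shell (assuming the data was written with a UTC session timezone) confirms that 1577877750000000 matches the original value:

// Interpret the stored bigint as microseconds since the Unix epoch.
// 1577877750000000 µs / 1,000,000 = 1577877750 s = 2020-01-01T11:22:30Z.
val micros = 1577877750000000L
val instant = java.time.Instant.ofEpochSecond(micros / 1000000L)
println(instant) // prints 2020-01-01T11:22:30Z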

view schema

spark.sql("describe testTable2").show

result

+--------------------+---------+-------+
|            col_name|data_type|comment|
+--------------------+---------+-------+
| _hoodie_commit_time|   string|   null|
|_hoodie_commit_seqno|   string|   null|
|  _hoodie_record_key|   string|   null|
|_hoodie_partition...|   string|   null|
|   _hoodie_file_name|   string|   null|
|                  pk|      int|   null|
|         time_string|   string|   null|
|            sort_key|      int|   null|
|           timestamp|   bigint|   null|
|           partition|      int|   null|
|# Partition Infor...|         |       |
|          # col_name|data_type|comment|
|           partition|      int|   null|
+--------------------+---------+-------+
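
To see what is physically written, the data files can also be read directly with Spark's parquet reader. The sketch below rests on assumptions: the partition path s3://location/2/ and the *.parquet glob are guesses based on the partition value above, and Spark maps the parquet INT64/TIMESTAMP_MICROS logical type back to TimestampType, so the column should appear as timestamp here even though the Hive-synced table reports bigint:

// Read the underlying parquet files directly (the path is a guess based on
// the partition value "2" used above).
val raw = spark.read.parquet("s3://location/2/*.parquet")
raw.printSchema()                         // timestamp should show as timestamp, not bigint
raw.select("pk", "timestamp").show(false)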

Environment Description

  • Hudi version : 0.7.0

  • Spark version :

  • Hive version :

  • Hadoop version :

  • Storage (HDFS/S3/GCS…) : S3

  • Running on Docker? (yes/no) : no

Additional context

full code snippet

    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.types._
    import org.apache.hudi.hive.MultiPartKeysValueExtractor
    import org.apache.hudi.QuickstartUtils._
    import scala.collection.JavaConversions._
    import org.apache.spark.sql.SaveMode
    import org.apache.hudi.DataSourceReadOptions._
    import org.apache.hudi.DataSourceWriteOptions._
    import org.apache.hudi.DataSourceWriteOptions
    import org.apache.hudi.config.HoodieWriteConfig._
    import org.apache.hudi.config.HoodieWriteConfig
    import org.apache.hudi.keygen.ComplexKeyGenerator
    import org.apache.hudi.common.model.DefaultHoodieRecordPayload
    import org.apache.hadoop.hive.conf.HiveConf
    val hiveConf = new HiveConf()
    val hiveMetastoreURI = hiveConf.get("hive.metastore.uris").replaceAll("thrift://", "")
    val hiveServer2URI = hiveMetastoreURI.substring(0, hiveMetastoreURI.lastIndexOf(":"))
    var hudiOptions = Map[String, String](
      HoodieWriteConfig.TABLE_NAME -> "testTable2",
      "hoodie.consistency.check.enabled" -> "true",
      DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY -> "COPY_ON_WRITE",
      DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "pk",
      DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY -> classOf[ComplexKeyGenerator].getName,
      DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY -> "partition",
      DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "sort_key",
      DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY -> "true",
      DataSourceWriteOptions.HIVE_TABLE_OPT_KEY -> "testTable2",
      DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY -> "partition",
      DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY -> classOf[MultiPartKeysValueExtractor].getName,
      DataSourceWriteOptions.HIVE_URL_OPT_KEY -> s"jdbc:hive2://$hiveServer2URI:10000",
      "hoodie.payload.ordering.field" -> "sort_key",
      DataSourceWriteOptions.PAYLOAD_CLASS_OPT_KEY -> classOf[DefaultHoodieRecordPayload].getName
    )

    //spark.sql("drop table if exists testTable1")
    var seq = Seq((1, "2020-01-01 11:22:30", 2, 2))
    var df = seq.toDF("pk", "time_string", "partition", "sort_key")
    df = df.withColumn("timestamp", col("time_string").cast(TimestampType))
    df.show
    df.write.format("org.apache.hudi").options(hudiOptions).mode(SaveMode.Append).save("s3://location")
    spark.sql("select * from testTable2").show

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Reactions: 2
  • Comments: 25 (14 by maintainers)

Top GitHub Comments

3 reactions
satishkotha commented, Feb 3, 2021

@rubenssoto AFAIK, Athena is built on top of Presto, so you could ask them to apply the above Presto change. You can say this is needed for interpreting Parquet INT64 timestamps correctly.

3 reactions
satishkotha commented, Feb 2, 2021

Hi

If you set the support_timestamp property mentioned here, Hudi will convert the field to a timestamp type in Hive.
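
For reference, a minimal sketch of how that might look with the options map from the snippet above. The exact key name (hoodie.datasource.hive_sync.support_timestamp) is taken from the Hudi configuration docs and should be verified against the Hudi version in use (0.7.0 here):

// Hedged sketch: ask Hive sync to register the column as TIMESTAMP instead of BIGINT.
// The config key below is an assumption to confirm against your Hudi version.
val hudiOptionsWithTs = hudiOptions + ("hoodie.datasource.hive_sync.support_timestamp" -> "true")

df.write.format("org.apache.hudi")
  .options(hudiOptionsWithTs)
  .mode(SaveMode.Append)
  .save("s3://location")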

Note that you need to verify compatibility of this with the Hive/Presto/Athena versions you are using. We made some changes to interpret the field correctly as a timestamp; refer to this change in Presto, for example. We made similar changes in our internal Hive deployment.

Some more background: Hudi uses the parquet-avro module, which converts timestamps to INT64 with the logical type TIMESTAMP_MICROS. Hive and other query engines expect timestamps to be in INT96 format, but INT96 is no longer supported. The recommended path forward is to deprecate INT96 and change query engines to work with the INT64 type; https://issues.apache.org/jira/browse/PARQUET-1883 has additional details.
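
Until the query engine in use understands INT64/TIMESTAMP_MICROS, one read-side workaround is to convert the microsecond value back to a timestamp in the query. A minimal sketch in Spark (the division by 1,000,000 assumes the TIMESTAMP_MICROS encoding described above):

// Read-side workaround sketch: turn epoch microseconds back into a timestamp.
// Assumes the column holds TIMESTAMP_MICROS values, per the background above.
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.TimestampType

val fixed = spark.sql("select * from testTable2")
  .withColumn("timestamp_ts", (col("timestamp") / 1000000L).cast(TimestampType))

fixed.select("timestamp", "timestamp_ts").show(false)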

Read more comments on GitHub >

Top Results From Across the Web

[GitHub] [hudi] nsivabalan commented on issue #2509: [SUPPORT ...
[GitHub] [hudi] nsivabalan commented on issue #2509: [SUPPORT] Hudi Spark DataSource saves TimestampType as bigInt · GitBox Sun, 19 Dec 2021 19:36:32 -0800....
Spark Guide - Apache Hudi
This guide provides a quick peek at Hudi's capabilities using spark-shell. Using Spark datasources, we will walk through.
Work with a Hudi dataset - Amazon EMR - AWS Documentation
Hudi supports inserting, updating, and deleting data in Hudi datasets through Spark. For more information, see Writing Hudi tables in Apache Hudi ......
Building Streaming Data Lakes with Hudi and MinIO
Hudi's promise of providing optimizations that make analytic workloads faster for Apache Spark, Flink, Presto, Trino, and others dovetails ...
Apache Spark job fails with Parquet column cannot be ...
spark.sql.execution.datasources.SchemaColumnConvertNotSupportedException. Cause. The vectorized Parquet reader is decoding the decimal type ...
