
[BUG] Data corrupted in the timestamp field to 1970-01-01 19:45:30.000 after subsequent upsert run


Tips before filing an issue

  • Have you gone through our FAQs?

  • Join the mailing list to engage in conversations and get faster support at dev-subscribe@hudi.apache.org.

  • If you have triaged this as a bug, then file an issue directly.

Describe the problem you faced

I am using Hudi as part of Glue jobs to manage data mutations in our data lake. The only modification applied during the ETL from raw to bronze (where Hudi is involved) is the introduction of a timestamp field derived from the epoch-seconds date field in raw (i.e. 1602297930 -> 2020-10-10 02:45:30.000). I have noticed that for some entities, after subsequent ingestions, the timestamp field value becomes corrupted (even for records that are not present in the update). I managed to isolate bulk_insert as the source of the issue: the timestamp field gets reset to 1970-01-01 19:45:30.000. Note that when insert is used in place of bulk_insert for the initial ingest, the issue does not occur.
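For reference, the epoch-seconds to timestamp conversion itself behaves as expected in isolation (the report says plain insert does not show the problem). A minimal PySpark sketch of the same transform; the sample value is the one from the description and the column names follow the job below:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_unixtime

spark = SparkSession.builder.appName("epoch-to-timestamp-check").getOrCreate()

# One sample row holding the epoch-seconds value mentioned above
raw = spark.createDataFrame([(1602297930,)], ["created"])

# Same transform the Glue job applies: epoch seconds -> timestamp column
transformed = raw.select("*", from_unixtime(col("created")).cast("timestamp").alias("created_ts"))

# With a UTC session time zone this prints created_ts = 2020-10-10 02:45:30
transformed.show(truncate=False)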

To Reproduce

Steps to reproduce the behavior: unfortunately it is pretty tricky to reproduce, and it only happens to 2 of the 16 entity types we are using. We are moving Stripe data into the data lake, and the entities that hit the issue are invoices (https://stripe.com/docs/api/invoices) and customers (https://stripe.com/docs/api/customers).

  1. Create an AWS Glue ETL job with the following Hudi config:

     hudi_options = {
         "hoodie.table.create.schema": "default",
         "hoodie.table.name": tableName,
         "hoodie.datasource.write.recordkey.field": "id",
         "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
         "hoodie.datasource.write.partitionpath.field": "field_1, field_2",
         "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
         "hoodie.datasource.hive_sync.partition_fields": "field_1, field_2",
         "hoodie.cleaner.policy": "KEEP_LATEST_COMMITS",
         "hoodie.cleaner.commits.retained": 10,
         "hoodie.datasource.write.table.name": tableName,
         "hoodie.datasource.write.operation": "bulk_insert",
         "hoodie.parquet.compression.codec": "snappy",
         "hoodie.datasource.hive_sync.enable": "true",
         "hoodie.datasource.hive_sync.use_jdbc": "false",
         "hoodie.datasource.hive_sync.support_timestamp": "true",
         "hoodie.datasource.write.precombine.field": "lastupdate",
         "hoodie.datasource.hive_sync.database": f"{database_name}",
         "hoodie.datasource.hive_sync.table": tableName,
     }

     The transformation of the raw data looks as follows:

     sparkDF = dfc.select(list(dfc.keys())[0]).toDF()
     transformed = sparkDF.select("*", from_unixtime(col("created")).cast("timestamp").alias("created_ts"))
     resolvechoice4 = DynamicFrame.fromDF(transformed, glueContext, "transformedDF")
     return DynamicFrameCollection({"CustomTransform0": resolvechoice4}, glueContext)

  2. Write the data to the data lake:

     sparkDF.write.format("hudi").options(**hudi_options).mode("overwrite").save(basePath)

  3. Update the Hudi config above so that the write operation becomes upsert, and change the Spark save mode to append (see the sketch after this list).

  4. Trigger the Glue job again with the updated records.
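Putting steps 2-4 together, a condensed sketch of the two runs (a sketch only: hudi_options, sparkDF and basePath come from step 1, while updates stands for a hypothetical DataFrame holding the changed records of the second run):

# Run 1: initial ingest, hudi_options still has "hoodie.datasource.write.operation": "bulk_insert"
sparkDF.write.format("hudi").options(**hudi_options).mode("overwrite").save(basePath)

# Run 2: same job with the operation switched to upsert and the save mode to append
hudi_options["hoodie.datasource.write.operation"] = "upsert"
updates.write.format("hudi").options(**hudi_options).mode("append").save(basePath)

# Reading the table back after run 2 is where the report sees created_ts reset to
# 1970-01-01 19:45:30.000, even for records that were not part of the update
spark.read.format("hudi").load(basePath).select("id", "created", "created_ts").show(truncate=False)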

Expected behavior

I expect the data to remain consistent and the timestamp fields to keep their correct values.

Environment Description

  • Hudi version : 0.9.0

  • Spark version : 3.1

  • Hive version : using the AWS Glue Data Catalog as the Hive metastore

  • Hadoop version : N/A

  • Glue version: 3

  • Storage (HDFS/S3/GCS…) : S3

  • Running on Docker? (yes/no) :

Additional context

Add any other context about the problem here.

Stacktrace

There is no error

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 14 (7 by maintainers)

Top GitHub Comments

1 reaction
rafcis02 commented, Jan 28, 2022

I’ve tried it for BULK_INSERT and UPSERT as well, but nothing works for me.

I prepared a sample test job so you can reproduce it or just review it (I hope I just misconfigured it 😀). The upsert operation corrupts timestamps whether or not I set this option - I tested it using both the DataFrame writer and SQL MERGE INTO.

AWS Glue 2.0 (Spark 2.4), Hudi 0.10.1

import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig.TBL_NAME
import org.apache.hudi.hive.MultiPartKeysValueExtractor
import org.apache.hudi.hive.ddl.HiveSyncMode
import org.apache.hudi.hive.util.ConfigUtils
import org.apache.hudi.keygen.ComplexKeyGenerator
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.functions.{current_timestamp, expr, lit}
import org.apache.spark.sql.hudi.HoodieOptionConfig.{SQL_KEY_PRECOMBINE_FIELD, SQL_KEY_TABLE_PRIMARY_KEY, SQL_KEY_TABLE_TYPE}
import org.apache.spark.sql.hudi.{HoodieOptionConfig, HoodieSparkSessionExtension}
import org.apache.spark.sql.{SaveMode, SparkSession}

import scala.collection.JavaConverters.mapAsJavaMapConverter

case class HudiTableOptions(tableName: String,
                            databaseName: String,
                            operationType: String,
                            partitionColumns: List[String],
                            recordKeyColumns: List[String],
                            preCombineColumn: String) {

  def hudiTableOptions = Map(
    TABLE_TYPE.key() -> COW_TABLE_TYPE_OPT_VAL,
    OPERATION.key() -> operationType,
    TBL_NAME.key() -> tableName,
    RECORDKEY_FIELD.key() -> recordKeyColumns.mkString(","),
    PARTITIONPATH_FIELD.key() -> partitionColumns.mkString(","),
    KEYGENERATOR_CLASS_NAME.key() -> classOf[ComplexKeyGenerator].getName,
    PRECOMBINE_FIELD.key() -> preCombineColumn,
    URL_ENCODE_PARTITIONING.key() -> false.toString,
    KEYGENERATOR_CONSISTENT_LOGICAL_TIMESTAMP_ENABLED.key() -> true.toString // the option mentioned above; the corruption happens with or without it
  )

  def hiveTableProperties = Map(
    SQL_KEY_TABLE_TYPE.sqlKeyName -> HoodieOptionConfig.SQL_VALUE_TABLE_TYPE_COW,
    SQL_KEY_TABLE_PRIMARY_KEY.sqlKeyName -> hudiTableOptions(RECORDKEY_FIELD.key()),
    SQL_KEY_PRECOMBINE_FIELD.sqlKeyName -> hudiTableOptions(PRECOMBINE_FIELD.key())
  )

  def hiveTableOptions = Map(
    HIVE_SYNC_MODE.key() -> HiveSyncMode.HMS.name(),
    HIVE_SYNC_ENABLED.key() -> true.toString,
    HIVE_DATABASE.key() -> databaseName,
    HIVE_TABLE.key() -> hudiTableOptions(TBL_NAME.key()),
    HIVE_PARTITION_FIELDS.key() -> hudiTableOptions(PARTITIONPATH_FIELD.key()),
    HIVE_PARTITION_EXTRACTOR_CLASS.key() -> classOf[MultiPartKeysValueExtractor].getName,
    HIVE_STYLE_PARTITIONING.key() -> true.toString,
    HIVE_SUPPORT_TIMESTAMP_TYPE.key() -> true.toString,
    HIVE_TABLE_SERDE_PROPERTIES.key() -> ConfigUtils.configToString(hiveTableProperties.asJava)
  )

  def writerOptions: Map[String, String] = hudiTableOptions ++ hiveTableOptions
}

object GlueApp {
  def main(args: Array[String]): Unit = {
    Logger.getRootLogger.setLevel(Level.WARN)
    Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
    Logger.getLogger("org.apache.hudi").setLevel(Level.INFO)

    val spark: SparkSession = SparkSession.builder()
      .appName("Test")
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .config("hive.metastore.glue.catalogid", "****")
      .withExtensions(new HoodieSparkSessionExtension())
      .enableHiveSupport()
      .getOrCreate()

    import spark.implicits._

    val tableName = "hudi_test"
    val databaseName = "test_database"
    val path = "s3://s3bucketname/hudi-test/"
    val tableOptions = HudiTableOptions(tableName, databaseName, BULK_INSERT_OPERATION_OPT_VAL, List("year"), List("id"), "ts")

    val dataFrameForBulkInsert = Range(0, 10000).toDF("id")
      .withColumn("year", lit(2022))
      .withColumn("ts", current_timestamp() - expr("INTERVAL 100 DAYS"))
      .withColumn("other", current_timestamp() - expr("INTERVAL 150 DAYS"))

    val dataFrameForUpsert = Range(5000, 15000).toDF("id")
      .withColumn("year", lit(2022))
      .withColumn("ts", current_timestamp() - expr("INTERVAL 10 DAYS"))
      .withColumn("other", current_timestamp() - expr("INTERVAL 50 DAYS"))

    // ------------------------ BULK INSERT ------------------------------------------
    dataFrameForBulkInsert.write
      .format("org.apache.hudi")
      .options(tableOptions.writerOptions)
      .mode(SaveMode.Overwrite)
      .save(path)
    // -----------------------------------------------------------------------------

    Thread.sleep(10 * 1000)

    // --------------------- UPSERT Spark DataFrame Writer --------------------------
//    dataFrameForUpsert
//      .write
//      .format("org.apache.hudi")
//      .options(tableOptions.copy(operationType = UPSERT_OPERATION_OPT_VAL).writerOptions)
//      .mode(SaveMode.Append)
//      .save(path)
    // ------------------------------------------------------------------------------

    // --------------------- UPSERT SQL MERGE INTO-----------------------------------
    val mergeIntoStatement =
      s"""MERGE INTO $databaseName.$tableName AS t
         | USING source_data_set s
         | ON t.id=s.id
         | WHEN MATCHED THEN UPDATE SET *
         | WHEN NOT MATCHED THEN INSERT *
         |""".stripMargin
    dataFrameForUpsert.createTempView("source_data_set")
    spark.sql(mergeIntoStatement)
    // -------------------------------------------------------------------------------
  }
}
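For either reproduction (the PySpark Glue job described in the issue or the Scala job above), one way to confirm whether the timestamps survived is to read the Hudi table back after each write and inspect the timestamp columns. A minimal PySpark sketch, assuming the table path and the ts/other columns from the Scala job:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-timestamp-check").getOrCreate()

path = "s3://s3bucketname/hudi-test/"  # same base path the Scala job writes to

# Snapshot read of the Hudi table straight from storage
df = spark.read.format("hudi").load(path)

# ts and other were generated relative to the current date; values collapsing
# towards 1970-01-01 after the upsert indicate the corruption
df.select("id", "ts", "other").orderBy("id").show(20, truncate=False)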
0 reactions
jasondavindev commented, May 4, 2022

https://github.com/apache/hudi/issues/5469 The 0.11.0 version fixed it.
