
[BUG] Data corrupted in the timestamp field to 1970-01-01 19:45:30.000 after subsequent upsert run


Tips before filing an issue

  • Have you gone through our FAQs?

  • Join the mailing list to engage in conversations and get faster support at dev-subscribe@hudi.apache.org.

  • If you have triaged this as a bug, then file an issue directly.

Describe the problem you faced

I am using Hudi as part of Glue jobs to manage data mutations in our data lake. The only modification applied during the ETL from raw to bronze (where Hudi is involved) is the introduction of a timestamp field derived from the epoch-seconds date field in raw (i.e. 1602297930 -> 2020-10-10 02:45:30.000). I have noticed that for some entities, after subsequent ingestions, the timestamp field value becomes corrupted (even for records that are not present in the update). I managed to isolate bulk_insert as the source of the issue: the timestamp field gets reset to 1970-01-01 19:45:30.000. Note that when insert is used in place of bulk_insert for the initial ingest, the issue does not occur.
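For reference, the epoch-seconds to timestamp conversion itself behaves as expected in isolation (the report says plain insert does not show the problem). A minimal PySpark sketch of the same transform; the sample value is the one from the description and the column names follow the job below:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_unixtime

spark = SparkSession.builder.appName("epoch-to-timestamp-check").getOrCreate()

# One sample row holding the epoch-seconds value mentioned above
raw = spark.createDataFrame([(1602297930,)], ["created"])

# Same transform the Glue job applies: epoch seconds -> timestamp column
transformed = raw.select("*", from_unixtime(col("created")).cast("timestamp").alias("created_ts"))

# With a UTC session time zone this prints created_ts = 2020-10-10 02:45:30
transformed.show(truncate=False)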

To Reproduce

Steps to reproduce the behavior: unfortunately it is pretty tricky to reproduce, and it only happens to 2 of the 16 entity types we are using. We are moving Stripe data into the data lake, and the entities that hit the issue are invoices (https://stripe.com/docs/api/invoices) and customers (https://stripe.com/docs/api/customers).

  1. Create an AWS Glue ETL job with the following Hudi config:

     hudi_options = {
         "hoodie.table.create.schema": "default",
         "hoodie.table.name": tableName,
         "hoodie.datasource.write.recordkey.field": "id",
         "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
         "hoodie.datasource.write.partitionpath.field": "field_1, field_2",
         "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
         "hoodie.datasource.hive_sync.partition_fields": "field_1, field_2",
         "hoodie.cleaner.policy": "KEEP_LATEST_COMMITS",
         "hoodie.cleaner.commits.retained": 10,
         "hoodie.datasource.write.table.name": tableName,
         "hoodie.datasource.write.operation": "bulk_insert",
         "hoodie.parquet.compression.codec": "snappy",
         "hoodie.datasource.hive_sync.enable": "true",
         "hoodie.datasource.hive_sync.use_jdbc": "false",
         "hoodie.datasource.hive_sync.support_timestamp": "true",
         "hoodie.datasource.write.precombine.field": "lastupdate",
         "hoodie.datasource.hive_sync.database": f"{database_name}",
         "hoodie.datasource.hive_sync.table": tableName,
     }

     The transformation of the raw data looks as follows:

     sparkDF = dfc.select(list(dfc.keys())[0]).toDF()
     transformed = sparkDF.select("*", from_unixtime(col("created")).cast("timestamp").alias("created_ts"))
     resolvechoice4 = DynamicFrame.fromDF(transformed, glueContext, "transformedDF")
     return DynamicFrameCollection({"CustomTransform0": resolvechoice4}, glueContext)

  2. Write the data to the data lake:

     sparkDF.write.format("hudi").options(**hudi_options).mode("overwrite").save(basePath)

  3. Update the Hudi config above so that the write operation becomes upsert, and change the Spark save mode to append (see the sketch after this list).

  4. Trigger the Glue job again with the updated records.
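Putting steps 2-4 together, a condensed sketch of the two runs (a sketch only: hudi_options, sparkDF and basePath come from step 1, while updates stands for a hypothetical DataFrame holding the changed records of the second run):

# Run 1: initial ingest, hudi_options still has "hoodie.datasource.write.operation": "bulk_insert"
sparkDF.write.format("hudi").options(**hudi_options).mode("overwrite").save(basePath)

# Run 2: same job with the operation switched to upsert and the save mode to append
hudi_options["hoodie.datasource.write.operation"] = "upsert"
updates.write.format("hudi").options(**hudi_options).mode("append").save(basePath)

# Reading the table back after run 2 is where the report sees created_ts reset to
# 1970-01-01 19:45:30.000, even for records that were not part of the update
spark.read.format("hudi").load(basePath).select("id", "created", "created_ts").show(truncate=False)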

Expected behavior

I expect the data to remain consistent and the timestamp fields to keep their correct values.

Environment Description

  • Hudi version : 0.9.0

  • Spark version : 3.1

  • Hive version : using the AWS Glue Data Catalog as the Hive metastore

  • Hadoop version : N/A

  • Glue version: 3

  • Storage (HDFS/S3/GCS…) : S3

  • Running on Docker? (yes/no) :

Additional context

Add any other context about the problem here.

Stacktrace

There is no error

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 14 (7 by maintainers)

Top GitHub Comments

1 reaction
rafcis02 commented, Jan 28, 2022

I’ve tried it for BULK_INSERT and UPSERT as well, but nothing works for me.

I prepared a sample test job so you can reproduce it or just review it (I hope I just misconfigured it 😀). The upsert operation corrupts timestamps whether or not I set this option - I tested it using both the DataFrame writer and SQL MERGE INTO.

AWS Glue 2.0 (Spark 2.4), Hudi 0.10.1

import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig.TBL_NAME
import org.apache.hudi.hive.MultiPartKeysValueExtractor
import org.apache.hudi.hive.ddl.HiveSyncMode
import org.apache.hudi.hive.util.ConfigUtils
import org.apache.hudi.keygen.ComplexKeyGenerator
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.functions.{current_timestamp, expr, lit}
import org.apache.spark.sql.hudi.HoodieOptionConfig.{SQL_KEY_PRECOMBINE_FIELD, SQL_KEY_TABLE_PRIMARY_KEY, SQL_KEY_TABLE_TYPE}
import org.apache.spark.sql.hudi.{HoodieOptionConfig, HoodieSparkSessionExtension}
import org.apache.spark.sql.{SaveMode, SparkSession}

import scala.collection.JavaConverters.mapAsJavaMapConverter

case class HudiTableOptions(tableName: String,
                            databaseName: String,
                            operationType: String,
                            partitionColumns: List[String],
                            recordKeyColumns: List[String],
                            preCombineColumn: String) {

  def hudiTableOptions = Map(
    TABLE_TYPE.key() -> COW_TABLE_TYPE_OPT_VAL,
    OPERATION.key() -> operationType,
    TBL_NAME.key() -> tableName,
    RECORDKEY_FIELD.key() -> recordKeyColumns.mkString(","),
    PARTITIONPATH_FIELD.key() -> partitionColumns.mkString(","),
    KEYGENERATOR_CLASS_NAME.key() -> classOf[ComplexKeyGenerator].getName,
    PRECOMBINE_FIELD.key() -> preCombineColumn,
    URL_ENCODE_PARTITIONING.key() -> false.toString,
    KEYGENERATOR_CONSISTENT_LOGICAL_TIMESTAMP_ENABLED.key() -> true.toString // the option mentioned above; the corruption happens with or without it
  )

  def hiveTableProperties = Map(
    SQL_KEY_TABLE_TYPE.sqlKeyName -> HoodieOptionConfig.SQL_VALUE_TABLE_TYPE_COW,
    SQL_KEY_TABLE_PRIMARY_KEY.sqlKeyName -> hudiTableOptions(RECORDKEY_FIELD.key()),
    SQL_KEY_PRECOMBINE_FIELD.sqlKeyName -> hudiTableOptions(PRECOMBINE_FIELD.key())
  )

  def hiveTableOptions = Map(
    HIVE_SYNC_MODE.key() -> HiveSyncMode.HMS.name(),
    HIVE_SYNC_ENABLED.key() -> true.toString,
    HIVE_DATABASE.key() -> databaseName,
    HIVE_TABLE.key() -> hudiTableOptions(TBL_NAME.key()),
    HIVE_PARTITION_FIELDS.key() -> hudiTableOptions(PARTITIONPATH_FIELD.key()),
    HIVE_PARTITION_EXTRACTOR_CLASS.key() -> classOf[MultiPartKeysValueExtractor].getName,
    HIVE_STYLE_PARTITIONING.key() -> true.toString,
    HIVE_SUPPORT_TIMESTAMP_TYPE.key() -> true.toString,
    HIVE_TABLE_SERDE_PROPERTIES.key() -> ConfigUtils.configToString(hiveTableProperties.asJava)
  )

  def writerOptions: Map[String, String] = hudiTableOptions ++ hiveTableOptions
}

object GlueApp {
  def main(args: Array[String]): Unit = {
    Logger.getRootLogger.setLevel(Level.WARN)
    Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
    Logger.getLogger("org.apache.hudi").setLevel(Level.INFO)

    val spark: SparkSession = SparkSession.builder()
      .appName("Test")
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .config("hive.metastore.glue.catalogid", "****")
      .withExtensions(new HoodieSparkSessionExtension())
      .enableHiveSupport()
      .getOrCreate()

    import spark.implicits._

    val tableName = "hudi_test"
    val databaseName = "test_database"
    val path = "s3://s3bucketname/hudi-test/"
    val tableOptions = HudiTableOptions(tableName, databaseName, BULK_INSERT_OPERATION_OPT_VAL, List("year"), List("id"), "ts")

    val dataFrameForBulkInsert = Range(0, 10000).toDF("id")
      .withColumn("year", lit(2022))
      .withColumn("ts", current_timestamp() - expr("INTERVAL 100 DAYS"))
      .withColumn("other", current_timestamp() - expr("INTERVAL 150 DAYS"))

    val dataFrameForUpsert = Range(5000, 15000).toDF("id")
      .withColumn("year", lit(2022))
      .withColumn("ts", current_timestamp() - expr("INTERVAL 10 DAYS"))
      .withColumn("other", current_timestamp() - expr("INTERVAL 50 DAYS"))

    // ------------------------ BULK INSERT ------------------------------------------
    dataFrameForBulkInsert.write
      .format("org.apache.hudi")
      .options(tableOptions.writerOptions)
      .mode(SaveMode.Overwrite)
      .save(path)
    // -----------------------------------------------------------------------------

    Thread.sleep(10 * 1000)

    // --------------------- UPSERT Spark DataFrame Writer --------------------------
//    dataFrameForUpsert
//      .write
//      .format("org.apache.hudi")
//      .options(tableOptions.copy(operationType = UPSERT_OPERATION_OPT_VAL).writerOptions)
//      .mode(SaveMode.Append)
//      .save(path)
    // ------------------------------------------------------------------------------

    // --------------------- UPSERT SQL MERGE INTO-----------------------------------
    val mergeIntoStatement =
      s"""MERGE INTO $databaseName.$tableName AS t
         | USING source_data_set s
         | ON t.id=s.id
         | WHEN MATCHED THEN UPDATE SET *
         | WHEN NOT MATCHED THEN INSERT *
         |""".stripMargin
    dataFrameForUpsert.createTempView("source_data_set")
    spark.sql(mergeIntoStatement)
    // -------------------------------------------------------------------------------
  }
}
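For either reproduction (the PySpark Glue job described in the issue or the Scala job above), one way to confirm whether the timestamps survived is to read the Hudi table back after each write and inspect the timestamp columns. A minimal PySpark sketch, assuming the table path and the ts/other columns from the Scala job:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-timestamp-check").getOrCreate()

path = "s3://s3bucketname/hudi-test/"  # same base path the Scala job writes to

# Snapshot read of the Hudi table straight from storage
df = spark.read.format("hudi").load(path)

# ts and other were generated relative to the current date; values collapsing
# towards 1970-01-01 after the upsert indicate the corruption
df.select("id", "ts", "other").orderBy("id").show(20, truncate=False)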
0 reactions
jasondavindev commented, May 4, 2022

https://github.com/apache/hudi/issues/5469 The 0.11.0 version fixed it.
