[SUPPORT] HUDI MOR/COW tuning with spark structured streaming
Hello,
Quick explanation of the situation: I have multiple Kafka topics (one per table) containing CDC events sent by Debezium. I need to read those changes as a stream and update the corresponding table in Hive (1.2).
Tables can be huge (200M+ events), but the CDC volume is not: let's say a few thousand events per day per table at most. So the first sync could be painful, but once it is done, the CDC load should be pretty "light".
I first tried DeltaStreamer, but I need to do specific operations such as filtering data and converting dates, so I would rather write custom Spark code to get more flexibility.
I decided to use Structured Streaming to connect to all my topics (two choices here: one stream subscribed to several topics, or one stream per topic).
1 -> This solution needs a group-by on topic to be able to save the data into the corresponding table (not simple):
```scala
spark.readStream.format("kafka").options(xxxxx).option("subscribe", "all-topics")
```
2 -> This solution is easier to manage, but it creates lots of streams (more vCPUs):
```scala
for (table <- tables) {
  spark.readStream.format("kafka").options(xxxxx).option("subscribe", table.name)
}
```
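For option 1, here is a sketch of what the per-table split could look like inside `foreachBatch`, relying on the standard `topic` metadata column that Spark's Kafka source attaches to each row. The broker address, checkpoint path, and the inner write logic are placeholders; `tables` is assumed to be the same list as above:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.streaming.Trigger

val multiTopic = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // placeholder
  .option("subscribe", "all-topics")
  .load()

multiTopic.writeStream
  .trigger(Trigger.ProcessingTime("120 seconds"))
  .foreachBatch { (batch: DataFrame, id: Long) =>
    batch.persist() // avoid re-reading Kafka once per table
    for (table <- tables) {
      // each Debezium topic maps to one table, so filter on the topic column
      val perTable = batch.filter(col("topic") === table.name)
      if (!perTable.isEmpty) {
        // write perTable to the corresponding Hudi table here
      }
    }
    batch.unpersist()
  }
  .option("checkpointLocation", "/checkpoints/all-topics") // one checkpoint for the combined stream
  .start()
```

The persist/unpersist pair matters here: without it, each per-table filter would re-pull the batch from Kafka.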
After this, I use writeStream in Hudi format every 2 minutes to write the received data to the corresponding table:
```scala
writeStream
  .trigger(Trigger.ProcessingTime("120 seconds"))
  .foreachBatch { (df: DataFrame, id: Long) =>
    df.write.format("org.apache.hudi")
      .options(HudiUtils.getHudiOptions(table))
      .options(HudiUtils.getHiveSyncOptions(table.name))
      .options(HudiUtils.getCompactionOptions)
      .mode(SaveMode.Append)
      .save(config.pathConf.outputPath + "/out/" + table.name)
  }
  .option("checkpointLocation", config.pathConf.outputPath + "/checkpoint/" + table.name)
  .start()
```
Here is my configuration.
For Hudi:
```scala
Map(
  TABLE_TYPE_OPT_KEY -> "COPY_ON_WRITE",
  PRECOMBINE_FIELD_OPT_KEY -> "ts_ms",
  RECORDKEY_FIELD_OPT_KEY -> table.pk,
  OPERATION_OPT_KEY -> "upsert",
  KEYGENERATOR_CLASS_OPT_KEY -> "org.apache.hudi.keygen.NonpartitionedKeyGenerator",
  TABLE_NAME_OPT_KEY -> ("hudi_" + table.name),
  "hoodie.table.name" -> ("hudi_" + table.name),
  "hoodie.upsert.shuffle.parallelism" -> "2"
)
```
For compaction:
```scala
Map(
  "hoodie.compact.inline" -> "true",
  "hoodie.compact.inline.max.delta.commits" -> "1",
  "hoodie.cleaner.commits.retained" -> "1",
  "hoodie.cleaner.fileversions.retained" -> "1",
  "hoodie.clean.async" -> "false",
  "hoodie.clean.automatic" -> "true",
  "hoodie.parquet.compression.codec" -> "snappy"
)
```
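One note on the MOR/COW question in the title: compaction is a MERGE_ON_READ concept, so with `COPY_ON_WRITE` as configured above the inline-compaction settings are effectively no-ops. A hedged sketch of what a MOR variant of the table options might look like (the relaxed delta-commit count is an illustrative assumption, not a recommendation):

```scala
// Sketch only: MERGE_ON_READ variant of the table options above.
Map(
  TABLE_TYPE_OPT_KEY -> "MERGE_ON_READ",
  PRECOMBINE_FIELD_OPT_KEY -> "ts_ms",
  RECORDKEY_FIELD_OPT_KEY -> table.pk,
  OPERATION_OPT_KEY -> "upsert",
  KEYGENERATOR_CLASS_OPT_KEY -> "org.apache.hudi.keygen.NonpartitionedKeyGenerator",
  "hoodie.table.name" -> ("hudi_" + table.name),
  // compact every 5 delta commits instead of 1, to amortize compaction cost
  // across several micro-batches (illustrative value)
  "hoodie.compact.inline.max.delta.commits" -> "5"
)
```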
For Spark:
```scala
.config("spark.executor.cores", "3")
.config("spark.executor.instances", "5")
.config("spark.executor.memory", "2g")
.config("spark.rdd.compress", "true")
.config("spark.shuffle.service.enabled", "true")
.config("spark.sql.hive.convertMetastoreParquet", "false")
.config("spark.kryoserializer.buffer.max", "512m")
.config("spark.driver.memoryOverhead", "1024")
.config("spark.executor.memoryOverhead", "3072")
.config("spark.max.executor.failures", "100")
```
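For reference, a sketch of how those `.config(...)` fragments would sit in a SparkSession builder; the app name is a placeholder, and the Kryo serializer line is an assumption based on Hudi's setup docs, not something shown in the original config:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("cdc-to-hudi") // placeholder name
  // Hudi's docs call for the Kryo serializer (assumption: not in the original config)
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.executor.cores", "3")
  .config("spark.executor.instances", "5")
  .config("spark.executor.memory", "2g")
  .config("spark.sql.hive.convertMetastoreParquet", "false")
  .enableHiveSupport() // needed for the Hive sync shown above
  .getOrCreate()
```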
Expected behavior: I tried this code with a single topic containing 24K records, and it takes more than 5 minutes to write to HDFS. With multiple topics it hangs and can be pretty long…
Environment Description
- Hudi version : 0.6.0
- Spark version : 2.4.6
- Hive version : 1.2
- Hadoop version : 2.7
- Storage (HDFS/S3/GCS…) : HDFS
- Running on Docker? (yes/no) : no
Issue Analytics
- Created 3 years ago
- Comments: 15 (3 by maintainers)
Top GitHub Comments
Sorry for the delay; here are some tips about what I did.
Actually, DeltaStreamer can't handle the key in the Avro deserializer; that's why I wasn't able to test it. It is hardcoded to a StringDeserializer for the key and Avro for the value.
In my case, both are serialized in Avro.