
[SUPPORT] HUDI MOR/COW tuning with spark structured streaming


Hello,

Quick explanation of the situation: I have multiple Kafka topics (one per table) containing CDC events sent by Debezium. I need to read those changes as a stream and update the corresponding table in Hive (1.2).

Tables can be huge (200M+ events), but the CDC volume is not: let's say a few thousand events per day per table at most. So the first sync could be painful, but once it's done the CDC load should be pretty light.

I first tried DeltaStreamer, but I need to do specific operations such as filtering data and converting dates, so I'd rather do it in custom Spark code to get more flexibility.

I decided to use Structured Streaming to connect to all my topics (two choices here: one stream connected to several topics, or one stream per topic).

1 -> This solution needs a split by topic to be able to save the data in the corresponding table (not simple); see the sketch below.

spark.readStream.format("kafka").options(xxxxx).option("subscribe","all-topics")
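
For illustration, here is a minimal sketch of how that per-topic split could look inside foreachBatch, using the Kafka source's built-in topic column. kafkaOptions, topics and outputPath are placeholders, and HudiUtils.getHudiOptions is the helper shown further down; this is just one possible shape, not tested code:

    import org.apache.spark.sql.{DataFrame, SaveMode}
    import org.apache.spark.sql.functions.col
    import org.apache.spark.sql.streaming.Trigger

    spark.readStream
      .format("kafka")
      .options(kafkaOptions)                          // placeholder for broker/security options
      .option("subscribe", topics.mkString(","))      // all topics in a single stream
      .load()
      .writeStream
      .trigger(Trigger.ProcessingTime("120 seconds"))
      .foreachBatch { (batch: DataFrame, _: Long) =>
        // Split every micro-batch by topic and write each slice to its own Hudi table
        tables.foreach { table =>
          val slice = batch.filter(col("topic") === table.name)
          if (!slice.isEmpty) {
            slice.write.format("org.apache.hudi")
              .options(HudiUtils.getHudiOptions(table))
              .mode(SaveMode.Append)
              .save(outputPath + "/out/" + table.name)
          }
        }
      }
      .option("checkpointLocation", outputPath + "/checkpoint/all-topics")
      .start()

The price of this variant is that a single checkpoint and a single trigger cover all tables, and the per-table writes inside the batch run sequentially unless you parallelize them yourself.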

2 -> This solution is easier to manage, but it creates lots of streams (more vCPUs).

for (table <- tables) {
  spark.readStream.format("kafka").options(xxxxx).option("subscribe", table)
}

After this, I use writeStream in Hudi format every 2 minutes to write the received data to the corresponding table:

writeStream
        .trigger(Trigger.ProcessingTime("120 seconds"))
        .foreachBatch((df,id) => {
                   df.write.format("org.apache.hudi")
                    .options(HudiUtils.getHudiOptions(table))
                    .options(HudiUtils.getHiveSyncOptions(table.name))
                    .options(HudiUtils.getCompactionOptions)
                    .mode(SaveMode.Append)
                    .save(config.pathConf.outputPath + "/out/" + table.name )
        })
        .option("checkpointLocation",config.pathConf.outputPath + "/checkpoint/" + table.name)
        .start()
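
Since option 2 starts one query per table, the driver also has to block until the queries finish. A minimal way to do that, assuming all queries are started from the same SparkSession, is:

    // Block the driver until any of the per-table queries terminates (fail fast on errors)
    spark.streams.awaitAnyTermination()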

Here is my configuration:

For Hudi:

Map(
      TABLE_TYPE_OPT_KEY -> "COPY_ON_WRITE",
      PRECOMBINE_FIELD_OPT_KEY -> "ts_ms",
      RECORDKEY_FIELD_OPT_KEY -> table.pk,
      OPERATION_OPT_KEY -> "upsert",
      KEYGENERATOR_CLASS_OPT_KEY -> "org.apache.hudi.keygen.NonpartitionedKeyGenerator",
      TABLE_NAME_OPT_KEY -> ("hudi_" + table.name),
      "hoodie.table.name" -> ("hudi_" + table.name),
      "hoodie.upsert.shuffle.parallelism" -> "2"
    )

For compaction:

    Map(
      "hoodie.compact.inline" -> "true",
      "hoodie.compact.inline.max.delta.commits" -> "1",
      "hoodie.cleaner.commits.retained" -> "1",
      "hoodie.cleaner.fileversions.retained" -> "1",
      "hoodie.clean.async" -> "false",
      "hoodie.clean.automatic" ->"true",
      "hoodie.parquet.compression.codec" -> "snappy"
    )

For Spark:

    .config("spark.executor.cores", "3")
      .config("spark.executor.instances","5")
      .config("spark.executor.memory", "2g")
      .config("spark.rdd.compress","true")
      .config("spark.shuffle.service.enabled","true")
      .config("spark.sql.hive.convertMetastoreParquet","false")
      .config("spark.kryoserializer.buffer.max","512m")
      .config("spark.driver.memoryOverhead","1024")
      .config("spark.executor.memoryOverhead","3072")
      .config("spark.max.executor.failures","100")

Expected behavior

I tried this code with a single topic of 24K records, and it takes more than 5 minutes to write to HDFS. With multiple topics it hangs and can take pretty long…

(Two screenshots attached: 2020-10-14 at 12:04:15 and 12:04:35.)

Environment Description

  • Hudi version : 0.6.0

  • Spark version : 2.4.6

  • Hive version : 1.2

  • Hadoop version : 2.7

  • Storage (HDFS/S3/GCS…) : HDFS

  • Running on Docker? (yes/no) : no

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 15 (3 by maintainers)

Top GitHub Comments

2 reactions
spyzzz commented, Mar 28, 2022

Sorry for the delay.

Here are some tips about what I did:

    // Read the CDC stream for one table and deserialize the Avro value
    val df = SparkUtils.getStream(sparkSession, config).option("subscribe", table.name).load()

    val df2 = df
      .withColumn("deser_value", Deserializer.deser(table.name, config.getRegistryProps)(col("value")))
      .withColumn("parsed_value", from_json(col("deser_value"), sru.getLastestSchema(table.name).dataType))

    // Flatten the Debezium envelope. For deletes (op = "d") the "after" image is null,
    // so the record key is taken from "before" and the row is flagged with _hoodie_is_deleted.
    val df_upsert = df2.select("parsed_value")
      .select(
        col("parsed_value.after.*"),
        col("parsed_value.ts_ms"),
        col("parsed_value.op"),
        col("parsed_value.before." + table.pk).as("id_before")
      )
      .withColumn(table.pk, when(col("op") === "d", col("id_before")).otherwise(col(table.pk))).drop("id_before")
      .filter(col(table.pk).isNotNull)
      .withColumn("_hoodie_is_deleted", when(col("op") === "d", true).otherwise(false))


      df_upsert.writeStream
        .trigger(Trigger.ProcessingTime("120 seconds"))
        .foreachBatch((df,id) => {
            df.write.format("org.apache.hudi")
              .options(HudiUtils.getHudiOptions(table))
              .options(HudiUtils.getHiveSyncOptions(table.name))
              .options(HudiUtils.getCompactionOptions)
              .mode(SaveMode.Append)
              .save(config.pathConf.outputPath + "/out/" + table.name )
        })
        .option("checkpointLocation",config.pathConf.outputPath + "/checkpoint/" + table.name)
        .start()
object HudiUtils {

  def  getHudiOptions(table:Table) : Map[String,String] ={
    Map(
      TABLE_TYPE_OPT_KEY -> "MERGE_ON_READ",
      PRECOMBINE_FIELD_OPT_KEY -> "ts_ms",
      RECORDKEY_FIELD_OPT_KEY -> table.pk,
      OPERATION_OPT_KEY -> "upsert",
      KEYGENERATOR_CLASS_OPT_KEY-> "org.apache.hudi.keygen.NonpartitionedKeyGenerator",
      TABLE_NAME_OPT_KEY -> ("hudi_" + table.name),
      "hoodie.table.name" -> ("hudi_" + table.name),
      "hoodie.upsert.shuffle.parallelism"->  "6",
      "hoodie.insert.shuffle.parallelism"-> "6",
      "hoodie.bulkinsert.shuffle.parallelism"-> "6",
      "hoodie.parquet.small.file.limit" -> "4194304"
    )
  }

  def getCompactionOptions : Map[String,String] = {

    Map(
      "hoodie.compact.inline" -> "true",
      "hoodie.compact.inline.max.delta.commits" -> "10",
      "hoodie.cleaner.commits.retained" -> "10",
      "hoodie.cleaner.fileversions.retained" -> "10",
      "hoodie.keep.min.commits" -> "12",
      "hoodie.keep.max.commits" -> "13"
      //"hoodie.clean.async" -> "false",
      //"hoodie.clean.automatic" ->"true",
      //"hoodie.parquet.compression.codec" -> "snappy"
    )
  }

  def getHiveSyncOptions(tableName:String) : Map[String,String] = {
    Map(
      HIVE_SYNC_ENABLED_OPT_KEY -> "true",
      HIVE_USE_JDBC_OPT_KEY -> "false",
      HIVE_DATABASE_OPT_KEY -> "raw_eu_hudi",
      HIVE_URL_OPT_KEY -> "thrift://x",
      HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY-> "org.apache.hudi.hive.NonPartitionedExtractor",
      HIVE_TABLE_OPT_KEY -> tableName.split("\\.").drop(1).mkString("_")
    )
  }
}
object Deserializer extends Serializable {

  import scala.collection.JavaConverters._

  // UDF wrapping the Confluent Avro deserializer for the Kafka message value
  def deser(topic: String, props: Map[String, String]) = udf((input: Array[Byte]) => deserializeMessage(props, topic, input))

  val valueDeserializer = new KafkaAvroDeserializer()

  private def deserializeMessage(props: Map[String, String], topic: String, input: Array[Byte]): String = {
    try {
      valueDeserializer.configure(props.asJava, false) // false = value (not key) deserializer
      valueDeserializer.deserialize(topic, input).asInstanceOf[GenericRecord].toString
    } catch {
      case e: Exception =>
        e.printStackTrace()
        null
    }
  }
}
1 reaction
spyzzz commented, Oct 15, 2020

Actually, DeltaStreamer can't handle an Avro deserializer for the key, which is why I wasn't able to test it. It's hardcoded to a String deserializer for the key and Avro for the value.

In my case both key and value are serialized in Avro.
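
For reference, a minimal sketch of what deserializing the Avro key could look like in custom Spark code; it mirrors the value Deserializer above, and registryProps is a placeholder for the schema-registry settings:

    import scala.collection.JavaConverters._
    import io.confluent.kafka.serializers.KafkaAvroDeserializer
    import org.apache.avro.generic.GenericRecord
    import org.apache.spark.sql.functions.{col, udf}

    object KeyDeserializer extends Serializable {

      val keyDeserializer = new KafkaAvroDeserializer()

      // Same pattern as the value Deserializer, but configured with isKey = true
      // so the schema registry lookup uses the "<topic>-key" subject.
      def deserKey(topic: String, props: Map[String, String]) =
        udf((input: Array[Byte]) => {
          keyDeserializer.configure(props.asJava, true)
          keyDeserializer.deserialize(topic, input).asInstanceOf[GenericRecord].toString
        })
    }

    // Usage (registryProps is a placeholder):
    // df.withColumn("deser_key", KeyDeserializer.deserKey(table.name, registryProps)(col("key")))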
