
[SUPPORT] High performance costs of AvroSerializer in Datasource writing


Tips before filing an issue

  • Have you gone through our FAQs?

  • Join the mailing list to engage in conversations and get faster support at dev-subscribe@hudi.apache.org.

  • If you have triaged this as a bug, then file an issue directly.

Describe the problem you faced

When benchmarking Hudi datasource writing, we observed that a large amount of CPU time is spent converting the DataFrame into an RDD (HoodieSparkUtils::createRdd). Looking at the profiling flame graph, we found that around 80% of the reading time (source -> DataFrame -> RDD) is spent constructing the internal variables of AvroSerializer.

df.mapPartitions { rows =>
  // intended behavior: construct the serializer once per partition...
  val convert = new AvroSerializer()
  // ...and reuse it for every row
  rows.map(r => convert(r))
}

The above is pseudo code for the current createRdd implementation. At first glance, the variable convert appears to be initialized once per data partition, which should not cost much. However, looking at its source code, it actually returns a lambda function with some variables initialized inside the lambda body, so every input row triggers an almost-full initialization of AvroSerializer.
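
To illustrate the pitfall in isolation, here is a minimal, self-contained sketch; ExpensiveConverter and both factory functions are hypothetical stand-ins, not actual spark-avro or Hudi code:

class ExpensiveConverter(schema: String) {
  // stand-in for AvroSerializer: heavy setup happens in the constructor
  println(s"initializing converter for schema: $schema")
  def convert(row: String): String = row.toUpperCase
}

// Pitfall: construction sits inside the returned lambda, so it re-runs on every call
def makeConverterPerCall(schema: String): String => String =
  row => new ExpensiveConverter(schema).convert(row)

// Fix: hoist construction out of the lambda, so it runs exactly once
def makeConverterOnce(schema: String): String => String = {
  val converter = new ExpensiveConverter(schema)
  row => converter.convert(row)
}

Calling makeConverterPerCall("s")(row) in a loop prints the initialization message once per row; makeConverterOnce("s") prints it once, no matter how many rows are mapped.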

Because AvroSerializer resides in the spark-avro library, it is not easy to optimize directly in the Hudi codebase. I am wondering if there are any workarounds, e.g., another way to convert the DataFrame to an RDD, or re-implementing a better version of AvroSerializer in the Hudi codebase (see the sketch below for one possible direction).
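
One possible direction, as a rough sketch only: build the converter once per partition inside mapPartitions. Here toAvro is a hypothetical stand-in for the real (expensive) AvroSerializer setup, which is assumed to be non-serializable and therefore constructed on the executors:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq("a", "b", "c").toDF("value")

val rdd = df.rdd.mapPartitions { rows =>
  // hypothetical: pay the expensive construction once per partition...
  val toAvro: Any => String = v => v.toString
  // ...then reuse the converter for every row in the partition
  rows.map(r => toAvro(r.get(0)))
}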

To Reproduce

NA

Expected behavior

AvroSerializer should be initialized once per data partition, or even once on the driver and then serialized to the executors.

Environment Description

  • Hudi version : master

  • Spark version : spark3.2.0

  • Hive version :

  • Hadoop version :

  • Storage (HDFS/S3/GCS…) : Aliyun OSS

  • Running on Docker? (yes/no) : no

Additional context

[image attachment]

Issue Analytics

  • State: closed
  • Created: a year ago
  • Reactions: 1
  • Comments: 12 (12 by maintainers)

Top GitHub Comments

1 reaction
qjqqyy commented, Mar 24, 2022

> @qjqqyy Yes, each row will initialize AvroSerializer (variables in the lambda named converter)

The converter is re-initialized for each row because new AvroSerializer() (an explicit object creation) happens inside the lambda for each row.

I ran some tests on the exact same dataset.

git master: [screenshot of profiling results]

with the patch from the earlier comment: [screenshot of profiling results]

Seems like it’s a regression introduced in #4789

  • before: 1 invocation of new AvroSerializer() for each partition
  • after: 1 invocation of new AvroSerializer() for each row

0 reactions
alexeykudinkin commented, Mar 26, 2022

Thanks for flagging this @YuweiXiao, great catch!

To summarize: this is unfortunately a very sneaky issue, introduced accidentally during the refactoring of the AvroSerializer/Deserializer hierarchy in Hudi.

The crux of the issue is that the converter initializes an AvroSerializer/Deserializer upon every invocation, because the construction is done within the returned lambda itself (this also has the side effect of pulling the whole SparkAdapter into the closure):

def createAvroToInternalRowConverter(rootAvroType: Schema, rootCatalystType: StructType): GenericRecord => Option[InternalRow] =
  // the deserializer is created inside the lambda, i.e. once per record
  record => sparkAdapter.createAvroDeserializer(rootAvroType, rootCatalystType)
    .deserialize(record)
    .map(_.asInstanceOf[InternalRow])

Instead, it should have been:

def createAvroToInternalRowConverter(rootAvroType: Schema, rootCatalystType: StructType): GenericRecord => Option[InternalRow] = {
  // the deserializer is created once, outside the returned lambda
  val deserializer = sparkAdapter.createAvroDeserializer(rootAvroType, rootCatalystType)
  record => deserializer.deserialize(record).map(_.asInstanceOf[InternalRow])
}
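
With the deserializer hoisted out of the lambda, each record pays only for the lambda call. A hypothetical usage sketch, assuming avroSchema, catalystType, and an iterator of records are in scope:

val toRow = createAvroToInternalRowConverter(avroSchema, catalystType)
// the deserializer is built exactly once above and reused for every record below
val rows = records.flatMap(record => toRow(record))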

