[SUPPORT] High performance costs of AvroSerializer in Datasource writing
Tips before filing an issue
- Have you gone through our FAQs?
- Join the mailing list to engage in conversations and get faster support at dev-subscribe@hudi.apache.org.
- If you have triaged this as a bug, then file an issue directly.
Describe the problem you faced
When benchmarking Hudi datasource writing, we observed that a large amount of CPU time is spent converting the dataframe into an RDD (`HoodieSparkUtils::createRdd`). Looking at the profiling flame graph, we found that around 80% of the reading time (source -> dataframe -> RDD) is spent constructing the internal variables of `AvroSerializer`.
```scala
// Pseudocode of the current createRdd implementation
df.mapPartitions { rows =>
  val convert = new AvroSerializer(/* ... */)
  rows.map(r => convert.serialize(r))
}
```
The above is a pseudocode version of the current `createRdd` implementation. At first glance, we thought the variable `convert` is initialized once per data partition, which should not cost much. However, looking at its source code, it actually maintains a lambda function with some variables initialized inside it. So for each input row, we end up doing an almost full initialization of `AvroSerializer`.
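The hidden per-row cost can be sketched in plain Scala (no Spark). `ExpensiveSerializer` and the construction counter below are illustrative stand-ins for spark-avro's `AvroSerializer`, not actual Hudi or Spark code:

```scala
// Minimal sketch of the pitfall: when the conversion lambda constructs
// expensive state inside its body, the cost is paid once per row instead
// of once per partition. ExpensiveSerializer is a hypothetical stand-in.
object PerRowInitDemo {
  var constructions = 0

  class ExpensiveSerializer {
    constructions += 1 // stands in for building Avro schema converters
    def serialize(row: Int): String = s"record-$row"
  }

  // Anti-pattern: the returned lambda builds the serializer on every call,
  // so each row pays the full construction cost.
  def perRowConverter: Int => String =
    row => new ExpensiveSerializer().serialize(row)

  // Fix: build once, capture the instance, reuse it for every row.
  def perPartitionConverter: Int => String = {
    val serializer = new ExpensiveSerializer()
    row => serializer.serialize(row)
  }

  def main(args: Array[String]): Unit = {
    val rows = 1 to 1000

    constructions = 0
    rows.foreach(perRowConverter)
    println(s"per-row: $constructions constructions")        // 1000

    constructions = 0
    val convert = perPartitionConverter
    rows.foreach(convert)
    println(s"per-partition: $constructions constructions")  // 1
  }
}
```

Inside Spark, the same distinction applies per partition: `mapPartitions` runs its body once per partition, but any object created inside a per-row lambda is recreated for every record.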
Because `AvroSerializer` resides in the spark-avro lib, it is not easy to optimize it directly in the Hudi codebase. I am wondering if there are any workarounds, e.g., another way to convert df -> RDD, or re-implementing a better version of `AvroSerializer` in the Hudi codebase.
To Reproduce
NA
Expected behavior
`AvroSerializer` is initialized once per data partition, or even once on the driver and serialized to the executors.
Environment Description
- Hudi version : master
- Spark version : 3.2.0
- Hive version :
- Hadoop version :
- Storage (HDFS/S3/GCS…) : Aliyun OSS
- Running on Docker? (yes/no) : no
Additional context
Issue Analytics

- State:
- Created: a year ago
- Reactions: 1
- Comments: 12 (12 by maintainers)
Top GitHub Comments
The reason why `converter` is re-initialized for each row is that a `new AvroSerializer()` (explicit object creation) happens for each row.

I did some tests, exact same dataset:

- git master: `new AvroSerializer()` for each row
- with the patch from an earlier comment: `new AvroSerializer()` for each partition

Seems like it's a regression introduced in #4789.

Thanks for flagging this @YuweiXiao, great catch!
To summarize the issue here: it is unfortunately a very sneaky one, and it occurred accidentally during the refactoring of the `AvroSerializer`/`Deserializer` hierarchy in Hudi.

The crux of the issue is that the converter initializes `AvroSerializer`/`Deserializer` upon every invocation, because the initialization is done within the returned lambda itself (which also has the side effect of pulling the whole `SparkAdapter` into the closure). Instead, the initialization should have been done once, outside the lambda.
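The code snippets from the original comment were not captured in this scrape. As a hedged sketch of the shape of the bug and the fix: the method name `newSerializer` and the converter functions below are hypothetical illustrations, not the actual Hudi code.

```scala
// Sketch of the closure problem described above. Names are illustrative:
// this is not the Hudi implementation, only the shape of the bug and fix.
object ClosureFixSketch {
  trait SparkAdapter {
    def newSerializer(): Any => String
  }

  // Buggy shape: the serializer is created inside the returned lambda, so
  // it is rebuilt on every invocation, and `adapter` itself is pulled into
  // the closure.
  def converterPerRow(adapter: SparkAdapter): Any => String =
    record => adapter.newSerializer()(record)

  // Fixed shape: the serializer is created once, before the lambda is
  // returned; the closure captures only the serializer, not the adapter.
  def converterPerPartition(adapter: SparkAdapter): Any => String = {
    val serializer = adapter.newSerializer()
    record => serializer(record)
  }
}
```

The fixed shape also keeps the closure small, which matters for Spark task serialization: only the serializer instance is shipped with the lambda, not the enclosing adapter.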