[SUPPORT] High performance costs of AvroSerializer in Datasource writing
Tips before filing an issue
- Have you gone through our FAQs?
- Join the mailing list to engage in conversations and get faster support at dev-subscribe@hudi.apache.org.
- If you have triaged this as a bug, then file an issue directly.
Describe the problem you faced
When benchmarking Hudi datasource writing, we observed that a large amount of CPU time is spent converting the dataframe into an RDD (`HoodieSparkUtils::createRdd`). Looking at the profiling flame graph, we found that around 80% of the reading time (source -> dataframe -> RDD) is spent constructing the internal variables of `AvroSerializer`.
```scala
// Pseudocode of the current createRdd implementation
df.mapPartitions { rows =>
  val convert = new AvroSerializer(/* ... */)
  rows.map(r => convert.serialize(r))
}
```
The above is a pseudocode version of the current `createRdd` implementation. At first glance, we thought the variable `convert` is initialized once per data partition, which should not cost much. However, looking at its source code, it actually maintains a lambda function with some variables initialized inside it. So for each input row, we end up doing an almost full initialization of `AvroSerializer`.
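The hidden per-row cost can be sketched in plain Scala (no Spark). `ExpensiveSerializer` and the construction counter below are illustrative stand-ins for spark-avro's `AvroSerializer`, not actual Hudi or Spark code:

```scala
// Minimal sketch of the pitfall: when the conversion lambda constructs
// expensive state inside its body, the cost is paid once per row instead
// of once per partition. ExpensiveSerializer is a hypothetical stand-in.
object PerRowInitDemo {
  var constructions = 0

  class ExpensiveSerializer {
    constructions += 1 // stands in for building Avro schema converters
    def serialize(row: Int): String = s"record-$row"
  }

  // Anti-pattern: the returned lambda builds the serializer on every call,
  // so each row pays the full construction cost.
  def perRowConverter: Int => String =
    row => new ExpensiveSerializer().serialize(row)

  // Fix: build once, capture the instance, reuse it for every row.
  def perPartitionConverter: Int => String = {
    val serializer = new ExpensiveSerializer()
    row => serializer.serialize(row)
  }

  def main(args: Array[String]): Unit = {
    val rows = 1 to 1000

    constructions = 0
    rows.foreach(perRowConverter)
    println(s"per-row: $constructions constructions")        // 1000

    constructions = 0
    val convert = perPartitionConverter
    rows.foreach(convert)
    println(s"per-partition: $constructions constructions")  // 1
  }
}
```

Inside Spark, the same distinction applies per partition: `mapPartitions` runs its body once per partition, but any object created inside a per-row lambda is recreated for every record.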
Because `AvroSerializer` resides in the spark-avro lib, it is not easy to optimize it directly in the Hudi codebase. I am wondering if there are any workarounds, e.g., another way to convert df -> RDD, or re-implementing a better version of `AvroSerializer` in the Hudi codebase.
To Reproduce
NA
Expected behavior
`AvroSerializer` is initialized once per data partition, or even once on the driver and serialized to the executors.
Environment Description
- Hudi version : master
- Spark version : 3.2.0
- Hive version :
- Hadoop version :
- Storage (HDFS/S3/GCS…) : Aliyun OSS
- Running on Docker? (yes/no) : no
Additional context
Issue Analytics

- State:
- Created: a year ago
- Reactions: 1
- Comments: 12 (12 by maintainers)
Top GitHub Comments
The reason why `converter` is re-initialized for each row is that a `new AvroSerializer()` (explicit object creation) happens for each row.

I did some tests, exact same dataset:

- git master: `new AvroSerializer()` for each row
- with the patch from an earlier comment: `new AvroSerializer()` for each partition

Seems like it's a regression introduced in #4789.

Thanks for flagging this @YuweiXiao, great catch!
To summarize the issue here: it is unfortunately a very sneaky one, and it occurred accidentally during the refactoring of the `AvroSerializer`/`Deserializer` hierarchy in Hudi.

The crux of the issue is that the converter initializes `AvroSerializer`/`Deserializer` upon every invocation, because the initialization is done within the returned lambda itself (which also has the side effect of pulling the whole `SparkAdapter` into the closure). Instead, the initialization should have been done once, outside the lambda.
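The code snippets from the original comment were not captured in this scrape. As a hedged sketch of the shape of the bug and the fix: the method name `newSerializer` and the converter functions below are hypothetical illustrations, not the actual Hudi code.

```scala
// Sketch of the closure problem described above. Names are illustrative:
// this is not the Hudi implementation, only the shape of the bug and fix.
object ClosureFixSketch {
  trait SparkAdapter {
    def newSerializer(): Any => String
  }

  // Buggy shape: the serializer is created inside the returned lambda, so
  // it is rebuilt on every invocation, and `adapter` itself is pulled into
  // the closure.
  def converterPerRow(adapter: SparkAdapter): Any => String =
    record => adapter.newSerializer()(record)

  // Fixed shape: the serializer is created once, before the lambda is
  // returned; the closure captures only the serializer, not the adapter.
  def converterPerPartition(adapter: SparkAdapter): Any => String = {
    val serializer = adapter.newSerializer()
    record => serializer(record)
  }
}
```

The fixed shape also keeps the closure small, which matters for Spark task serialization: only the serializer instance is shipped with the lambda, not the enclosing adapter.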