Very slow processing time updating records into a Hudi dataset (MOR) using AWS Glue
Describe the problem you faced
Upserts in Hudi take a long time to execute: updates take 4 to 5 times longer than inserts, and 90% of the data needs to be updated.
The code below takes around 45 minutes to write new data (300 million records) to an AWS S3 bucket in Hudi format with 21 DPUs using AWS Glue, but it takes more than 3 hours to ingest the same data set again to update records and remove duplicates. Previously ingested data can be resent multiple times to correct data quality, and consumers only need the latest version of each record key.
Additional context
In the Apache Spark UI, the stage "Building workload profile: mapToPair at SparkHoodieBloomIndex.java:266", which takes the longest in the execution plan, shuffles the following:
- Shuffle Read Size / Records: 42.6 GiB / 540 000 000
- Shuffle Write Size / Records: 1237.8 GiB / 23 759 659 000
- Spill (Memory): 7.7 TiB
- Spill (Disk): 1241.6 GiB
Expected behavior
We have a window of 1 hour to execute the ETL process, which includes both inserts and updates. If inserting alone takes 45 minutes, the updates should not take longer than 1 hour to process 300 million records.
To Reproduce
Steps to reproduce the behavior:
Environment Description
- AWS Glue basic properties:
  - Glue version: Glue 2.0 - supports Spark 2.4, Scala 2, Python
  - Worker type: G.1X
  - Number of workers: 31 (1 driver and 30 executors) ==> 160 cores
- Hudi version (connector): ver_2.0_hudi_0.9.0_glue_1.0_or_2.0
- Storage: S3
- Running on Docker? (yes/no): no, running in AWS Glue
Code
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql.functions import year, month, date_format, to_date, col
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.session import SparkSession
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
spark = SparkSession.builder \
    .config('spark.serializer', 'org.apache.spark.serializer.KryoSerializer') \
    .config('spark.sql.hive.convertMetastoreParquet', 'false') \
    .getOrCreate()
sc = spark.sparkContext
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
targetPath = 's3://poc-lake-silver/interval_mor/data/'
commonConfig = {
    'className': 'org.apache.hudi',
    'path': 's3://poc-lake-silver/interval_mor/data/',
    'hoodie.bulkinsert.shuffle.parallelism': 320,
    'hoodie.upsert.shuffle.parallelism': 320,
    'hoodie.datasource.write.operation': 'upsert'
}
partitionDataConfig = {
    'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor'
}
dataSourceWriteConfig = {
    'hoodie.datasource.write.table.type': 'MERGE_ON_READ',
    'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.ComplexKeyGenerator',
    'hoodie.datasource.write.precombine.field': 'ingestionutc',
    'hoodie.datasource.write.partitionpath.field': 'intervaldate,plantuid',
    'hoodie.datasource.write.recordkey.field': 'intervalutc,asset,attribute'
}
dataSourceHiveConfig = {
    'hoodie.datasource.hive_sync.use_jdbc': 'false',
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.database': 'db_swat_lake_silver',
    'hoodie.datasource.hive_sync.table': 'interval_mor',
    'hoodie.datasource.hive_sync.partition_fields': 'intervaldate,plantuid'
}
dataTableConfig = {
    'hoodie.table.type': 'MERGE_ON_READ',
    'hoodie.index.type': 'BLOOM',
    'hoodie.table.keygenerator.class': 'org.apache.hudi.keygen.ComplexKeyGenerator',
    'hoodie.database.name': 'db_swat_lake_silver',
    'hoodie.table.name': 'interval_mor',
    'hoodie.table.precombine.field': 'ingestionutc',
    'hoodie.table.partition.fields': 'intervaldate,plantuid',
    'hoodie.table.recordkey.fields': 'intervalutc,asset,attribute'
}
finalConf = {**commonConfig, **partitionDataConfig, **dataSourceWriteConfig, **dataSourceHiveConfig, **dataTableConfig}
S3bucket_node1 = glueContext.create_dynamic_frame.from_options(
    format_options={},
    connection_type="s3",
    format="parquet",
    connection_options={
        "paths": [
            's3://poc-lake-bronze/dc/interval/data/ingest_date=20220221/ingest_hour=04/',
        ],
        "recurse": True
    },
    transformation_ctx="S3bucket_node1",
)
ApplyMapping_node2 = ApplyMapping.apply(
    frame=S3bucket_node1,
    mappings=[
        ("plant", "int", "plant", "int"),
        ("plantuid", "string", "plantuid", "string"),
        ("asset", "int", "asset", "int"),
        ("attribute", "string", "attribute", "string"),
        ("value", "double", "value", "double"),
        ("intervalutc", "timestamp", "intervalutc", "timestamp"),
        ("intervaldate", "string", "intervaldate", "string"),
        ("ingestiontsutc", "timestamp", "ingestionutc", "timestamp"),
    ],
    transformation_ctx="ApplyMapping_node2",
)
S3bucket_node3 = ApplyMapping_node2.toDF()
S3bucket_node3.write.format('org.apache.hudi').options(**finalConf).mode('Append').save(targetPath)
job.commit()
Stacktrace
(Screenshots: Spark Environment, Executors Summary, Stages, showing the stages)
- countByKey at SparkHoodieBloomIndex.java:114
- Building workload profile: mapToPair at SparkHoodieBloomIndex.java:266
Top GitHub Comments
Yes, do set hoodie.compact.inline to true and get the compaction moving; later you can think about moving to the async compaction flow. A hedged config sketch is shown below.
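For illustration only, a minimal sketch of how inline compaction could be enabled in the Glue job above, assuming the same finalConf dictionary and write call; the value for hoodie.compact.inline.max.delta.commits is an arbitrary example and should be tuned for the workload:

# Hedged sketch: enable inline compaction for the MOR table by extending finalConf.
# 'hoodie.compact.inline.max.delta.commits' (example value 5) controls how many delta
# commits accumulate before a compaction is triggered; tune it for your latency window.
compactionConfig = {
    'hoodie.compact.inline': 'true',
    'hoodie.compact.inline.max.delta.commits': 5,
}
finalConf = {**finalConf, **compactionConfig}

# Same write call as in the job above, now carrying the compaction settings.
S3bucket_node3.write.format('org.apache.hudi').options(**finalConf).mode('Append').save(targetPath)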
Btw, regarding the partitioning strategy, it depends on your query patterns: if your queries mostly have predicates on dates, partitioning by date is a wise choice. Even your ingestion might speed up depending on the partitioning strategy.
For example, if your dataset is date partitioned and your incoming records have data for only 5 partitions, the index lookup happens only within those 5 partitions. But if your partitioning is complex and based on both a date and some other field X, the number of partitions touched might be higher.
Also, do check the cardinality of your partitioning scheme. Ensure you have neither a very high number of (small) partitions nor too few (large) partitions; try to find some middle ground. A sketch of a date-only variant of the write config follows.
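As an illustration of the date-only layout discussed above (a hypothetical variant, not a recommendation from the thread), the job's write config could partition on intervaldate alone instead of 'intervaldate,plantuid':

# Hedged sketch: date-only partitioning variant of dataSourceWriteConfig above.
# Partitioning on intervaldate alone keeps the partition count lower, so an incoming
# batch only triggers index lookups in the dates it actually contains.
datePartitionedWriteConfig = {
    'hoodie.datasource.write.table.type': 'MERGE_ON_READ',
    'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.ComplexKeyGenerator',
    'hoodie.datasource.write.precombine.field': 'ingestionutc',
    'hoodie.datasource.write.partitionpath.field': 'intervaldate',
    'hoodie.datasource.write.recordkey.field': 'intervalutc,asset,attribute',
    'hoodie.datasource.hive_sync.partition_fields': 'intervaldate',
}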
You also asked about the timestamp characteristics of record keys. Let me try to illustrate with an example. Say your record keys are literally a timestamp field.
During commit1 you ingest records with min and max values t1 to t100, and this goes into data file1. With commit2 you ingest records with min and max t101 to t201, and this goes into data file2. And so on until commit10, where you ingest records with min and max t900 to t1000 into data file10.
Now, say you have some updates for records with keys t55 to t65 and t305 to t310. Since each file is nicely laid out, using just the min/max values all files except file1 and file3 will be filtered out in the first step of the index. Then the bloom filter lookup happens, and then the actual files are read to find the right location for the keys.
Alternatively, say your record keys are random: commit1 -> data file1: min and max values t5 and t500; commit2 -> data file2: min and max values t50 and t3000; ...; commit10 -> data file10: min and max values t70 and t2500.
Now, when some updates arrive, pretty much all files will be considered for the second step of the index. In other words, the min/max based pruning will not be effective and just adds to your latency.
Hope this clarifies what I mean by the timestamp characteristics of your record keys. A small sketch of this pruning idea is below.
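As an illustration only (not Hudi's actual SparkHoodieBloomIndex implementation), a minimal Python sketch of the min/max pruning step described above; the file ranges beyond those stated in the example (e.g. file3) and the candidate_files helper are hypothetical:

# Hedged sketch of min/max based file pruning, mirroring the example above.
# (min_key, max_key) recorded per data file in the "nicely laid out" case;
# file3's range is assumed for illustration.
file_key_ranges = {
    'file1': (1, 100),
    'file2': (101, 201),
    'file3': (202, 320),
    # ...
    'file10': (900, 1000),
}

def candidate_files(update_keys, key_ranges):
    """Return only the files whose [min, max] range could contain an incoming key."""
    candidates = set()
    for key in update_keys:
        for file_id, (lo, hi) in key_ranges.items():
            if lo <= key <= hi:
                candidates.add(file_id)
    return candidates

# Updates for keys t55..t65 and t305..t310: only file1 and file3 survive pruning,
# so the bloom filter check and the actual file reads touch just those two files.
updates = list(range(55, 66)) + list(range(305, 311))
print(candidate_files(updates, file_key_ranges))  # {'file1', 'file3'}

With random record keys, the ranges overlap so heavily that almost every file survives this first step, which is why the pruning stops helping.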
@pmgod8922
Hi, my config is pretty much like the one in https://github.com/apache/hudi/issues/4896, with some changes to use MoR.