
Processing time very Slow Updating records into Hudi Dataset(MOR) using AWS Glue


Describe the problem you faced

Upserts in Hudi take a long time: updates take 4 to 5 times longer than inserts, and 90% of the data needs to be updated.

The code below takes around 45 minutes to write new data (300 million records) to an AWS S3 bucket in Hudi format using AWS Glue with 21 DPUs. However, it takes more than 3 hours to re-ingest the same data set to update records and remove duplicates: source data can be resent multiple times to correct its quality, and consumers only need the latest version of each record key.

Additional context

In the Apache Spark UI, the stage "Building workload profile: mapToPair at SparkHoodieBloomIndex.java:266", which takes the longest in the execution plan, shuffles the following:

  • Shuffle Read Size / Records: 42.6 GiB / 540,000,000
  • Shuffle Write Size / Records: 1237.8 GiB / 23,759,659,000
  • Spill (Memory): 7.7 TiB
  • Spill (Disk): 1241.6 GiB
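For context, this stage appears to be part of the bloom index lookup that tags incoming records with the files they may already live in. The sketch below lists a few Hudi 0.9 index options that commonly influence this stage; whether any of them actually helps this particular workload is an assumption on my part, not something verified in this thread.

# Hedged sketch: index-related Hudi 0.9 options that affect the bloom-index
# lookup stage. Values shown are illustrative, not recommendations verified
# against this workload.
indexTuningConfig = {
    'hoodie.index.type': 'BLOOM',                    # or 'SIMPLE' to skip bloom filters entirely
    'hoodie.bloom.index.parallelism': 320,           # parallelism of the index lookup shuffle (0 = auto-tune)
    'hoodie.bloom.index.prune.by.ranges': 'true',    # min/max key-range pruning (see the maintainer comment below)
    'hoodie.bloom.index.filter.type': 'DYNAMIC_V0',  # dynamic bloom filters instead of fixed-size ones
}
# These would be merged into finalConf the same way the other config dicts in the job are.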

Expected behavior

We have a window of 1 hour to execute the ETL process, which includes both inserts and updates. Since inserting alone takes 45 minutes, the updates should not take longer than 1 hour to process 300 million records.

To Reproduce

Steps to reproduce the behavior:

Environment Description

  • AWS Glue basic properties:
      • Glue version : Glue 2.0 - supports Spark 2.4, Scala 2, Python
      • Worker type : G.1X
      • Number of workers : 31 (1 driver and 30 executors) ==> 160 cores
  • Hudi version : connector ver_2.0_hudi_0.9.0_glue_1.0_or_2.0
  • Storage : S3
  • Running on Docker? (yes/no) : no, running in AWS Glue

Code

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql.functions import year, month, date_format, to_date, col
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.session import SparkSession

# Glue/Spark setup
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
spark = SparkSession.builder \
    .config('spark.serializer', 'org.apache.spark.serializer.KryoSerializer') \
    .config('spark.sql.hive.convertMetastoreParquet', 'false') \
    .getOrCreate()
sc = spark.sparkContext
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
targetPath = 's3://poc-lake-silver/interval_mor/data/'

# Hudi write configuration
commonConfig = {
    'className': 'org.apache.hudi',
    'path': 's3://poc-lake-silver/interval_mor/data/',
    'hoodie.bulkinsert.shuffle.parallelism': 320,
    'hoodie.upsert.shuffle.parallelism': 320,
    'hoodie.datasource.write.operation': 'upsert'
}

partitionDataConfig = {
    'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor'
}

dataSourceWriteConfig = {
    'hoodie.datasource.write.table.type': 'MERGE_ON_READ',
    'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.ComplexKeyGenerator',
    'hoodie.datasource.write.precombine.field': 'ingestionutc',
    'hoodie.datasource.write.partitionpath.field': 'intervaldate,plantuid',
    'hoodie.datasource.write.recordkey.field': 'intervalutc,asset,attribute'
}

dataSourceHiveConfig = {
    'hoodie.datasource.hive_sync.use_jdbc': 'false',
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.database': 'db_swat_lake_silver',
    'hoodie.datasource.hive_sync.table': 'interval_mor',
    'hoodie.datasource.hive_sync.partition_fields': 'intervaldate,plantuid'
}

dataTableConfig = {
    'hoodie.table.type': 'MERGE_ON_READ',
    'hoodie.index.type': 'BLOOM',
    'hoodie.table.keygenerator.class': 'org.apache.hudi.keygen.ComplexKeyGenerator',
    'hoodie.database.name': 'db_swat_lake_silver',
    'hoodie.table.name': 'interval_mor',
    'hoodie.table.precombine.field': 'ingestionutc',
    'hoodie.table.partition.fields': 'intervaldate,plantuid',
    'hoodie.table.recordkey.fields': 'intervalutc,asset,attribute'
}

finalConf = {**commonConfig, **partitionDataConfig, **dataSourceWriteConfig, **dataSourceHiveConfig, **dataTableConfig}

# Read the raw parquet batch from the bronze bucket
S3bucket_node1 = glueContext.create_dynamic_frame.from_options(
    format_options={},
    connection_type="s3",
    format="parquet",
    connection_options={
        "paths": ['s3://poc-lake-bronze/dc/interval/data/ingest_date=20220221/ingest_hour=04/'],
        "recurse": True
    },
    transformation_ctx="S3bucket_node1",
)

# Map/rename source columns
ApplyMapping_node2 = ApplyMapping.apply(
    frame=S3bucket_node1,
    mappings=[
        ("plant", "int", "plant", "int"),
        ("plantuid", "string", "plantuid", "string"),
        ("asset", "int", "asset", "int"),
        ("attribute", "string", "attribute", "string"),
        ("value", "double", "value", "double"),
        ("intervalutc", "timestamp", "intervalutc", "timestamp"),
        ("intervaldate", "string", "intervaldate", "string"),
        ("ingestiontsutc", "timestamp", "ingestionutc", "timestamp"),
    ],
    transformation_ctx="ApplyMapping_node2",
)

# Upsert into the Hudi MoR table on S3
S3bucket_node3 = ApplyMapping_node2.toDF()
S3bucket_node3.write.format('org.apache.hudi').options(**finalConf).mode('Append').save(targetPath)

job.commit()
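As a side note on the requirement that consumers only need the latest version of each record key: a snapshot read of the MoR table returns exactly that, the latest merged value per key. A minimal read-back sketch follows; the query-type option is a standard Hudi read option, but the exact load path (with or without partition globs) depends on your Hudi and Spark versions.

# Hedged sketch: snapshot-read the MoR table back to verify that only the
# latest version of each record key is visible to consumers.
snapshotDF = (
    spark.read.format('org.apache.hudi')
    .option('hoodie.datasource.query.type', 'snapshot')  # merged view of base + log files
    .load(targetPath)  # older Hudi/Spark combinations may need partition globs, e.g. targetPath + '*/*'
)
snapshotDF.createOrReplaceTempView('interval_mor_snapshot')
spark.sql("""
    SELECT intervaldate, COUNT(*) AS records
    FROM interval_mor_snapshot
    GROUP BY intervaldate
""").show()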

Stacktrace

Spark UI screenshots attached to the original issue (not reproduced here): Spark environment, executors summary, stages overview, "countByKey at SparkHoodieBloomIndex.java:114", and "Building workload profile: mapToPair at SparkHoodieBloomIndex.java:266".

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Comments: 27 (7 by maintainers)

Top GitHub Comments

3 reactions
nsivabalan commented, Feb 25, 2022

Yes, do set hoodie.compact.inline to true and get the compaction moving. Later on you can think about moving to the async flow.
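In config terms, that suggestion maps to something like the sketch below; the max-delta-commits value is illustrative and not part of the maintainer's advice.

# Hedged sketch of the inline-compaction settings suggested above.
compactionConfig = {
    'hoodie.compact.inline': 'true',               # compact synchronously as part of each write
    'hoodie.compact.inline.max.delta.commits': 5,  # illustrative: compact after every 5 delta commits
}
# Moving to async compaction later would flip these, e.g.
# 'hoodie.compact.inline': 'false' and 'hoodie.datasource.compaction.async.enable': 'true'.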

Btw, w.r.t. partitioning strategy, it depends on your query patterns. If your queries mostly have predicates around dates, it's a wise choice to partition by dates.

Even your ingestion might speed up depending on your partitioning strategy.

For example, if your dataset is date partitioned and your incoming records have data for only 5 partitions, the index lookup happens only among those 5 partitions. But if your partitioning is more complex and is based off of both a date and some other field X, the number of partitions touched might be higher.

Also, do check the cardinality of your partitioning scheme. Ensure you do not have a very high number of partitions (small sized), nor too few partitions (large sized); try to find some middle ground.
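Concretely, a date-only partition layout for this table might look like the sketch below, reusing the field names from the job above; whether it fits your query patterns (and how you would migrate an already-written table) is an open question, not advice from this thread.

# Hedged sketch: partition only by the date field, so an incremental batch
# touches few partitions and the index lookup scans fewer file groups.
datePartitionConfig = {
    'hoodie.datasource.write.partitionpath.field': 'intervaldate',
    'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.ComplexKeyGenerator',
    'hoodie.datasource.hive_sync.partition_fields': 'intervaldate',
}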

You also asked about the timestamp characteristics of record keys, right? Let me try to illustrate with an example. Let's say your record keys are literally a timestamp field.

During commit1, you ingest records with min and max values of t1 to t100, and these go into data file1. With commit2, you ingest records with min and max of t101 to t201, and these go into data file2. And so on, until commit10, where you ingest records with min and max of t900 to t1000 into data file10.

Now, let's say you have some updates for records with keys t55 to t65 and t305 to t310. Since each file is nicely laid out, using just the min/max values, all files except file1 and file3 will be filtered out in the first step of the index. Then the bloom filter lookup happens, and finally the actual files are looked up to find the right location for the keys.

Alternatively, let's say your record keys are random:

commit1 -> data file1: min and max values are t5 and t500
commit2 -> data file2: min and max values are t50 and t3000
...
commit10 -> data file10: min and max values are t70 and t2500

Now when we get some updates, pretty much all files will be considered for the second step of the index. In other words, min/max based pruning will not be effective and it just adds to your latency.

Hope this clarifies what I mean by timestamp characteristics in your record keys.
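The min/max pruning described above can be illustrated with a small toy sketch; this is not Hudi code, just the filtering idea.

# Toy illustration of min/max range pruning (not Hudi internals).
# Each data file tracks the min and max record key it contains.
files = {
    'file1': ('t001', 't100'),   # keys laid out in order: pruning works well
    'file2': ('t101', 't201'),
    'file3': ('t202', 't301'),
}
updates = ['t055', 't065']

# Only files whose [min, max] range could contain an incoming key are kept
# for the (more expensive) bloom-filter and actual-file lookup steps.
candidates = {
    name for name, (lo, hi) in files.items()
    if any(lo <= key <= hi for key in updates)
}
print(candidates)  # {'file1'} -- the other files are pruned by key ranges alone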

1 reaction
Gatsby-Lee commented, Mar 9, 2022

@pmgod8922

Hi, my config is pretty much like the one in https://github.com/apache/hudi/issues/4896, with some changes to use MoR:

	'hoodie.compact.inline': 'false'
	'hoodie.datasource.compaction.async.enable': 'true'
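For completeness, here is a hedged sketch of how those two options could be folded into the finalConf dict from the original job; whether async compaction is the right trade-off for this workload isn't settled in this thread.

# Hedged sketch: merge the async-compaction options into the existing finalConf
# from the job above before writing.
asyncCompactionConfig = {
    'hoodie.compact.inline': 'false',
    'hoodie.datasource.compaction.async.enable': 'true',
}
finalConf = {**finalConf, **asyncCompactionConfig}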
