Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging third-party libraries. It collects links to all the places you might look while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

No Hudi dataset was saved to S3

See original GitHub issue

I am using DeltaStreamer to load files uploaded to an S3 bucket. Everything works fine with --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer, but it fails with --class org.apache.hudi.utilities.deltastreamer.HoodieMultiTableDeltaStreamer when I try to load multiple tables.
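
A sketch of the equivalent single-table invocation (assumed from the multi-table command shown further below, not the exact command that was run): the key difference is that HoodieDeltaStreamer takes --target-base-path, while the multi-table class takes --base-path-prefix.

spark-submit \
--jars "/usr/lib/spark/external/lib/spark-avro.jar,s3://fei-hudi-demo/aws-java-sdk-sqs-1.12.220.jar" \
--master yarn --deploy-mode client \
--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer /usr/lib/hudi/hudi-utilities-bundle.jar \
--table-type COPY_ON_WRITE \
--source-ordering-field metricvalue \
--target-base-path s3://fei-hudi-demo/s3_hudi_table \
--target-table table1 \
--props s3://fei-hudi-demo/hudi-ingestion-config/source.properties \
--source-class org.apache.hudi.utilities.sources.S3EventsHoodieIncrSource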

Here is part of the log. It seems the data was processed, but I don’t see any dataset created.

22/06/01 22:57:29 INFO TaskSetManager: Starting task 197.0 in stage 23.0 (TID 1802) (ip-172-31-20-205.ec2.internal, executor 2, partition 197, PROCESS_LOCAL, 4282 bytes) taskResourceAssignments Map()
22/06/01 22:57:29 INFO TaskSetManager: Finished task 189.0 in stage 23.0 (TID 1794) in 7 ms on ip-172-31-20-205.ec2.internal (executor 2) (190/200)
22/06/01 22:57:29 INFO TaskSetManager: Finished task 190.0 in stage 23.0 (TID 1795) in 7 ms on ip-172-31-20-205.ec2.internal (executor 2) (191/200)
22/06/01 22:57:29 INFO TaskSetManager: Starting task 198.0 in stage 23.0 (TID 1803) (ip-172-31-20-205.ec2.internal, executor 2, partition 198, PROCESS_LOCAL, 4282 bytes) taskResourceAssignments Map()
22/06/01 22:57:29 INFO TaskSetManager: Starting task 199.0 in stage 23.0 (TID 1804) (ip-172-31-30-20.ec2.internal, executor 1, partition 199, PROCESS_LOCAL, 4282 bytes) taskResourceAssignments Map()
22/06/01 22:57:29 INFO TaskSetManager: Finished task 192.0 in stage 23.0 (TID 1797) in 7 ms on ip-172-31-30-20.ec2.internal (executor 1) (192/200)
22/06/01 22:57:29 INFO TaskSetManager: Finished task 191.0 in stage 23.0 (TID 1796) in 7 ms on ip-172-31-30-20.ec2.internal (executor 1) (193/200)
22/06/01 22:57:29 INFO TaskSetManager: Finished task 196.0 in stage 23.0 (TID 1801) in 4 ms on ip-172-31-30-20.ec2.internal (executor 1) (194/200)
22/06/01 22:57:29 INFO TaskSetManager: Finished task 195.0 in stage 23.0 (TID 1800) in 5 ms on ip-172-31-30-20.ec2.internal (executor 1) (195/200)
22/06/01 22:57:29 INFO TaskSetManager: Finished task 194.0 in stage 23.0 (TID 1799) in 7 ms on ip-172-31-20-205.ec2.internal (executor 2) (196/200)
22/06/01 22:57:29 INFO TaskSetManager: Finished task 199.0 in stage 23.0 (TID 1804) in 3 ms on ip-172-31-30-20.ec2.internal (executor 1) (197/200)
22/06/01 22:57:29 INFO TaskSetManager: Finished task 197.0 in stage 23.0 (TID 1802) in 7 ms on ip-172-31-20-205.ec2.internal (executor 2) (198/200)
22/06/01 22:57:29 INFO TaskSetManager: Finished task 198.0 in stage 23.0 (TID 1803) in 7 ms on ip-172-31-20-205.ec2.internal (executor 2) (199/200)
22/06/01 22:57:29 INFO TaskSetManager: Finished task 193.0 in stage 23.0 (TID 1798) in 11 ms on ip-172-31-20-205.ec2.internal (executor 2) (200/200)
22/06/01 22:57:29 INFO YarnScheduler: Removed TaskSet 23.0, whose tasks have all completed, from pool
22/06/01 22:57:29 INFO DAGScheduler: ResultStage 23 (collect at SparkRejectUpdateStrategy.java:52) finished in 0.221 s
22/06/01 22:57:29 INFO DAGScheduler: Job 8 is finished. Cancelling potential speculative or zombie tasks for this job
22/06/01 22:57:29 INFO YarnScheduler: Killing all running tasks in stage 23: Stage finished
22/06/01 22:57:29 INFO DAGScheduler: Job 8 finished: collect at SparkRejectUpdateStrategy.java:52, took 0.468092 s
22/06/01 22:57:29 INFO SparkContext: Starting job: sum at DeltaSync.java:557
22/06/01 22:57:29 INFO DAGScheduler: Job 9 finished: sum at DeltaSync.java:557, took 0.000174 s
22/06/01 22:57:29 INFO SparkContext: Starting job: sum at DeltaSync.java:558
22/06/01 22:57:29 INFO DAGScheduler: Job 10 finished: sum at DeltaSync.java:558, took 0.000157 s
22/06/01 22:57:29 INFO SparkContext: Starting job: collect at SparkRDDWriteClient.java:124
22/06/01 22:57:29 INFO DAGScheduler: Job 11 finished: collect at SparkRDDWriteClient.java:124, took 0.000179 s
22/06/01 22:57:29 INFO S3NativeFileSystem: Opening 's3://fei-hudi-demo/s3_hudi_table/.hoodie/hoodie.properties' for reading
22/06/01 22:57:29 INFO S3NativeFileSystem: Opening 's3://fei-hudi-demo/s3_hudi_table/.hoodie/hoodie.properties' for reading
22/06/01 22:57:29 INFO MultipartUploadOutputStream: close closed:false s3://fei-hudi-demo/s3_hudi_table/.hoodie/20220601225723633.commit
22/06/01 22:57:29 INFO S3NativeFileSystem: Opening 's3://fei-hudi-demo/s3_hudi_table/.hoodie/hoodie.properties' for reading
22/06/01 22:57:29 INFO S3NativeFileSystem: Opening 's3://fei-hudi-demo/s3_hudi_table/.hoodie/hoodie.properties' for reading
22/06/01 22:57:30 INFO S3NativeFileSystem: Opening 's3://fei-hudi-demo/s3_hudi_table/.hoodie/hoodie.properties' for reading
22/06/01 22:57:30 INFO MapPartitionsRDD: Removing RDD 73 from persistence list
22/06/01 22:57:30 INFO BlockManager: Removing RDD 73
22/06/01 22:57:30 INFO MapPartitionsRDD: Removing RDD 60 from persistence list
22/06/01 22:57:30 INFO BlockManager: Removing RDD 60
22/06/01 22:57:30 INFO S3NativeFileSystem: Opening 's3://fei-hudi-demo/s3_hudi_table/.hoodie/hoodie.properties' for reading
22/06/01 22:57:30 INFO S3NativeFileSystem: Opening 's3://fei-hudi-demo/s3_hudi_table/.hoodie/20220601225723633.commit' for reading
22/06/01 22:57:30 INFO S3NativeFileSystem: Opening 's3://fei-hudi-demo/s3_meta_table/.hoodie/hoodie.properties' for reading
22/06/01 22:57:30 WARN S3EventsHoodieIncrSource: Already caught up. Begin Checkpoint was :20220601225531374
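
That final WARN line looks like the key: S3EventsHoodieIncrSource compares its begin checkpoint (20220601225531374 here) against the latest commit on the source metadata table, and "Already caught up" means it found nothing newer, so the batch is empty and no new data files get written to the target table. A quick way to verify (a sketch using the AWS CLI; the bucket and paths are taken from the logs above) is to list the metadata table's timeline and check whether any commit instant is newer than the checkpoint:

# List the .hoodie timeline of the S3 events metadata table.
# If no commit instant is newer than 20220601225531374, the incremental
# source has nothing to pull and each sync round produces an empty batch.
aws s3 ls s3://fei-hudi-demo/s3_meta_table/.hoodie/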

Here is the command I used:

spark-submit \
--jars "/usr/lib/spark/external/lib/spark-avro.jar,s3://fei-hudi-demo/aws-java-sdk-sqs-1.12.220.jar" \
--master yarn --deploy-mode client \
--class org.apache.hudi.utilities.deltastreamer.HoodieMultiTableDeltaStreamer /usr/lib/hudi/hudi-utilities-bundle.jar \
--table-type COPY_ON_WRITE \
--enable-hive-sync \
--source-ordering-field metricvalue --base-path-prefix s3://fei-hudi-demo/s3_hudi_table \
--target-table dummy_table --continuous --min-sync-interval-seconds 10 \
--props s3://fei-hudi-demo/hudi-ingestion-config/source.properties \
--config-folder s3://fei-hudi-demo/hudi-ingestion-config \
--source-class org.apache.hudi.utilities.sources.S3EventsHoodieIncrSource 
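
One thing worth checking here (hedged, since the exact behavior can depend on the Hudi version): HoodieMultiTableDeltaStreamer typically derives each table's base path as <base-path-prefix>/<database>/<table>, which for this config would be s3://fei-hudi-demo/s3_hudi_table/fei_hudi_test/table1 rather than the prefix root. The commit file in the logs sits directly under s3_hudi_table/.hoodie, so it is worth listing both locations:

# Check both the prefix root (where the logs show a commit being written)
# and the per-table path that multi-table mode usually derives.
aws s3 ls s3://fei-hudi-demo/s3_hudi_table/.hoodie/
aws s3 ls s3://fei-hudi-demo/s3_hudi_table/fei_hudi_test/table1/.hoodie/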

source.properties

hoodie.deltastreamer.ingestion.tablesToBeIngested=fei_hudi_test.table1
hoodie.deltastreamer.ingestion.fei_hudi_test.table1.configFile=s3://fei-hudi-demo/hudi-ingestion-config/t1.properties

hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator
hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.MultiPartKeysValueExtractor
hoodie.datasource.write.hive_style_partitioning=true 
hoodie.datasource.hive_sync.database=fei_hudi_test 
hoodie.deltastreamer.source.hoodieincr.read_latest_on_missing_ckpt=true 
hoodie.deltastreamer.source.s3incr.key.prefix=""
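
One possible culprit (an assumption worth verifying): Java-style .properties parsing does not strip quotation marks, so the key prefix above may be read as the literal two-character string "", which would match no S3 objects. If an empty prefix is intended, leaving the value blank should be safer:

# Hedged suggestion: an empty value instead of a quoted empty string
hoodie.deltastreamer.source.s3incr.key.prefix=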

t1.properties

hoodie.datasource.write.recordkey.field=launchId
hoodie.datasource.write.partitionpath.field=appPlatform:simple,appVersion:simple
hoodie.datasource.hive_sync.partition_fields=appPlatform,appVersion
hoodie.deltastreamer.source.hoodieincr.path=s3://fei-hudi-demo/s3_meta_table
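
For reference, with CustomKeyGenerator the field:simple syntax above treats each field as a plain string, and combined with hoodie.datasource.write.hive_style_partitioning=true the partition directories should come out Hive-style, along these lines (hypothetical values for illustration):

<target base path>/appPlatform=ios/appVersion=1.2.3/<parquet files>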

Thanks.

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 7 (6 by maintainers)

Top GitHub Comments

1 reaction
nsivabalan commented, Aug 16, 2022

@pratyakshsharma: can you follow up here, please?

0 reactions
xushiyan commented, Oct 30, 2022

Closing due to inactivity.


Top Results From Across the Web

Query an Apache Hudi dataset in an Amazon S3 data lake ...
An Apache Hudi dataset can be one of the following table types: Copy on Write (CoW) – Data is stored in columnar format...

AWS S3 - Apache Hudi
In this page, we explain how to get your Hudi spark job to store into AWS S3. ... such variables and then have...

Query and Update Hudi Dynamic Dataset in AWS S3 Data ...
In this blog, I am going to test it and see if Athena can read Hudi format data set in S3.

Using Athena to query Apache Hudi datasets - 亚马逊云科技
Data sets managed by Hudi are stored in S3 using open storage formats. Currently, Athena can read compacted Hudi datasets but not write...

Data Lakehouse on S3 Data Lake (w/o Hudi or Delta Lake)
Apache Hudi datasets consist of Copy on Write (CoW) and Merge on Read (MoR) table types. Copy on Write (CoW) has data stored...
