Some questions about using hudi
Background
- I am using a non-partitioned copy-on-write Hoodie table.
- I am using the Datasource API for bulk insert.
- I am using Spark Streaming for incremental upserts (insert, update, delete).
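Below is a minimal sketch of that incremental-upsert path, assuming Spark Structured Streaming with foreachBatch and the Hudi datasource. The table name, record key ("id"), precombine field ("ts"), and checkpoint path are placeholders, not details from this issue.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}
import org.apache.spark.sql.streaming.{StreamingQuery, Trigger}

// Hedged sketch: upsert each micro-batch into a Hudi table via foreachBatch.
// All field/table/path names below are placeholders.
def startIncrementalUpsert(streamingDF: DataFrame,
                           basePath: String,
                           tableName: String): StreamingQuery =
  streamingDF.writeStream
    .option("checkpointLocation", basePath + "/_checkpoints")
    .trigger(Trigger.ProcessingTime("60 seconds"))
    .foreachBatch { (batch: DataFrame, batchId: Long) =>
      batch.write
        .format("org.apache.hudi")
        .option("hoodie.datasource.write.operation", "upsert")
        .option("hoodie.datasource.write.recordkey.field", "id")  // placeholder key field
        .option("hoodie.datasource.write.precombine.field", "ts") // placeholder ordering field
        .option("hoodie.table.name", tableName)
        // Deletes usually need extra config (e.g. a delete-marker field or payload
        // class), omitted here.
        .mode(SaveMode.Append)
        .save(basePath)
    }
    .start()
```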
Questions
- Which parameter controls the parquet file size? I set the following parameters, but they do not seem to work: there are lots of 1.x MB files after I bulk insert into the Hoodie table (a config sketch follows these questions).
.option("hoodie.upsert.shuffle.parallelism","200") .option("hoodie.insert.shuffle.parallelism", "100") .option("hoodie.upsert.shuffle.parallelism", "100") .option(HoodieStorageConfig.PARQUET_FILE_MAX_BYTES, 256 * 1024 * 1024) .option(HoodieCompactionConfig.PARQUET_SMALL_FILE_LIMIT_BYTES, 128 * 1024 * 1024)
- For incremental upserts, my workload is about 1000 records per batch, but each batch costs 2 minutes, and most of the time is spent in HoodieBloomIndex.loadInvolvedFiles. That stage runs with a lot of parallel tasks. Which parameter controls this parallelism?
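For reference, here is a hedged sketch of the knobs relevant to both questions, assuming a reasonably recent Apache Hudi release. The key names (hoodie.parquet.max.file.size, hoodie.parquet.small.file.limit, hoodie.bulkinsert.shuffle.parallelism, hoodie.bloom.index.parallelism) are standard Hudi write configs, but the values, table name, and record-key/precombine fields are illustrative only.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Hedged sketch, not the maintainers' answer: check each knob against the Hudi
// version in use before relying on it.
def bulkInsert(df: DataFrame, basePath: String, tableName: String): Unit =
  df.write
    .format("org.apache.hudi")
    .option("hoodie.datasource.write.operation", "bulk_insert")
    // Q1: these size limits apply to upsert/insert file sizing; bulk_insert skips
    // small-file handling, so its file sizes are mostly inputSize / parallelism.
    .option("hoodie.parquet.max.file.size", (256L * 1024 * 1024).toString)
    .option("hoodie.parquet.small.file.limit", (128L * 1024 * 1024).toString)
    .option("hoodie.bulkinsert.shuffle.parallelism", "20")
    // Q2: candidate knob for the bloom-index lookup parallelism (assumption; the
    // exact stage breakdown depends on the Hudi version).
    .option("hoodie.bloom.index.parallelism", "100")
    .option("hoodie.datasource.write.recordkey.field", "id")   // placeholder
    .option("hoodie.datasource.write.precombine.field", "ts")  // placeholder
    .option("hoodie.table.name", tableName)
    .mode(SaveMode.Append)
    .save(basePath)
```

The nuance echoed in the comments below is that bulk_insert does not apply the small-file limit, so repeated bulk inserts can leave many small files that later upserts gradually pad back up toward the target size.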
Issue Analytics
- Created 5 years ago
- Comments: 5 (3 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
There isn’t any generic recommended Spark duration; it depends on a lot of factors, such as the amount of data ingested, the parallelism of your job, how long it takes to write a 256 MB parquet file on your cluster, etc., which are best controlled by the client. Having said that, here is what I’m saying:
So in summary, if you are using bulkInsert() repeatedly, you can see this side effect. If you are using upsert() after a bunch of small files are already lying around in your dataset, then depending on how many inserts you perform every batch, it will take some time for all the small files to be padded. Hope this makes things clearer.
@n3nash Thanks for your tips. I will do some tests on our cluster.
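To make the small-file point above concrete, here is a rough sizing heuristic (my assumption, not guidance from this thread): with bulk_insert the output file size is roughly the input size divided by the bulk-insert shuffle parallelism, so pick the parallelism to land near the target parquet size.

```scala
// Rough heuristic (assumption): choose the bulk_insert shuffle parallelism so that
// inputBytes / parallelism is close to the target parquet file size.
def suggestedBulkInsertParallelism(inputBytes: Long,
                                   targetFileBytes: Long = 256L * 1024 * 1024): Int =
  math.max(1, math.ceil(inputBytes.toDouble / targetFileBytes).toInt)

// Example: ~50 GB of input -> about 200 files of roughly 256 MB each.
val parallelism = suggestedBulkInsertParallelism(50L * 1024 * 1024 * 1024)  // 200
// then: .option("hoodie.bulkinsert.shuffle.parallelism", parallelism.toString)
```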