Some questions about using hudi
Background
- I am using a non-partitioned copy-on-write Hoodie table.
- I am using the Datasource API for bulk insert.
- I am using Spark Streaming for incremental upserts (insert, update, delete).
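Below is a minimal sketch of that incremental-upsert path, assuming Spark Structured Streaming with foreachBatch and the Hudi datasource. The table name, record key ("id"), precombine field ("ts"), and checkpoint path are placeholders, not details from this issue.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}
import org.apache.spark.sql.streaming.{StreamingQuery, Trigger}

// Hedged sketch: upsert each micro-batch into a Hudi table via foreachBatch.
// All field/table/path names below are placeholders.
def startIncrementalUpsert(streamingDF: DataFrame,
                           basePath: String,
                           tableName: String): StreamingQuery =
  streamingDF.writeStream
    .option("checkpointLocation", basePath + "/_checkpoints")
    .trigger(Trigger.ProcessingTime("60 seconds"))
    .foreachBatch { (batch: DataFrame, batchId: Long) =>
      batch.write
        .format("org.apache.hudi")
        .option("hoodie.datasource.write.operation", "upsert")
        .option("hoodie.datasource.write.recordkey.field", "id")  // placeholder key field
        .option("hoodie.datasource.write.precombine.field", "ts") // placeholder ordering field
        .option("hoodie.table.name", tableName)
        // Deletes usually need extra config (e.g. a delete-marker field or payload
        // class), omitted here.
        .mode(SaveMode.Append)
        .save(basePath)
    }
    .start()
```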
Questions
- Which parameter controls the parquet file size? I set the following parameters, but they do not seem to work: there are lots of 1.x MB files after I bulk insert into the Hoodie table (a config sketch follows these questions).
.option("hoodie.upsert.shuffle.parallelism","200") .option("hoodie.insert.shuffle.parallelism", "100") .option("hoodie.upsert.shuffle.parallelism", "100") .option(HoodieStorageConfig.PARQUET_FILE_MAX_BYTES, 256 * 1024 * 1024) .option(HoodieCompactionConfig.PARQUET_SMALL_FILE_LIMIT_BYTES, 128 * 1024 * 1024)
- For incremental upserts, my workload is about 1000 records per batch, but each batch costs 2 minutes, and most of the time is spent in HoodieBloomIndex.loadInvolvedFiles. That stage runs with a lot of parallel tasks. Which parameter controls this parallelism?
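For reference, here is a hedged sketch of the knobs relevant to both questions, assuming a reasonably recent Apache Hudi release. The key names (hoodie.parquet.max.file.size, hoodie.parquet.small.file.limit, hoodie.bulkinsert.shuffle.parallelism, hoodie.bloom.index.parallelism) are standard Hudi write configs, but the values, table name, and record-key/precombine fields are illustrative only.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Hedged sketch, not the maintainers' answer: check each knob against the Hudi
// version in use before relying on it.
def bulkInsert(df: DataFrame, basePath: String, tableName: String): Unit =
  df.write
    .format("org.apache.hudi")
    .option("hoodie.datasource.write.operation", "bulk_insert")
    // Q1: these size limits apply to upsert/insert file sizing; bulk_insert skips
    // small-file handling, so its file sizes are mostly inputSize / parallelism.
    .option("hoodie.parquet.max.file.size", (256L * 1024 * 1024).toString)
    .option("hoodie.parquet.small.file.limit", (128L * 1024 * 1024).toString)
    .option("hoodie.bulkinsert.shuffle.parallelism", "20")
    // Q2: candidate knob for the bloom-index lookup parallelism (assumption; the
    // exact stage breakdown depends on the Hudi version).
    .option("hoodie.bloom.index.parallelism", "100")
    .option("hoodie.datasource.write.recordkey.field", "id")   // placeholder
    .option("hoodie.datasource.write.precombine.field", "ts")  // placeholder
    .option("hoodie.table.name", tableName)
    .mode(SaveMode.Append)
    .save(basePath)
```

The nuance echoed in the comments below is that bulk_insert does not apply the small-file limit, so repeated bulk inserts can leave many small files that later upserts gradually pad back up toward the target size.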
Issue Analytics
- Created 5 years ago
- Comments: 5 (3 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
There isn’t any generic recommended Spark duration; it depends on a lot of factors, such as the amount of data ingested, the parallelism of your job, how long it takes to write a 256 MB parquet file on your cluster, etc., which are best controlled by the client. Having said that, here is what I’m saying:
So in summary, if you are using bulkInsert() repeatedly, you can see this side effect. If you are using upsert() after a bunch of small files are already lying around in your dataset, then depending on how many inserts you perform every batch, it will take some time for all the small files to be padded. Hope this makes things clearer.
@n3nash Thanks for your tips. I will do some tests on our cluster.
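To make the small-file point above concrete, here is a rough sizing heuristic (my assumption, not guidance from this thread): with bulk_insert the output file size is roughly the input size divided by the bulk-insert shuffle parallelism, so pick the parallelism to land near the target parquet size.

```scala
// Rough heuristic (assumption): choose the bulk_insert shuffle parallelism so that
// inputBytes / parallelism is close to the target parquet file size.
def suggestedBulkInsertParallelism(inputBytes: Long,
                                   targetFileBytes: Long = 256L * 1024 * 1024): Int =
  math.max(1, math.ceil(inputBytes.toDouble / targetFileBytes).toInt)

// Example: ~50 GB of input -> about 200 files of roughly 256 MB each.
val parallelism = suggestedBulkInsertParallelism(50L * 1024 * 1024 * 1024)  // 200
// then: .option("hoodie.bulkinsert.shuffle.parallelism", parallelism.toString)
```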