
Some questions about using Hudi

See original GitHub issue

Background

  1. I am using a non-partitioned copy-on-write (COW) Hudi table.
  2. I am using the Datasource API for bulk insert.
  3. I am using Spark Streaming for incremental upserts (insert, update, delete).

Questions

  1. Which parameter controls the Parquet file size? I set the options below, but they don't seem to take effect; there are lots of files of about 1.x MB after I bulk insert into the Hudi table (the sketch after the screenshot below shows where these options might be set):
     .option("hoodie.upsert.shuffle.parallelism", "200")
     .option("hoodie.insert.shuffle.parallelism", "100")
     .option("hoodie.upsert.shuffle.parallelism", "100")
     .option(HoodieStorageConfig.PARQUET_FILE_MAX_BYTES, 256 * 1024 * 1024)
     .option(HoodieCompactionConfig.PARQUET_SMALL_FILE_LIMIT_BYTES, 128 * 1024 * 1024)
  2. For incremental upserts, my workload is about 1,000 records per batch, but each batch takes about 2 minutes, and most of the time is spent in HoodieBloomIndex.loadInvolvedFiles. That stage runs with very high parallelism (see the screenshot below). Which parameter controls this parallelism?

[Screenshot: hudi-bloom-parallel]
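For context, here is a minimal sketch (in Spark Scala, not taken from the issue) of how these settings might be wired together on a single Datasource write against a non-partitioned copy-on-write table. The table name, base path, and the "uuid"/"ts" key fields are placeholders, and hoodie.bloom.index.parallelism is the option relevant to the parallelism asked about in question 2:

```scala
// Sketch only: writing a DataFrame to a non-partitioned COW Hudi table with
// explicit file-sizing, shuffle-parallelism, and bloom-index settings.
// "my_table", basePath, and the "uuid"/"ts" fields are hypothetical.
import org.apache.spark.sql.{DataFrame, SaveMode}

def writeBatch(df: DataFrame, basePath: String): Unit = {
  df.write
    .format("org.apache.hudi") // pre-Apache releases use the com.uber.hoodie package instead
    .option("hoodie.table.name", "my_table")                    // hypothetical table name
    .option("hoodie.datasource.write.operation", "upsert")      // or "bulk_insert" for the initial load
    .option("hoodie.datasource.write.recordkey.field", "uuid")  // placeholder record key
    .option("hoodie.datasource.write.precombine.field", "ts")   // placeholder ordering field
    // File sizing: target ~128 MB files; files under ~100 MB count as "small"
    .option("hoodie.parquet.max.file.size", String.valueOf(128 * 1024 * 1024))
    .option("hoodie.parquet.small.file.limit", String.valueOf(100 * 1024 * 1024))
    // Shuffle parallelism used by the upsert/insert paths
    .option("hoodie.upsert.shuffle.parallelism", "100")
    .option("hoodie.insert.shuffle.parallelism", "100")
    // Explicit parallelism for the bloom-index lookup stage (question 2)
    .option("hoodie.bloom.index.parallelism", "50")
    .mode(SaveMode.Append)
    .save(basePath)
}
```

The string keys above are the values behind the HoodieStorageConfig/HoodieCompactionConfig constants used in the question; whether the file-sizing options take effect also depends on which write operation is used, which is what the top comment below explains.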

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

3 reactions
n3nash commented, Jan 22, 2019

There isn’t any generic recommended Spark duration; it depends on a lot of factors, such as the amount of data ingested, the parallelism of your job, how long it takes to write a 256 MB Parquet file on your cluster, and so on, all of which are best tuned by the client. Having said that, here is what I’m saying:

  1. Using bulkInsert() -> Depending on the spread of your data, this may create some small files. The bulkInsert() API does NOT do small-file sizing, so in a scenario where you keep performing bulkInsert() on a dataset, you will end up creating small files.
  2. Using upsert() -> This API takes a bunch of inserts and updates, applies the updates to existing data in files, and pads the inserts into existing small files. If the number of inserts << the number of small files, it will take a long time for all the small files to be padded enough to reach the 128 MB file size.

So in summary: if you are using bulkInsert() repeatedly, you can see this side effect. If you are using upsert() after a bunch of small files are already lying around in your dataset, then depending on how many inserts you perform every batch, it will take some time for all the small files to be padded. Hope this makes things clearer.
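As a rough illustration of that pattern (assumed, not taken from the thread): use bulk_insert only for the initial load, then upsert for every streaming batch, so new inserts keep getting routed into the existing small files until they approach the small-file limit. Table name, base path, and key fields below are placeholders:

```scala
// Sketch only: bulk_insert once for the initial load, upsert for streaming
// batches so that inserts are packed into existing small files over time.
// "my_table", basePath, and the "uuid"/"ts" fields are hypothetical.
import org.apache.spark.sql.{DataFrame, SaveMode}

val commonOpts: Map[String, String] = Map(
  "hoodie.table.name"                        -> "my_table",
  "hoodie.datasource.write.recordkey.field"  -> "uuid",
  "hoodie.datasource.write.precombine.field" -> "ts",
  "hoodie.parquet.max.file.size"             -> String.valueOf(128 * 1024 * 1024),
  "hoodie.parquet.small.file.limit"          -> String.valueOf(100 * 1024 * 1024)
)

// One-time initial load: fast, but performs no small-file sizing.
def initialLoad(df: DataFrame, basePath: String): Unit =
  df.write.format("org.apache.hudi")
    .options(commonOpts)
    .option("hoodie.datasource.write.operation", "bulk_insert")
    .mode(SaveMode.Overwrite)
    .save(basePath)

// Per-batch incremental write: upsert pads inserts into existing small files.
def incrementalBatch(df: DataFrame, basePath: String): Unit =
  df.write.format("org.apache.hudi")
    .options(commonOpts)
    .option("hoodie.datasource.write.operation", "upsert")
    .mode(SaveMode.Append)
    .save(basePath)
```

With only about 1,000 inserts per batch, the padding described above converges slowly, which matches the behaviour reported in the question.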

0 reactions
louisliu318 commented, Jan 22, 2019

@n3nash Thanks for your tips. I will do some tests on our cluster.


