Can Hudi pull data for incremental queries with the latest snapshot alone?

I am PoCing Hudi and had a couple of questions -

Does Hudi allow incremental queries from a previous commit time to the latest snapshot while retaining only the current version of the data, say by setting "hoodie.cleaner.commits.retained": "1" and "hoodie.cleaner.fileversions.retained": "1"?
What are the differences between the two configuration options "hoodie.cleaner.commits.retained" and "hoodie.cleaner.fileversions.retained"?

Thanks in advance.

Issue Analytics

Created 2 years ago
Comments: 5 (3 by maintainers)

Top GitHub Comments

n3nash commented, May 31, 2021

@ChandraNarreddy Sorry for the delayed response. Hudi only keeps track of which records have changed. It knows this from the difference in the _hoodie_commit_time value between two records. It is not possible to tell from this timestamp alone whether a change to a record was an insert, an update, or a delete.
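For reference, an incremental query is driven by a begin instant on the read side. The sketch below only builds the Spark datasource read options; the helper name `incremental_read_options` and the instant value are illustrative assumptions, and the commented-out read call assumes an existing SparkSession and table path.

```python
def incremental_read_options(begin_instant, end_instant=None):
    """Build read options for a Hudi incremental query (hypothetical helper)."""
    opts = {
        # Switch the datasource from the default snapshot query to incremental.
        "hoodie.datasource.query.type": "incremental",
        # Records committed after this instant are returned.
        "hoodie.datasource.read.begin.instanttime": begin_instant,
    }
    if end_instant is not None:
        # Optional upper bound on the commits that are read.
        opts["hoodie.datasource.read.end.instanttime"] = end_instant
    return opts


# Illustrative instant in Hudi's yyyyMMddHHmmss format (an assumed value).
opts = incremental_read_options("20210519000000")

# Assumed usage, given a SparkSession `spark` and a table at `base_path`:
#   df = spark.read.format("hudi").options(**opts).load(base_path)
#   df.select("_hoodie_commit_time").distinct().show()
```

The resulting rows carry `_hoodie_commit_time`, but nothing in them says whether a row was inserted or updated.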

Hudi can keep older versions of the data only if the cleaner config is set to retain that many versions. Suppose a record changes 10 times; this means 10 versions of that record were created. Only if the cleaner config ensures that all 10 versions are kept can users look at the historical values of the data. Hudi’s incremental pull feature is not designed to provide a change log of all 10 values; it is meant to provide the latest state of each record that has changed since the last time you pulled incrementally.
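To make the "latest state, not a change log" point concrete, here is a toy model in plain Python. It is not Hudi code: the `Version` class, the `incremental_pull` function, and the three-digit commit times are all made up for illustration, and the model ignores file groups and the cleaner entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    key: str
    value: str
    commit_time: str  # stands in for _hoodie_commit_time


def latest_state(versions):
    """Keep only the newest version of each record key."""
    latest = {}
    for v in versions:
        current = latest.get(v.key)
        if current is None or v.commit_time > current.commit_time:
            latest[v.key] = v
    return latest


def incremental_pull(versions, begin_time):
    """Return the latest state of every record changed after begin_time."""
    return {
        key: v
        for key, v in latest_state(versions).items()
        if v.commit_time > begin_time
    }


# Record "a" changes 10 times (commits 001..010); record "b" is written once.
history = [Version("a", f"a-v{i}", f"{i:03d}") for i in range(1, 11)]
history.append(Version("b", "b-v1", "002"))

# Pulling since commit 004 yields one row for "a" (its latest value),
# not the six intermediate versions written in commits 005..010.
changed = incremental_pull(history, begin_time="004")
```

In this model, the intermediate values of "a" are only reachable if something keeps the older versions around, which is the role the cleaner settings play in the explanation above.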

Yes, hoodie.cleaner.fileversions.retained and hoodie.cleaner.commits.retained are independent of each other and apply to different cleaning policies. This blog from @pratyakshsharma is very helpful for understanding this: https://github.com/apache/hudi/pull/2967
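As a rough sketch of how the two settings pair with a policy, the hypothetical helper below builds write options for one policy at a time. The helper name `cleaner_options` is an assumption; the pairing of each `retained` key with its policy reflects the Hudi configuration reference as best understood here, so check it against the version you run.

```python
# Which "retained" setting each cleaning policy reads.
RETAINED_KEY_BY_POLICY = {
    "KEEP_LATEST_COMMITS": "hoodie.cleaner.commits.retained",
    "KEEP_LATEST_FILE_VERSIONS": "hoodie.cleaner.fileversions.retained",
}


def cleaner_options(policy, retained):
    """Build cleaner write options for a single policy (hypothetical helper)."""
    if policy not in RETAINED_KEY_BY_POLICY:
        raise ValueError(f"unsupported cleaning policy: {policy}")
    return {
        "hoodie.cleaner.policy": policy,
        RETAINED_KEY_BY_POLICY[policy]: str(retained),
    }


# Retain what is needed to serve the last 10 commits.
by_commits = cleaner_options("KEEP_LATEST_COMMITS", 10)

# Retain only the newest file version in each file group.
by_versions = cleaner_options("KEEP_LATEST_FILE_VERSIONS", 1)

# Assumed usage on the write path, given a DataFrame `df` and `base_path`:
#   df.write.format("hudi").options(**by_commits).mode("append").save(base_path)
```

Setting both `retained` keys at once, as in the question, does not combine them: only the key that matches the active policy should take effect.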

Please feel free to re-open if you have any other questions.

ChandraNarreddy commented, May 19, 2021

@n3nash, would you be able to address my follow-up questions? Thanks in advance.
