Can Hudi pull data for incremental queries with the latest snapshot alone?
I am PoCing Hudi and had a couple of questions:
- Does Hudi allow incremental queries from a previous commit time to the latest snapshot while maintaining just the current version of the data, say by setting "hoodie.cleaner.commits.retained": "1" and "hoodie.cleaner.fileversions.retained": "1"?
- What are the differences between the two configuration options "hoodie.cleaner.commits.retained" and "hoodie.cleaner.fileversions.retained"?
Thanks in advance.
Issue Analytics
- State:
- Created 2 years ago
- Comments:5 (3 by maintainers)
Top GitHub Comments
@ChandraNarreddy Sorry for the delayed response. Hudi only keeps track of which records have changed. This is known by comparing the _hoodie_commit_time values of records. It is not possible to tell whether a change was an insert, an update, or a delete just by looking at this timestamp.
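For reference, here is a minimal sketch of how an incremental query is usually configured through the Spark datasource. The helper names `incremental_read_options` and `read_changes` are hypothetical, and `read_changes` assumes an existing SparkSession with the Hudi bundle available; only the option keys and the `_hoodie_*` meta columns come from Hudi itself.

```python
def incremental_read_options(begin_instant, end_instant=None):
    """Build Hudi datasource read options for an incremental query.

    begin_instant / end_instant are commit times such as "20210101000000".
    """
    opts = {
        "hoodie.datasource.query.type": "incremental",
        "hoodie.datasource.read.begin.instanttime": begin_instant,
    }
    if end_instant is not None:
        # Optional upper bound; without it the query runs up to the latest commit.
        opts["hoodie.datasource.read.end.instanttime"] = end_instant
    return opts


def read_changes(spark, base_path, begin_instant):
    """Hypothetical helper: load records changed after begin_instant.

    Assumes `spark` is a SparkSession with the Hudi bundle on its classpath
    and `base_path` points at an existing Hudi table.
    """
    df = (
        spark.read.format("hudi")
        .options(**incremental_read_options(begin_instant))
        .load(base_path)
    )
    # The commit time says when a record last changed, not how it changed.
    return df.select("_hoodie_commit_time", "_hoodie_record_key")
```

The returned rows carry the commit time of the last change to each record, which is why the kind of change (insert, update, delete) cannot be read off the result.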
Hudi can only keep versions of data if the config is set to keep that many versions. Suppose a record changes 10 times; this means 10 versions of that record were created. Only if the cleaner config ensures that 10 versions of the data are kept can users look at the historical values of the data. Hudi’s incremental pull feature is not designed to provide a change log of all 10 values; it is meant to provide the latest state of each record that changed since the last time you incrementally pulled.
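To make the "latest state, not a change log" point concrete, here is a toy model in plain Python. It is not Hudi's implementation: the `incremental_pull` function and the in-memory `commits` list are purely illustrative assumptions.

```python
def incremental_pull(commits, begin_time):
    """Toy model: latest state of every record changed after begin_time.

    commits is a list of (commit_time, {record_key: value}) pairs.
    Returns {record_key: (commit_time, value)}.
    """
    latest = {}
    # Replay commits in time order so later writes overwrite earlier ones.
    for commit_time, changes in sorted(commits, key=lambda c: c[0]):
        if commit_time > begin_time:
            for key, value in changes.items():
                latest[key] = (commit_time, value)
    return latest


commits = [
    ("001", {"a": 1, "b": 10}),
    ("002", {"a": 2}),
    ("003", {"a": 3}),
]

# Record "a" changed twice after commit 001, but only its latest state
# comes back; record "b" did not change after 001, so it is absent.
changed = incremental_pull(commits, "001")
```

In this model the intermediate value written at commit 002 is never returned, which mirrors the behavior described above.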
Yes, hoodie.cleaner.fileversions.retained and hoodie.cleaner.commits.retained are independent of each other and correspond to different cleaning policies. This blog from @pratyakshsharma is very helpful for understanding this: https://github.com/apache/hudi/pull/2967
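As a rough sketch of how the two settings pair with a cleaning policy: my understanding is that each one takes effect under its corresponding value of `hoodie.cleaner.policy`. The helper `cleaner_options` is hypothetical; the option keys and policy names are Hudi's.

```python
def cleaner_options(policy, retained):
    """Build Hudi writer options for one cleaning policy.

    KEEP_LATEST_COMMITS pairs with hoodie.cleaner.commits.retained;
    KEEP_LATEST_FILE_VERSIONS pairs with hoodie.cleaner.fileversions.retained.
    """
    if policy == "KEEP_LATEST_COMMITS":
        key = "hoodie.cleaner.commits.retained"
    elif policy == "KEEP_LATEST_FILE_VERSIONS":
        key = "hoodie.cleaner.fileversions.retained"
    else:
        raise ValueError(f"unsupported cleaning policy: {policy}")
    # Hudi options are passed as strings.
    return {"hoodie.cleaner.policy": policy, key: str(retained)}


# Retain file slices written by the last 10 commits.
by_commits = cleaner_options("KEEP_LATEST_COMMITS", 10)

# Keep only the newest version of each file group.
by_versions = cleaner_options("KEEP_LATEST_FILE_VERSIONS", 1)
```

These dictionaries would be merged into the usual write options. With only one file version retained, older record values are cleaned up, so history from before the cleaner ran is not available to later queries.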
Please feel free to re-open if you have any other questions.
@n3nash, would you be able to address my follow-up questions? Thanks in advance.