Can Hudi pull data for incremental queries with the latest snapshot alone?

See original GitHub issue

I am running a proof of concept with Hudi and have a couple of questions:

  1. Does Hudi support incremental queries from a previous commit time to the latest snapshot while retaining only the current version of the data, for example by setting "hoodie.cleaner.commits.retained": "1" and "hoodie.cleaner.fileversions.retained": "1"?
  2. What are the differences between the two configuration options "hoodie.cleaner.commits.retained" and "hoodie.cleaner.fileversions.retained"?

Thanks in advance.

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

n3nash commented, May 31, 2021

@ChandraNarreddy Sorry for the delayed response. Hudi only keeps track of which records have changed. This is determined by comparing the _hoodie_commit_time values of records. It is not possible to tell whether a change was an insert, update, or delete just by looking at this timestamp.

Hudi can keep older versions of the data only if the cleaner config is set to retain that many versions. Suppose a record changes 10 times; this creates 10 versions of that record. Users can look at the historical values of the data only if the cleaner config ensures that all 10 versions are kept. Hudi's incremental pull feature is not designed to provide a change log of all 10 values; it is meant to provide the latest state of each record that changed since the last time you pulled incrementally.
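To make the "latest state, not a change log" point concrete, here is a toy model in plain Python. It is not Hudi code: `Version`, `latest_snapshot`, and `incremental_pull` are hypothetical names, and `commit_time` is only a stand-in for `_hoodie_commit_time`. It assumes the cleaner has left just the newest version of each record available.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    key: str
    value: int
    commit_time: str  # stand-in for _hoodie_commit_time (fixed-width string)


def latest_snapshot(versions):
    """Keep only the newest version of each record key."""
    latest = {}
    for v in sorted(versions, key=lambda v: v.commit_time):
        latest[v.key] = v  # later commits overwrite earlier ones
    return latest


def incremental_pull(versions, begin_time):
    """Latest state of every record whose newest commit is after begin_time."""
    snapshot = latest_snapshot(versions)
    return {k: v for k, v in snapshot.items() if v.commit_time > begin_time}


# Record "a" is rewritten in 10 commits; record "b" is written once, in the first commit.
history = [Version("a", i, f"2021050100{i:02d}") for i in range(1, 11)]
history.append(Version("b", 100, "202105010001"))

# Pull everything that changed after the fifth commit.
changed = incremental_pull(history, begin_time="202105010005")
```

In this model `changed` holds a single entry for "a" carrying its tenth value. The intermediate values six through nine are not returned, and "b" is skipped because it has not changed since the begin time.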

Yes, hoodie.cleaner.fileversions.retained and hoodie.cleaner.commits.retained are independent of each other and belong to different cleaning policies. This blog from @pratyakshsharma is very helpful for understanding this: https://github.com/apache/hudi/pull/2967
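As a rough sketch of how the two settings might be kept apart in a PySpark job, the helpers below build plain option dictionaries. The helper names are hypothetical. The pairing of each setting with a `hoodie.cleaner.policy` value and the incremental read keys reflect my understanding of the Hudi configuration reference, so check them against the docs for your Hudi version.

```python
# Assumed pairing of cleaning policy to the retention setting it consults.
RETENTION_KEY_BY_POLICY = {
    "KEEP_LATEST_COMMITS": "hoodie.cleaner.commits.retained",
    "KEEP_LATEST_FILE_VERSIONS": "hoodie.cleaner.fileversions.retained",
}


def cleaner_options(policy, retained):
    """Build writer options for one cleaning policy (hypothetical helper)."""
    if policy not in RETENTION_KEY_BY_POLICY:
        raise ValueError(f"unsupported cleaning policy: {policy}")
    return {
        "hoodie.cleaner.policy": policy,
        RETENTION_KEY_BY_POLICY[policy]: str(retained),
    }


def incremental_read_options(begin_instant):
    """Build reader options for an incremental query (hypothetical helper)."""
    return {
        "hoodie.datasource.query.type": "incremental",
        "hoodie.datasource.read.begin.instanttime": begin_instant,
    }


# Retain a single file version per file group, as in the question.
write_opts = cleaner_options("KEEP_LATEST_FILE_VERSIONS", 1)
# Read everything committed after a previously recorded instant.
read_opts = incremental_read_options("20210501000000")
```

With a Spark session and the Hudi bundle on the classpath, dictionaries like these would typically be passed as `df.write.format("hudi").options(**write_opts)` and `spark.read.format("hudi").options(**read_opts).load(base_path)`, where `base_path` is a placeholder for the table location.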

Please feel free to re-open if you have any other questions.

ChandraNarreddy commented, May 19, 2021

@n3nash, would you be able to address my follow-up questions? Thanks in advance.
