[SUPPORT] The cleaning strategy breaks the reader view completeness
Currently we have several cleaning strategies, such as num_commits, delta hours, and num_versions.
Let's say a user uses the num_commits strategy with these params:
- max 10 commits to archive
- min 4 commits to keep alive
- 6 commits to clean
c1 ---- c2 ---- c3 ---- c4 ---- c5 ---- c6 ---- c7 ---- c8 ---- c9 ---- c10
At c10, the reader starts reading the latest fs view, which contains a file slice that was written in c1:
+-- fg1_c1.parquet
The cleaner also starts working at c10. It finds that the number of commits exceeds 6 (10 > 6), so all the files that were committed in c1 ~ c4 are deleted, and the reader throws FileNotFoundException.
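The scenario as described above can be sketched with a toy model. This is only an illustration, not Hudi's cleaner code: `files_to_clean`, `COMMITS_RETAINED`, and the in-memory `storage` set are hypothetical names, and the sketch assumes that "6 commits to clean" means the cleaner keeps the last 6 commits and removes every file written by older ones.

```python
# Toy model of the scenario as described; not Hudi's actual cleaner code.
COMMITS_RETAINED = 6  # assumed meaning of "6 commits to clean"


def files_to_clean(timeline, files_by_commit, retained=COMMITS_RETAINED):
    """Return files written by commits older than the last `retained` commits."""
    expired = timeline[:-retained] if len(timeline) > retained else []
    return {f for c in expired for f in files_by_commit.get(c, [])}


timeline = [f"c{i}" for i in range(1, 11)]      # c1 .. c10
files_by_commit = {"c1": ["fg1_c1.parquet"]}    # file slice written in c1
storage = {"fg1_c1.parquet"}                    # files that exist on storage

# The reader resolves its file system view at c10 ...
reader_view = ["fg1_c1.parquet"]

# ... and the cleaner runs concurrently, expiring c1 ~ c4.
storage -= files_to_clean(timeline, files_by_commit)

# Any file the reader still needs but cannot find would surface
# as a FileNotFoundException in a real job.
missing = [f for f in reader_view if f not in storage]
print(missing)
```

In this model the reader's view is resolved before the cleaner runs, so nothing prevents the cleaner from removing a file the reader has not opened yet.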
This problem is common and occurs frequently, especially in streaming read mode. It also happens if a batch read job is complex and runs for a long time.
We need some mechanisms to ensure the semantic integrity of the read view.
Issue Analytics
- Created 2 years ago
- Comments: 18 (18 by maintainers)
Top GitHub Comments
My description was not that accurate. Because we use Snapshot Isolation, the problem is this: the reader starts reading the C1 file slice s1, which is the latest in the file group at C9. A subsequent C10 then modifies the C1 file slice and the cleaner starts working, so s1 would be cleaned.

@danny0405: going back to your original example in the description. If a file slice was written in C1 and never updated in any of the future commits, then at C10 or C11, even if the cleaner detects that all data files pertaining to C1 to C4 need to be deleted, the latest file slice in C1 will never be touched. The cleaner always ensures that the latest file slice of any file group is never cleaned up. So, can you help us understand why we might see a FileNotFoundException?
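The two situations discussed in these comments can be contrasted with a small toy model. It is a sketch under stated assumptions, not Hudi's implementation: the `clean` function, the `retained=6` setting, and the file name `fg1_c10.parquet` are hypothetical. The only rule taken from the thread is that the latest file slice of each file group is never cleaned.

```python
def clean(file_groups, timeline, retained):
    """Toy cleaner: delete slices written by expired commits, but always
    keep the latest slice of each file group. Returns the deleted files."""
    expired = set(timeline[:-retained]) if len(timeline) > retained else set()
    deleted = set()
    for slices in file_groups.values():    # slices: (commit, file), oldest first
        for commit, name in slices[:-1]:   # the latest slice is never considered
            if commit in expired:
                deleted.add(name)
    return deleted


timeline = [f"c{i}" for i in range(1, 11)]  # c1 .. c10

# Case 1: fg1 was written in c1 and never updated, so its slice is still
# the latest one and survives cleaning.
untouched = {"fg1": [("c1", "fg1_c1.parquet")]}
deleted_case1 = clean(untouched, timeline, retained=6)

# Case 2: a reader pinned fg1_c1.parquet at c9, then c10 rewrote fg1
# (hypothetical file name). The pinned slice is no longer the latest.
rewritten = {"fg1": [("c1", "fg1_c1.parquet"), ("c10", "fg1_c10.parquet")]}
reader_pinned = {"fg1_c1.parquet"}
deleted_case2 = clean(rewritten, timeline, retained=6)
reader_broken = bool(reader_pinned & deleted_case2)

print(deleted_case1, deleted_case2, reader_broken)
```

In the first case nothing is deleted, which matches the maintainer's point. In the second case the slice pinned by the reader becomes eligible for cleaning as soon as a newer slice exists, which is the race the reporter describes.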