[SUPPORT] The cleaning strategy breaks the reader view completeness

Currently we have several cleaning strategies, such as num_commits, delta hours, and num_versions. Let’s say the user picks the num_commits strategy.

And it uses these params:

  • max 10 commits to archive
  • min 4 commits to keep alive
  • 6 commits to clean

c1 ---- c2 ---- c3 ---- c4 ---- c5 ---- c6 ---- c7 ---- c8 ---- c9 ---- c10

At c10, the reader starts reading the latest fs view with a file slice that was written in c1,

/+ — fg1_c1.parquet

The cleaner also starts working at c10. It finds that the number of commits exceeds 6 (10 > 6), so all the files committed in c1 ~ c4 are deleted, and the reader throws FileNotFoundException.
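The arithmetic behind "c1 ~ c4" can be sketched in a few lines. This is only an illustration of the retention window described above, not Hudi's cleaner code; the function name `commits_outside_retention` and the list-of-strings timeline are made up for the example.

```python
def commits_outside_retention(timeline, retained):
    """Return the commits that are older than the newest `retained` commits.

    `timeline` is ordered oldest to newest. Hypothetical helper, for
    illustration only.
    """
    if retained <= 0:
        raise ValueError("retained must be positive")
    if len(timeline) <= retained:
        return []
    return timeline[:-retained]


# The timeline from the issue: c1 .. c10, with 6 commits retained.
timeline = [f"c{i}" for i in range(1, 11)]
expired = commits_outside_retention(timeline, retained=6)
print(expired)  # ['c1', 'c2', 'c3', 'c4']
```

With 10 commits on the timeline and 6 retained, the four oldest commits fall outside the window, which is why a file written in c1 becomes a cleaning candidate at c10.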

This problem is common and occurs frequently, especially in streaming read mode. It also happens when a batch read job is complex and runs for a long time.

We need some mechanisms to ensure the semantic integrity of the read view.

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 18 (18 by maintainers)

Top GitHub Comments

1 reaction
danny0405 commented, May 10, 2022

@danny0405: going back to your original example in the description. If a file slice was written in C1 and never updated in any of the future commits, then at C10 or C11, even if the cleaner detects that all data files pertaining to C1 to C4 need to be deleted, the latest file slice in C1 will never be touched. The cleaner always ensures that the latest file slice of any file group is never cleaned up. So, can you help me understand why we might see FileNotFoundIssue?

My description was not that accurate. Because we use Snapshot Isolation, the scenario is this: the reader starts reading file slice s1, written in C1, while s1 is still the latest slice in its file group at C9. A subsequent commit C10 then modifies that file group, so s1 is no longer the latest slice, and when the cleaner starts working, s1 would be cleaned.
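The race described in this reply can be reproduced with a toy model. The sketch below is not Hudi code: `FileGroup`, `clean`, `read`, and `run_scenario` are hypothetical names, storage is a plain Python set, and the cleaner is simplified to "keep only the newest `keep_versions` slices per file group", which stands in for whichever retention policy is configured. The one rule it takes from the discussion is that the latest slice of a file group is never removed.

```python
class FileGroup:
    """Toy file group: an ordered list of (commit, file_name) slices."""

    def __init__(self):
        self.slices = []

    def write(self, commit, file_name, storage):
        # A commit that touches the group adds a new, now-latest slice.
        self.slices.append((commit, file_name))
        storage.add(file_name)

    def latest(self):
        return self.slices[-1]


def clean(group, storage, keep_versions=1):
    """Delete all but the newest `keep_versions` slices (assumed policy).

    The latest slice always survives. Returns the deleted file names.
    """
    keep_versions = max(keep_versions, 1)
    stale = group.slices[:-keep_versions]
    group.slices = group.slices[-keep_versions:]
    for _, file_name in stale:
        storage.discard(file_name)
    return [file_name for _, file_name in stale]


def read(file_name, storage):
    """Simulate opening a data file that may have been cleaned."""
    if file_name not in storage:
        raise FileNotFoundError(file_name)
    return f"rows from {file_name}"


def run_scenario():
    storage = set()
    fg1 = FileGroup()
    fg1.write("c1", "fg1_c1.parquet", storage)    # slice s1

    # At c9 the reader plans its scan and pins the latest slice it sees.
    _, pinned = fg1.latest()

    fg1.write("c10", "fg1_c10.parquet", storage)  # c10 modifies the group
    removed = clean(fg1, storage)                  # cleaner runs after c10

    # The reader only now opens the file it pinned at c9.
    try:
        read(pinned, storage)
        outcome = "ok"
    except FileNotFoundError:
        outcome = "FileNotFoundError"
    return pinned, removed, outcome


print(run_scenario())
```

In this model the cleaner behaves correctly by its own rule, since s1 is no longer the latest slice once c10 lands. The failure comes from the reader holding a snapshot that the cleaner knows nothing about.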

1 reaction
nsivabalan commented, May 9, 2022

@danny0405: going back to your original example in the description. If a file slice was written in C1 and never updated in any of the future commits, then at C10 or C11, even if the cleaner detects that all data files pertaining to C1 to C4 need to be deleted, the latest file slice in C1 will never be touched. The cleaner always ensures that the latest file slice of any file group is never cleaned up. So, can you help me understand why we might see FileNotFoundIssue?
