
PeriodLoadRule cannot Remove expired segment

See original GitHub issue

Recently, while deploying a hot/cold tiered Druid cluster, I found that hot nodes were loading data outside their configured time range, so their storage filled up quickly. I found the same problem reported on the Druid forum, where it has gone unanswered for a long time. Looking at RunRules.java, I think there is a problem: PeriodLoadRule never drops expired segments at all, it only drops excess replicants. Does the current implementation of PeriodLoadRule meet expectations?

The following is the current implementation in Druid:

// RunRules.run
      // Only the first rule that applies to the segment is run; rules that do
      // not apply are skipped, so a PeriodLoadRule never gets a chance to drop
      // segments that have aged out of its period.
      for (Rule rule : rules) {
        if (rule.appliesTo(segment, now)) {
          if (
              stats.getGlobalStat(
                  "totalNonPrimaryReplicantsLoaded") >= paramsWithReplicationManager.getCoordinatorDynamicConfig()
                                                                                   .getMaxNonPrimaryReplicantsToLoad()
              && !paramsWithReplicationManager.getReplicationManager().isLoadPrimaryReplicantsOnly()
          ) {
            log.info(
                "Maximum number of non-primary replicants [%d] have been loaded for the current RunRules execution. Only loading primary replicants from here on for this coordinator run cycle.",
                paramsWithReplicationManager.getCoordinatorDynamicConfig().getMaxNonPrimaryReplicantsToLoad()
            );
            paramsWithReplicationManager.getReplicationManager().setLoadPrimaryReplicantsOnly(true);
          }
          stats.accumulate(rule.run(coordinator, paramsWithReplicationManager, segment));
          foundMatchingRule = true;
          break;
        }
      }

For now, I have worked around this problem by adding a dropAllExpireSegments method to PeriodLoadRule.java, but I don't know what side effects it might have.

Here is my implementation:

// RunRules.run
      for (Rule rule : rules) {
        if (rule.appliesTo(segment, now)) {
          if (
              stats.getGlobalStat(
                  "totalNonPrimaryReplicantsLoaded") >= paramsWithReplicationManager.getCoordinatorDynamicConfig()
                                                                                   .getMaxNonPrimaryReplicantsToLoad()
              && !paramsWithReplicationManager.getReplicationManager().isLoadPrimaryReplicantsOnly()
          ) {
            log.info(
                "Maximum number of non-primary replicants [%d] have been loaded for the current RunRules execution. Only loading primary replicants from here on for this coordinator run cycle.",
                paramsWithReplicationManager.getCoordinatorDynamicConfig().getMaxNonPrimaryReplicantsToLoad()
            );
            paramsWithReplicationManager.getReplicationManager().setLoadPrimaryReplicantsOnly(true);
          }
          stats.accumulate(rule.run(coordinator, paramsWithReplicationManager, segment));
          foundMatchingRule = true;
          break;
        } else {
          // Added drop logic: only PeriodLoadRule implements dropAllExpireSegments,
          // so only that rule drops segments that have expired out of its period.
          rule.dropAllExpireSegments(paramsWithReplicationManager, segment);
        }
      }
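For context, here is a hypothetical sketch of what such a hook inside PeriodLoadRule might look like. This is not the actual patch from this issue, dropAllExpireSegments is not part of Druid's Rule interface, and the coordinator calls used below (getDruidCluster, ServerHolder, LoadQueuePeon.dropSegment) are assumptions based on the 0.22 codebase and may differ in other versions:

      // Hypothetical sketch only -- illustrative, not actual Druid code.
      public void dropAllExpireSegments(DruidCoordinatorRuntimeParams params, DataSegment segment)
      {
        // Only drop segments this PeriodLoadRule no longer applies to, i.e.
        // segments whose interval has aged out of the configured period.
        if (appliesTo(segment, DateTimes.nowUtc())) {
          return;
        }
        // Ask every server that still holds a replica of this segment to drop it.
        for (ServerHolder holder : params.getDruidCluster().getAllServers()) {
          if (holder.getServer().getSegment(segment.getId()) != null) {
            holder.getPeon().dropSegment(segment, null);
          }
        }
      }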

Affected Version

0.22.0

Issue Analytics

  • State: open
  • Created: a year ago
  • Comments: 14 (14 by maintainers)

Top GitHub Comments

2 reactions
kfaraz commented, Sep 16, 2022

@599166320, drop is handled by DropRules such as ForeverDropRule; none of the LoadRules are supposed to have that capability. You typically specify a list of rules for each datasource. The coordinator tries to find the first rule which applies to a given segment at a given time and tries to do what that matched rule suggests. If at any point in the lifetime of a segment it matches a DropRule, it gets dropped.

So, I think the problem you are facing can be solved by simply having a ForeverDropRule at the end of your retention rule list (default or datasource-specific). Please let us know if this works for you.
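As an illustration (not from the original thread), a retention rule list along the lines kfaraz suggests might look like the following, where the period and tier name are placeholder values:

      [
        {
          "type": "loadByPeriod",
          "period": "P30D",
          "includeFuture": true,
          "tieredReplicants": { "hot": 1 }
        },
        { "type": "dropForever" }
      ]

Segments whose interval falls inside the trailing 30-day window match the loadByPeriod rule and stay on the hot tier; anything older falls through to dropForever and is removed from the cluster (the segments remain in deep storage unless a kill task deletes them).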

1 reaction
599166320 commented, Sep 19, 2022

@kfaraz The cold tier of our cluster has enough historicals. The current problem is that we don't want the _default_tier's storage to fill up too fast.

I will create a PR and let you review it.
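As an editorial illustration of the setup described here (tier names, periods, and replicant counts are placeholders): if dropping data from the cluster is not acceptable because the cold tier has enough capacity, the hot/cold split is usually expressed with tiered load rules instead of a drop rule, e.g.:

      [
        {
          "type": "loadByPeriod",
          "period": "P7D",
          "includeFuture": true,
          "tieredReplicants": { "hot": 1, "cold": 1 }
        },
        {
          "type": "loadForever",
          "tieredReplicants": { "cold": 2 }
        }
      ]

With this chain, segments older than seven days only match the loadForever rule, so the coordinator is expected to keep them on the cold tier and drop the extra hot-tier replica; the behavior reported in this issue is that this hot-tier copy is not actually being removed.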

