
Do we need to change our blacklisting guidelines?


When we wrote our blacklisting guidelines in October last year, we set the following requirements:

  • Website has been used in at least 2 confirmed instances of spam (reports with tp feedback). You can use https://metasmoke.erwaysoftware.com/search to find other instances of a website being used for spam; a programmatic sketch of the same check follows this list.
  • Website is not used legitimately in other posts on Stack Exchange.
  • Website is not currently caught in any of these filters:
    • bad keyword in body
    • blacklisted website
    • pattern matching website
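
For the first check, one could also query metasmoke programmatically rather than through the search page. Below is a minimal sketch using requests; the endpoint path and parameter names are assumptions for illustration only, not metasmoke’s documented API, so consult https://metasmoke.erwaysoftware.com/api for the real interface.

```python
import requests

def count_tp_hits(site: str, api_key: str) -> int:
    """Count confirmed (tp-feedback) reports mentioning `site` in metasmoke.

    NOTE: the endpoint and parameters below are hypothetical placeholders;
    check the metasmoke API docs for the actual search interface.
    """
    resp = requests.get(
        "https://metasmoke.erwaysoftware.com/api/v2.0/posts/search",
        params={"urls": site, "feedback": "tp", "key": api_key},
        timeout=30,
    )
    resp.raise_for_status()
    return len(resp.json().get("items", []))

# A site would pass the first guideline if count_tp_hits(...) >= 2.
```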

Circumstances have changed since then, and the blacklists have grown. With the addition of the !!/blacklist-* commands, over 830 more websites/keywords/usernames have been added to our blacklists; 106 (!!!) were added in the last five days alone. Many of these websites are already caught by one or two of the reasons specified above.

Considering this, I think we need to discuss whether these guidelines should be changed to reflect how we are (and should be) using blacklists now. What should our new guidelines be?

  • Do we want to blacklist every spammy site we see, or reserve blacklisting for extreme circumstances?
  • Should we instead focus our time on improving our pattern-matching-* reasons?
  • Should the average autoflag weight of matched posts factor into this?
  • Should manually reported posts, or posts caught by only one reason, be given extra weight when assessing the need for a blacklist?
  • Are our current guidelines just fine, and do we just need to enforce them more?

Other things we should think about:

  • If we are going to blacklist everything, do we want to automate it somehow?
  • What sort of performance hit does blacklisting incur? (I think Art ran some stats on this a while ago; maybe they need to be re-run with the updated codebase.)
  • How much do dormant blacklists clutter the list? Do we need to think about code readability?
  • Should blacklist entries be removed if they don’t have any hits after a certain time? (A sketch of such a sweep follows this list.)
  • What does the !!/watch-keyword command have to do with this? Should it follow similar guidelines? Should it have separate ones? Do we need to change the way that it is implemented, to give it 0 weight or not send reports to MS?
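
On the expiration question above, here is a minimal sketch of what a periodic sweep could look like, assuming per-entry last-hit dates were available (the last_hit mapping is hypothetical; as discussed in the comments below, metasmoke does not yet record which entry a match came from):

```python
from datetime import datetime, timedelta

def expire_dormant(entries: list[str],
                   last_hit: dict[str, datetime],
                   now: datetime,
                   max_age: timedelta = timedelta(days=365)) -> list[str]:
    """Keep only blacklist entries that have matched a post within max_age.

    Entries with no recorded hit at all are dropped too; the one-year
    cutoff is an arbitrary placeholder, not a proposed policy.
    """
    return [entry for entry in entries
            if entry in last_hit and now - last_hit[entry] <= max_age]
```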

What does everyone think about this?


Top GitHub Comments

5 reactions
tripleee commented, Jun 21, 2017

Here, now, is my attempt at a final proposal. I have not received any feedback on the limits, so I am leaving them at the proposed numbers.

This is basically identical to the proposal from a month ago, with the amendment to allow for blacklisting domain names with substantial evidence from more than 6 months back but few recent hits. Also, the keyword blacklisting requirement is now at least two hits.

The bullet points marked Rationale: are background material, and I don’t think they need to be included in the eventual wiki documentation.

  • watched_keywords – anything is fair game, but be prepared to have it removed if circumstances require it.
    • We will be removing patterns periodically; you can reduce the risk of having useful patterns removed by proactively removing patterns you are no longer interested in, or which produce very uncertain value.
      • Rationale: We want to prevent the list from growing indefinitely. Eventually, there should be automated expiration for patterns with no hits or otherwise low value.
    • Autoflagging weight for this reason is technically forced to stay at 1.
      • Rationale: The watch list is for guiding our analysis, not for necessarily identifying spam. A small weight helps prevent autoflagging false positives.
    • Smoke Detector will regard these rules as “experimental”; it will not alert in rooms other than Charcoal HQ if there are hits solely from this set of rules.
      • Rationale: Again, the analysis and development work happens in the Charcoal room, and alerts are only useful there.
  • blacklisted_websites.txt – reserved for sites which we are highly confident are used only in spam. You may add a site to this list if one of the following is true (a sketch codifying these checks follows this list).
    • The site has at least five hits in Metasmoke, with no false positives, and at least one of them is below the default autoflagging threshold (currently 280) and no older than six months.
      • Rationale: We want to avoid bloating the blacklist with transient domains which pop up, run a campaign with a handful of spam posts, and then quickly disappear forever.
    • There are more than twenty hits in the last six months, and no false positives.
      • Rationale: With this amount of spam, the domain is arguably not a quick whack-a-mole one-off. Indeed, getting rid of this type of spam is one of the central goals of the Charcoal project.
    • There are recent hits, and more than 30 hits overall, and no false positives.
      • Rationale: With this amount of spam, the domain is clearly a fixture, and it might be making a return after having been dormant. Again, this type of spam should not require any significant human intervention, and explicitly blacklisting the domain helps avoid spending analysis time on this already well-established fact.
  • bad_keywords.txt – reserved for phrases which we are highly confident are used only in spam. You may add a phrase to this list if the following is true.
    • The phrase has been used repeatedly in recent spam and has no false positives in Metasmoke, and searching on Stack Exchange indicates that it is not a common phrase on any site in the network.
      • Rationale: While this is more relaxed than the other rules, it codifies current practice, and documents what is currently undocumented.
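
The three website criteria above are mechanical enough to codify. Here is a minimal sketch, assuming hit records (timestamp, false-positive flag, autoflag weight) have already been pulled from metasmoke; the Hit shape and should_blacklist name are illustrative, not SmokeDetector’s actual code:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

AUTOFLAG_THRESHOLD = 280           # default autoflagging threshold cited above
SIX_MONTHS = timedelta(days=182)   # "six months", approximated

@dataclass
class Hit:
    """One metasmoke report for a site (illustrative shape, not the real schema)."""
    when: datetime
    is_fp: bool
    weight: int  # autoflag weight of the matched post

def should_blacklist(hits: list[Hit], now: datetime) -> bool:
    """Apply the three proposed criteria for blacklisted_websites.txt."""
    if any(h.is_fp for h in hits):
        return False  # any false positive disqualifies the site outright
    recent = [h for h in hits if now - h.when <= SIX_MONTHS]
    # 1. At least five hits, one of them recent and below the autoflag threshold.
    if len(hits) >= 5 and any(h.weight < AUTOFLAG_THRESHOLD for h in recent):
        return True
    # 2. More than twenty hits in the last six months.
    if len(recent) > 20:
        return True
    # 3. Recent hits, and more than thirty hits overall.
    if recent and len(hits) > 30:
        return True
    return False
```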
1 reaction
Undo1 commented, May 22, 2017

We used to have a tiny bit of this enforced (or at least automatically checked) by metasmoke, but that broke at some point and I haven’t fixed it. I’m skeptical as always of any arguments involving blacklist speed, as I haven’t seen data saying they’re a Real Issue (but we do have data saying they aren’t).

A good step (possibly a prerequisite) in automating this would be the ability to say which line / blacklist item a match came from, and store that in metasmoke. That’d make higher-order heuristics much more feasible.
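
As a sketch of that prerequisite: if the blacklist is held as a list of pattern strings, the matcher only needs to report which line fired, and that pair could then be sent along with the report for metasmoke to aggregate per entry. The function below is illustrative, not SmokeDetector’s actual matching code (which has its own compilation and caching):

```python
import re

def find_blacklist_match(post_body: str, blacklist_lines: list[str]):
    """Return (line_number, pattern) for the first blacklist entry matching
    the post body, or None. Line numbers enable per-entry hit statistics."""
    for lineno, pattern in enumerate(blacklist_lines, start=1):
        if re.search(pattern, post_body, re.IGNORECASE):
            return lineno, pattern
    return None
```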

I agree with Andy on everything else. We need to get some data-driven processes around this, or we’re going to bikeshed over every blacklist.

