Do we need to change our blacklisting guidelines?
When we wrote our blacklisting guidelines in October last year, we set the following requirements:
- Website has been used in at least 2 confirmed instances (reports with tp feedback) of spam. (You can use https://metasmoke.erwaysoftware.com/search to find other instances of a website being used for spam.)
- Website is not used legitimately in other posts on Stack Exchange.
- Website is not currently caught in any of these filters (see the sketch after this list):
  - bad keyword in body
  - blacklisted website
  - pattern matching website
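For illustration, here is a minimal sketch of how that last requirement might be checked locally before proposing a blacklist. The file names bad_keywords.txt and blacklisted_websites.txt come up later in this discussion; the assumption that each contains one standalone regex per line is mine and may not match the real layout:

```python
import re
from pathlib import Path

# Hypothetical pre-check: is a candidate domain already caught by one of the
# existing filter files? Assumes one regex per line, which may not match the
# actual file format.
FILTER_FILES = ["bad_keywords.txt", "blacklisted_websites.txt"]

def existing_matches(candidate, repo_root=Path(".")):
    """Return (filename, pattern) pairs whose regex already matches `candidate`."""
    hits = []
    for name in FILTER_FILES:
        path = repo_root / name
        if not path.exists():
            continue
        for raw in path.read_text(encoding="utf-8").splitlines():
            pattern = raw.strip()
            if not pattern:
                continue
            try:
                if re.search(pattern, candidate, re.IGNORECASE):
                    hits.append((name, pattern))
            except re.error:
                continue  # skip lines that are not valid standalone regexes
    return hits

print(existing_matches("spammy-example.com"))  # [] means no existing filter catches it
```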
Circumstances have changed since then, and the number of blacklist entries has grown. With the addition of the !!/blacklist-* commands, over 830 more websites/keywords/usernames have been added to our blacklists. In fact, 106 (!!!) of those additions were made in the last five days alone. Many of these websites are already caught by one or two of the reasons specified above.
Considering this, I think we need to discuss whether these guidelines should be changed to reflect how we are (and should be) using blacklists now. What should our new guidelines be?
- Do we want to blacklist every spammy site that we see, or do we want to reserve blacklisting for extreme circumstances?
- Should we instead focus our time on improving our pattern-matching-* reasons?
- Should the average autoflag weight of matched posts factor into this? (See the sketch after this list.)
- Should manually reported posts, or posts caught by only one reason, be given extra weight when assessing the need for a blacklist?
- Are our current guidelines just fine, and do we just need to enforce them more?
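To make the autoflag-weight question concrete: given some export of matched posts per blacklist entry, the average weight per entry is easy to compute. The CSV layout and column names below are purely illustrative, not a real metasmoke export format:

```python
import csv
from collections import defaultdict

# Hypothetical input: one row per (blacklist entry, matched post), with the
# post's autoflag weight. Column names "entry" and "autoflag_weight" are
# illustrative only.
def average_weight_per_entry(csv_path):
    totals = defaultdict(lambda: [0.0, 0])  # entry -> [weight sum, hit count]
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            acc = totals[row["entry"]]
            acc[0] += float(row["autoflag_weight"])
            acc[1] += 1
    return {entry: total / count for entry, (total, count) in totals.items()}
```

One possible reading of such numbers: entries whose matched posts already carry a high average weight are being caught comfortably by other reasons, while entries with low-weight matches are the ones where a blacklist genuinely adds value.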
Other things we should think about:
- If we are going to blacklist everything, do we want to automate it somehow?
- What sort of performance hit does blacklisting incur? (I think Art ran some stats on this a while ago; they may need to be re-run against the updated codebase.)
- How much do dormant blacklist entries clutter the list? Do we need to think about code readability? (See the dormancy sketch after this list.)
- Should blacklist entries be removed if they don’t have any hits after a certain time?
- What does the !!/watch-keyword command have to do with this? Should it follow similar guidelines? Should it have separate ones? Do we need to change the way it is implemented, e.g. to give it 0 weight or not send reports to MS?
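On the dormancy and removal questions in the list above, here is a small sketch of how stale entries could be surfaced for periodic review. It assumes we had a per-entry timestamp of the most recent tp-feedbacked hit; where that data would come from (metasmoke, log scraping, ...) is left open:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical input: blacklist entry -> timezone-aware datetime of its most
# recent tp-feedbacked hit, or None if it has never hit at all.
def dormant_entries(last_hit, max_age_days=180):
    """Return entries with no hits in the last `max_age_days` days."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=max_age_days)
    return sorted(
        entry for entry, hit_time in last_hit.items()
        if hit_time is None or hit_time < cutoff
    )
```

Whether such entries should then be removed automatically or merely listed for a human to prune is exactly the kind of policy question this issue is asking.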
What does everyone think about this?
Top GitHub Comments
Here, now, is my attempt at a final proposal. I have not received any feedback on the limits, so I am leaving them at the proposed numbers.
This is basically identical to the proposal from a month ago, with an amendment to allow blacklisting domain names that have substantial evidence from more than 6 months back but few recent hits. Also, the keyword blacklisting requirement is now at least two hits.
The bullet points marked Rationale: are for background; I don't think they eventually need to be included in the wiki documentation.
- watched_keywords – anything is game, but be prepared to have it removed if circumstances require it.
- blacklisted_websites.txt – reserved for sites which we are highly confident are used only in spam. You may add a site to this list if one of the following is true.
- bad_keywords.txt – reserved for phrases which we are highly confident are used only in spam. You may add a phrase to this list if the following is true.

We used to have a tiny bit of this enforced (or at least automatically checked) by metasmoke, but that broke at some point and I haven't fixed it. I'm skeptical as always of any arguments involving blacklist speed, as I haven't seen data saying they're a Real Issue (but we do have data saying they aren't).
A good step (possibly a prereq) in automating this would be the ability to say which line / blacklist item a match came from, and store that in metasmoke. That’d make higher order heuristics much more possible.
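As a rough illustration of that idea (my own sketch, not SmokeDetector's actual matching code): a matcher could report the exact blacklist line responsible for a hit, and that tuple could then be stored alongside the report.

```python
import re

def first_matching_line(text, blacklist_path):
    """Return (line_number, pattern) for the first blacklist regex that
    matches `text`, or None. Assumes one regex per line; real formats may differ."""
    with open(blacklist_path, encoding="utf-8") as f:
        for line_number, raw in enumerate(f, start=1):
            pattern = raw.strip()
            if not pattern:
                continue
            try:
                if re.search(pattern, text, re.IGNORECASE):
                    return line_number, pattern
            except re.error:
                continue
    return None

# Storing (line_number, pattern) with each report in metasmoke would make
# per-entry hit counts, average weights, and dormancy checks straightforward.
```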
I agree with Andy on everything else. We need to get some data-driven processes around this, or we're going to bikeshed on every blacklist.