Using AdBlock rules to remove elements


AdBlock Plus element hiding rules use CSS selectors to identify the elements to remove. This is easy to implement in lxml, although it runs somewhat slowly.

I’m using this in my own code to automatically remove social media share links from pages. You may want to consider including something similar in python-readability.

EasyList is dual licensed Creative Commons Attribution-ShareAlike 3.0 Unported and GNU General Public License version 3. CC-BY-SA looks compatible with Apache licensed projects.

Example

First download the rules:

$ wget https://easylist-downloads.adblockplus.org/fanboy-annoyance.txt

Then you can simply extract the CSS selectors to match against a document tree.

from lxml import html
from lxml.cssselect import CSSSelector

RULES_PATH = 'fanboy-annoyance.txt'
with open(RULES_PATH, 'r') as f:
    lines = f.read().splitlines()

# get elemhide rules (prefixed by ##) and create a CSSSelector for each of them
rules = [CSSSelector(line[2:]) for line in lines if line[:2] == '##']

def remove_ads(tree):
    for rule in rules:
        for matched in rule(tree):
            matched.getparent().remove(matched)

doc = html.document_fromstring("<html>...</html>")
remove_ads(doc)
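
The `line[:2] == '##'` test above keeps only generic element hiding rules. As a minimal sketch of that filtering step on its own, using only the standard library: the helper name `extract_generic_selectors` and the sample lines are made up for illustration and do not come from EasyList. Domain-specific rules (`example.com##...`), exception rules (`#@#`), comments (`!`) and network filters do not start with `##`, so they are skipped.

```python
def extract_generic_selectors(lines):
    """Return the CSS selectors of generic element hiding rules.

    Generic rules start with '##'. Anything else (comments, network
    filters, domain-specific rules, '#@#' exceptions) is ignored.
    """
    selectors = []
    for line in lines:
        line = line.strip()  # drop trailing newlines and stray whitespace
        if line.startswith('##'):
            selector = line[2:]
            if selector:  # ignore a bare '##' with no selector
                selectors.append(selector)
    return selectors


# Hypothetical sample lines in Adblock Plus filter syntax.
sample = [
    '[Adblock Plus 2.0]',
    '! Title: sample list',
    '##.share-buttons',
    '###social-bar\n',
    'example.com##.sidebar-ad',
    'example.com#@#.share-buttons',
    '||ads.example.com^',
]

print(extract_generic_selectors(sample))
# → ['.share-buttons', '#social-bar']
```

Skipping the domain-specific rules is a simplification: they would need the page's hostname to be applied correctly.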

Issue Analytics

  • State: open
  • Created 10 years ago
  • Reactions: 3
  • Comments: 12 (2 by maintainers)

Top GitHub Comments

bburky commented, Aug 9, 2017 (5 reactions)

@azhard4int Yes, you can regenerate the CSSSelector objects each time to save memory, but you’re trading performance for lower memory usage. Recreating the CSSSelector objects every time you process a document is very slow.

Instead, how about we just extract the XPath query that cssselect generates for each rule, and join them all together with the XPath | (union) operator? Storing a single large XPath query string isn’t nearly as bad as storing a list of CSSSelector objects. Then we can check every rule in the list against the document in one query and delete each matched element.

Here’s a new implementation that uses that approach. I only did a little testing of it, but it seems to work fine.

import cssselect

class AdRemover(object):
    """
    This class applies elemhide rules from AdBlock Plus to an lxml
    document or element object. One or more AdBlock Plus filter
    subscription files must be provided.

    Example usage:

    >>> import lxml.html
    >>> remover = AdRemover('fanboy-annoyance.txt')
    >>> doc = lxml.html.document_fromstring("<html>...</html>")
    >>> remover.remove_ads(doc)
    """

    def __init__(self, *rules_files):
        if not rules_files:
            raise ValueError("one or more rules_files required")

        translator = cssselect.HTMLTranslator()
        rules = []

        for rules_file in rules_files:
            with open(rules_file, 'r') as f:
                for line in f:
                    # elemhide rules are prefixed by ## in the adblock filter syntax
                    if line[:2] == '##':
                        try:
                            rules.append(translator.css_to_xpath(line[2:]))
                        except cssselect.SelectorError:
                            # just skip bad selectors
                            pass

        # create one large query by joining them with the XPath | (union) operator
        self.xpath_query = '|'.join(rules)


    def remove_ads(self, tree):
        """Remove ads from an lxml document or element object.

        The object passed to this method will be modified in place."""

        for elem in tree.xpath(self.xpath_query):
            elem.getparent().remove(elem)
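
One caveat with `elem.getparent().remove(elem)` in lxml: removing an element this way also removes its tail text, meaning the text that follows the element inside its parent. The sketch below compares it with `drop_tree()`, which `lxml.html` elements provide and which keeps the tail text. The snippet, the stand-in query and the two helper names are hypothetical, and this assumes the tree was parsed with `lxml.html` rather than plain `lxml.etree`.

```python
import lxml.html

# Hypothetical snippet: " world" is the tail text of the <span>.
SNIPPET = '<div><p>Hello <span class="ad">Ad</span> world</p></div>'

# Stand-in for the large joined query built from the filter list.
QUERY = '//span[@class="ad"]'


def remove_with_parent(tree, query):
    """Remove matches via parent.remove(); the tail text is removed too."""
    for elem in tree.xpath(query):
        elem.getparent().remove(elem)


def remove_with_drop_tree(tree, query):
    """Remove matches via drop_tree(); the tail text stays in the tree."""
    for elem in tree.xpath(query):
        elem.drop_tree()


lossy = lxml.html.fromstring(SNIPPET)
remove_with_parent(lossy, QUERY)

kept = lxml.html.fromstring(SNIPPET)
remove_with_drop_tree(kept, QUERY)

print(repr(lossy.text_content()))  # → 'Hello '
print(repr(kept.text_content()))   # → 'Hello  world'
```

Whether the tail text matters depends on the page. For inline share links inside a paragraph it usually does, since the remaining sentence would otherwise be cut short.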

azhard4int commented, Aug 7, 2017 (2 reactions)

Just to give an update for anyone who is using the above method to remove elements matched by AdBlock rules: don’t store a CSSSelector object for every rule in a list, as in

rules = [CSSSelector(line[2:]) for line in lines if line[:2] == '##']

That consumes roughly 200-250 MB of RAM. Instead, you can use the conditional statement inside the loop and create each selector as it is needed, which consumes only 5-10 MB.

Below are the memory usage statistics:


Line #    Mem usage    Increment   Line Contents
================================================
   279     79.8 MiB      0.0 MiB       @profile
   280                                 def test_fanboy_content(self):
   281     79.8 MiB      0.0 MiB           from lxml.cssselect import CSSSelector
   282     79.8 MiB      0.0 MiB           from project.settings import ADBLOCK_RULES_PATH, ALREADY_MADE_RULES
   283     79.8 MiB      0.0 MiB           RULES_PATH = ADBLOCK_RULES_PATH
   284                             
   287     79.8 MiB      0.0 MiB           with open(RULES_PATH, 'r') as f:
   288     81.0 MiB      1.2 MiB               lines = f.read().splitlines()
   289     81.0 MiB      0.0 MiB           f.close()
   290                             
   291                                     # get elemhide rules (prefixed by ##) and create a CSSSelector for each of them
   292    282.0 MiB    201.0 MiB           rules = [CSSSelector(line[2:]) for line in lines if line[:2] == '##']
