Using AdBlock rules to remove elements

AdBlock Plus element hiding rules identify the page elements to hide, and each rule is written as a CSS selector. Applying them with lxml is easy to implement, if somewhat slow.

I’m using this in my own code to automatically remove social media share links from pages. You may want to consider including something similar in python-readability.

EasyList is dual licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported license and the GNU General Public License version 3. CC-BY-SA looks compatible with Apache licensed projects.

Example

First download the rules:

$ wget https://easylist-downloads.adblockplus.org/fanboy-annoyance.txt

Then you can simply extract the CSS selectors to match against a document tree.

from lxml import html
from lxml.cssselect import CSSSelector

RULES_PATH = 'fanboy-annoyance.txt'
with open(RULES_PATH, 'r') as f:
    lines = f.read().splitlines()

# get elemhide rules (prefixed by ##) and create a CSSSelector for each of them
rules = [CSSSelector(line[2:]) for line in lines if line[:2] == '##']

def remove_ads(tree):
    for rule in rules:
        for matched in rule(tree):
            matched.getparent().remove(matched)

doc = html.document_fromstring("<html>...</html>")
remove_ads(doc)
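
The selector extraction step can also be looked at on its own. The following is a minimal sketch using only the standard library; the helper name `iter_generic_selectors` and the sample lines are made up for illustration and are not part of the original code. Like the example above, it keeps only lines that start with `##`, so comments (which start with `!`), blocking rules, and domain-specific rules such as `example.com##.ad` are skipped.

```python
def iter_generic_selectors(lines):
    """Yield the CSS selector of each generic element hiding rule.

    Only lines that start with ``##`` are used, as in the example above.
    Comments, blocking rules, and domain-specific rules do not start
    with ``##`` and are therefore skipped.
    """
    for line in lines:
        line = line.strip()
        if line.startswith('##'):
            selector = line[2:].strip()
            if selector:  # ignore a bare "##" with nothing after it
                yield selector


# Hypothetical sample lines, written for this illustration only.
sample = [
    '[Adblock Plus 2.0]',
    '! Title: sample list',
    '##.share-buttons',
    'example.com##.sidebar-ad',
    '||ads.example.com^',
    '###social-bar',
]

selectors = list(iter_generic_selectors(sample))
print(selectors)  # ['.share-buttons', '#social-bar']
```

Because it is a generator, the helper can be fed an open file object directly, so the whole rules file does not have to be read into memory first.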

Issue Analytics

State:
Created 10 years ago
Reactions: 3
Comments: 12 (2 by maintainers)

Top GitHub Comments

5 reactions

bburky commented, Aug 9, 2017

@azhard4int Yes, you can regenerate the CSSSelector objects each time to save memory, but you’re trading memory usage for performance. It is very slow to recreate the CSSSelector objects every time you process a document.

Instead, how about we extract the XPath query that cssselect generates for each selector and join them all together with the XPath | (union) operator? Storing a single large XPath query string isn’t nearly as bad as storing a list of CSSSelector objects. Then we can check every rule in the list against the document at once and delete the matched elements.

Here’s a new implementation that uses that approach. I only did a little testing of it, but it seems to work fine.

import cssselect

class AdRemover(object):
    """
    This class applies elemhide rules from AdBlock Plus to an lxml
    document or element object. One or more AdBlock Plus filter
    subscription files must be provided.

    Example usage:

    >>> import lxml.html
    >>> remover = AdRemover('fanboy-annoyance.txt')
    >>> doc = lxml.html.document_fromstring("<html>...</html>")
    >>> remover.remove_ads(doc)
    """

    def __init__(self, *rules_files):
        if not rules_files:
            raise ValueError("one or more rules_files required")

        translator = cssselect.HTMLTranslator()
        rules = []

        for rules_file in rules_files:
            with open(rules_file, 'r') as f:
                for line in f:
                    # elemhide rules are prefixed by ## in the adblock filter syntax
                    if line[:2] == '##':
                        try:
                            rules.append(translator.css_to_xpath(line[2:]))
                        except cssselect.SelectorError:
                            # just skip bad selectors
                            pass

        # create one large query by joining them with the XPath | (union) operator
        self.xpath_query = '|'.join(rules)


    def remove_ads(self, tree):
        """Remove ads from an lxml document or element object.

        The object passed to this method will be modified in place."""

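        # an empty query (no usable rules) is not valid XPath, so do nothing
        if not self.xpath_query:
            return

        for elem in tree.xpath(self.xpath_query):
            elem.getparent().remove(elem)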

2 reactions

azhard4int commented, Aug 7, 2017

An update for anyone using the method above to remove elements matched by AdBlock rules: don’t store the compiled CSS selectors in a list, as in:

rules = [CSSSelector(line[2:]) for line in lines if line[:2] == '##']

That list consumes roughly 200-250 MB of RAM. Instead, you can use the conditional statement inside the loop and compile each selector only when it is needed, which consumes only 5-10 MB.

Below are the memory usage statistics:


Line #    Mem usage    Increment   Line Contents
================================================
   279     79.8 MiB      0.0 MiB       @profile
   280                                 def test_fanboy_content(self):
   281     79.8 MiB      0.0 MiB           from lxml.cssselect import CSSSelector
   282     79.8 MiB      0.0 MiB           from project.settings import ADBLOCK_RULES_PATH, ALREADY_MADE_RULES
   283     79.8 MiB      0.0 MiB           RULES_PATH = ADBLOCK_RULES_PATH
   284                             
   287     79.8 MiB      0.0 MiB           with open(RULES_PATH, 'r') as f:
   288     81.0 MiB      1.2 MiB               lines = f.read().splitlines()
   289     81.0 MiB      0.0 MiB           f.close()
   290                             
   291                                     # get elemhide rules (prefixed by ##) and create a CSSSelector for each of them
   292    282.0 MiB    201.0 MiB           rules = [CSSSelector(line[2:]) for line in lines if line[:2] == '##']