custom xpath support

Could custom XPath functions be added here? For example:

# Original Source: https://gist.github.com/shirk3y/458224083ce5464627bc
from lxml import etree

CLASS_EXPR = "contains(concat(' ', normalize-space(@class), ' '), ' {} ')"

def has_class(context, *classes):
    """
    This lxml extension makes it easier to select elements by CSS class
    >>> ns = etree.FunctionNamespace(None)
    >>> ns['has-class'] = has_class
    >>> root = etree.XML('''
    ... <a>
    ...     <b class="one first text">I</b>
    ...     <b class="two text">LOVE</b>
    ...     <b class="three text">CSS</b>
    ... </a>
    ... ''')
    >>> len(root.xpath('//b[has-class("text")]'))
    3
    >>> len(root.xpath('//b[has-class("one")]'))
    1
    >>> len(root.xpath('//b[has-class("text", "first")]'))
    1
    >>> len(root.xpath('//b[not(has-class("first"))]'))
    2
    >>> len(root.xpath('//b[has-class("not-exists")]'))
    0
    """

    expressions = ' and '.join([CLASS_EXPR.format(c) for c in classes])
    xpath = 'self::*[@class and {}]'.format(expressions)
    return bool(context.context_node.xpath(xpath))

I think defining custom XPath functions is common practice across projects.

kmike commented, Jun 29, 2018

I’m closing this ticket, as parsel now has a built-in has-class XPath function and provides a simplified way to register custom XPath functions (via parsel.xpathfuncs.set_xpathfunc). See http://parsel.readthedocs.io/en/latest/usage.html#other-xpath-extensions.

redapple commented, Nov 14, 2016

So I looked at this today and wanted to “benchmark” different implementations.

I compared:

- CSS selector to XPath (cssselect-based)
- the above proposal for has-class using an XPath call within the Python function (https://github.com/scrapy/parsel/issues/13#issue-100686360)
- another implementation using set comparisons (https://github.com/scrapy/scrapy/issues/753#issuecomment-51502883)
- another one using set but for 1 class only (when it makes sense)
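The set-comparison idea can be tried in isolation, without lxml. The sketch below is a stand-alone assumption about the intended semantics (every requested class must be present); has_classes is a hypothetical name, and it takes the raw class attribute string instead of an XPath context. Note the non-strict `<=`: a strict `<` would reject an element whose classes are exactly the requested ones.

```python
def has_classes(class_attr, *classes):
    """Return True if every name in `classes` appears in the class attribute."""
    if not class_attr:
        # A missing or empty class attribute never matches.
        return False
    # Non-strict subset test, so an exact match also counts.
    return set(classes) <= set(class_attr.split())


print(has_classes("story theme-summary", "story"))
print(has_classes("story", "story"))
print(has_classes("story", "ad"))
print(has_classes(None, "story"))
```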

I used this script with the homepage of the New York Times, directly on the lxml-parsed document:

import timeit

"""
Looking at New York Times homepage:

Common classes:
[
('story', 147),
('story-heading', 145),
('theme-summary', 64),
('column', 58),
('collection', 50),
('section-heading', 49),
('story-link', 39),
('icon', 38),
('thumb', 35),
('ad', 33)
]

Many
<article class="story" ...
or
<article class="story theme-summary ...
"""

SETUP = '''
import lxml.etree
import lxml.html

CLASS_EXPR = "contains(concat(' ', normalize-space(@class), ' '), ' {} ')"

def has_class(context, *classes):
    """
    This lxml extension makes it easier to select elements by CSS class
    >>> ns = etree.FunctionNamespace(None)
    >>> ns['has-class'] = has_class
    """
    expressions = ' and '.join([CLASS_EXPR.format(c) for c in classes])
    xpath = 'self::*[@class and {}]'.format(expressions)
    return bool(context.context_node.xpath(xpath))

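def has_class_set(context, *classes):
    class_attr = context.context_node.get("class")
    if class_attr:
        # Non-strict subset: an element whose classes equal the requested set must match too.
        return set(classes) <= set(class_attr.split())
    return False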

def has_one_class(context, cls):
    return cls in context.context_node.get("class", "").split()

ns = lxml.etree.FunctionNamespace(None)
ns['has-class'] = has_class
ns['has-class-set'] = has_class_set
ns['has-one-class'] = has_one_class

url = 'http://www.nytimes.com/'
body = open('nytimes.html', 'rb').read()

doc = lxml.html.fromstring(body)

'''

N = 100

def _t(stmt, setup=SETUP, number=N, ref=None):
    v = timeit.timeit(stmt, setup, number=number)
    rel = 1.0 if ref is None else (v / ref)
    print('%-70s %6.3f %6.3f' % (stmt, v, rel))
    return v


ref = _t('doc.cssselect(".story")')
_t('doc.xpath("//*[has-class(\'story\')]")', ref=ref)
_t('doc.xpath("//*[has-class-set(\'story\')]")', ref=ref)
_t('doc.xpath("//*[has-one-class(\'story\')]")', ref=ref)
print("\n")

ref = _t('doc.cssselect("article.story")')
_t('doc.xpath("//article[has-class(\'story\')]")', ref=ref)
_t('doc.xpath("//article[has-class-set(\'story\')]")', ref=ref)
print("\n")

ref = _t('doc.cssselect("article.theme-summary.story")')
_t('doc.xpath("//article[has-class(\'theme-summary\', \'story\')]")', ref=ref)
_t('doc.xpath("//article[has-class-set(\'theme-summary\', \'story\')]")', ref=ref)

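And this is what I get (columns: statement, total time in seconds for 100 runs, time relative to the cssselect baseline of each group):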

doc.cssselect(".story")                                                 0.334  1.000
doc.xpath("//*[has-class('story')]")                                    6.667 19.941
doc.xpath("//*[has-class-set('story')]")                                1.036  3.097
doc.xpath("//*[has-one-class('story')]")                                0.931  2.785


doc.cssselect("article.story")                                          0.065  1.000
doc.xpath("//article[has-class('story')]")                              0.630  9.708
doc.xpath("//article[has-class-set('story')]")                          0.125  1.933


doc.cssselect("article.theme-summary.story")                            0.082  1.000
doc.xpath("//article[has-class('theme-summary', 'story')]")             0.698  8.463
doc.xpath("//article[has-class-set('theme-summary', 'story')]")         0.130  1.581

So there always seems to be a non-negligible penalty when using custom XPath functions written in Python. Translating the CSS selector to XPath with cssselect is faster in every case measured here. Can someone else double-check this?