custom xpath support

Could custom XPath functions be added here? For example:

# Original Source: https://gist.github.com/shirk3y/458224083ce5464627bc
from lxml import etree

CLASS_EXPR = "contains(concat(' ', normalize-space(@class), ' '), ' {} ')"

def has_class(context, *classes):
    """
    This lxml extension makes it easier to select elements by CSS class
    >>> ns = etree.FunctionNamespace(None)
    >>> ns['has-class'] = has_class
    >>> root = etree.XML('''
    ... <a>
    ...     <b class="one first text">I</b>
    ...     <b class="two text">LOVE</b>
    ...     <b class="three text">CSS</b>
    ... </a>
    ... ''')
    >>> len(root.xpath('//b[has-class("text")]'))
    3
    >>> len(root.xpath('//b[has-class("one")]'))
    1
    >>> len(root.xpath('//b[has-class("text", "first")]'))
    1
    >>> len(root.xpath('//b[not(has-class("first"))]'))
    2
    >>> len(root.xpath('//b[has-class("not-exists")]'))
    0
    """

    expressions = ' and '.join([CLASS_EXPR.format(c) for c in classes])
    xpath = 'self::*[@class and {}]'.format(expressions)
    return bool(context.context_node.xpath(xpath))

I think defining custom XPath functions is common practice across projects.

Issue Analytics

  • State: closed
  • Created: 8 years ago
  • Reactions: 1
  • Comments: 8 (8 by maintainers)

Top GitHub Comments

1 reaction
kmike commented, Jun 29, 2018

I’m closing this ticket, as parsel now has a built-in has_class function and provides a simplified way to register custom XPath functions (via parsel.xpathfuncs.set_xpathfunc). See http://parsel.readthedocs.io/en/latest/usage.html#other-xpath-extensions.

0 reactions
redapple commented, Nov 14, 2016

So I looked at this today and wanted to “benchmark” different implementations.

I compared:

  1. CSS selector to XPath (cssselect-based)
  2. the above proposal for has-class using an XPath call within the Python function (https://github.com/scrapy/parsel/issues/13#issue-100686360)
  3. another implementation using set comparisons (https://github.com/scrapy/scrapy/issues/753#issuecomment-51502883)
  4. another one using set but for 1 class only (when it makes sense)
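
The set-based approaches (3 and 4) boil down to a comparison on the whitespace-split class attribute. The sketch below isolates that logic in plain Python so it can be checked without lxml. The has_classes helper is a hypothetical name, not part of any library. Note that the subset operator is `<=`; the proper-subset operator `<` rejects an element whose class list is exactly the requested classes.

```python
def has_classes(class_attr, *classes):
    """Return True if every requested class appears in class_attr.

    class_attr is the raw value of an element's class attribute, or None
    when the attribute is missing.
    """
    if not class_attr:
        return False
    # <= is the subset test, so an exact match is accepted too
    return set(classes) <= set(class_attr.split())


print(has_classes("story theme-summary", "story"))  # True
print(has_classes("story", "story"))                # True (exact match)
print(has_classes("story-heading", "story"))        # False (no substring matching)
print(has_classes(None, "story"))                   # False (no class attribute)

# Proper subset: an exact match is rejected
print({"story"} < set("story".split()))             # False
```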

I used this script with the homepage of the New York Times, directly on the lxml-parsed document:

import timeit

"""
Looking at New York Times homepage:

Common classes:
[
('story', 147),
('story-heading', 145),
('theme-summary', 64),
('column', 58),
('collection', 50),
('section-heading', 49),
('story-link', 39),
('icon', 38),
('thumb', 35),
('ad', 33)
]

Many
<article class="story" ...
or
<article class="story theme-summary ...
"""

SETUP = '''
import lxml.etree
import lxml.html

CLASS_EXPR = "contains(concat(' ', normalize-space(@class), ' '), ' {} ')"

def has_class(context, *classes):
    """
    This lxml extension makes it easier to select elements by CSS class
    >>> ns = etree.FunctionNamespace(None)
    >>> ns['has-class'] = has_class
    """
    expressions = ' and '.join([CLASS_EXPR.format(c) for c in classes])
    xpath = 'self::*[@class and {}]'.format(expressions)
    return bool(context.context_node.xpath(xpath))

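def has_class_set(context, *classes):
    class_attr = context.context_node.get("class")
    if not class_attr:
        return False
    # <= tests for a subset; < (proper subset) would miss elements whose
    # class list is exactly the requested classes
    return set(classes) <= set(class_attr.split())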

def has_one_class(context, cls):
    return cls in context.context_node.get("class", "").split()

ns = lxml.etree.FunctionNamespace(None)
ns['has-class'] = has_class
ns['has-class-set'] = has_class_set
ns['has-one-class'] = has_one_class

url = 'http://www.nytimes.com/'
body = open('nytimes.html', 'rb').read()

doc = lxml.html.fromstring(body)

'''

N = 100

def _t(stmt, setup=SETUP, number=N, ref=None):
    v = timeit.timeit(stmt, setup, number=number)
    rel = 1.0 if ref is None else (v / ref)
    print('%-70s %6.3f %6.3f' % (stmt, v, rel))
    return v


ref = _t('doc.cssselect(".story")')
_t('doc.xpath("//*[has-class(\'story\')]")', ref=ref)
_t('doc.xpath("//*[has-class-set(\'story\')]")', ref=ref)
_t('doc.xpath("//*[has-one-class(\'story\')]")', ref=ref)
print("\n")

ref = _t('doc.cssselect("article.story")')
_t('doc.xpath("//article[has-class(\'story\')]")', ref=ref)
_t('doc.xpath("//article[has-class-set(\'story\')]")', ref=ref)
print("\n")

ref = _t('doc.cssselect("article.theme-summary.story")')
_t('doc.xpath("//article[has-class(\'theme-summary\', \'story\')]")', ref=ref)
_t('doc.xpath("//article[has-class-set(\'theme-summary\', \'story\')]")', ref=ref)

And this is what I get:

doc.cssselect(".story")                                                 0.334  1.000
doc.xpath("//*[has-class('story')]")                                    6.667 19.941
doc.xpath("//*[has-class-set('story')]")                                1.036  3.097
doc.xpath("//*[has-one-class('story')]")                                0.931  2.785


doc.cssselect("article.story")                                          0.065  1.000
doc.xpath("//article[has-class('story')]")                              0.630  9.708
doc.xpath("//article[has-class-set('story')]")                          0.125  1.933


doc.cssselect("article.theme-summary.story")                            0.082  1.000
doc.xpath("//article[has-class('theme-summary', 'story')]")             0.698  8.463
doc.xpath("//article[has-class-set('theme-summary', 'story')]")         0.130  1.581

So there always seems to be a non-negligible penalty for using custom XPath functions implemented in Python. Translating CSS selectors to XPath with cssselect looks faster in all cases. Can someone else double-check this?
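
One plausible explanation for the gap is that a pure XPath class test runs entirely inside the XPath engine, while a registered Python function is called back once per candidate node. The sketch below is not a benchmark. It only checks, on a small made-up document, that a hand-written native expression (reusing CLASS_EXPR from the issue) and a set-based extension function select the same elements. The sample markup is an assumption for illustration.

```python
from lxml import etree

# Pure XPath class test, as in the original proposal
CLASS_EXPR = "contains(concat(' ', normalize-space(@class), ' '), ' {} ')"


def has_class_set(context, *classes):
    # Python callback, invoked for each node the predicate is evaluated on
    class_attr = context.context_node.get("class")
    return bool(class_attr) and set(classes) <= set(class_attr.split())


etree.FunctionNamespace(None)["has-class-set"] = has_class_set

root = etree.XML("""
<a>
    <b class="one first text">I</b>
    <b class="two text">LOVE</b>
    <b class="three text">CSS</b>
    <b>unclassed</b>
</a>
""")

# Native expression: no Python code runs during evaluation
native = root.xpath("//b[@class and {}]".format(CLASS_EXPR.format("text")))

# Extension function: same selection, via the Python callback
custom = root.xpath('//b[has-class-set("text")]')

native_texts = [el.text for el in native]
custom_texts = [el.text for el in custom]
print(native_texts == custom_texts)
```

Since both forms return the same nodes here, a timing comparison such as the one above measures evaluation cost rather than a difference in results.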
