Selector mis-parses html when elements have very large bodies

This was discovered by a Reddit user on an Amazon page with an absurdly long <script> tag, and I was able to boil the bad outcome down to a reproducible test case.

what is expected

Selector(html).css('h1') should produce all h1 elements within the document.

what actually happens

Selector(html).css('h1') produces only the h1 elements that come before the element containing a very large body. Neither xml.etree nor html5lib suffers from this defect.

pip install html5lib==1.0.1
pip install parsel==1.4.0

import html5lib
import parsel
import time

try:
    from xml.etree import cElementTree as ElementTree
except ImportError:
    from xml.etree import ElementTree

bad_len = 21683148
bad = 'a' * bad_len
bad_html = '''
<html>
    <body>
      <h1>pre-div h1</h1>
      <div>
        <h1>pre-script h1</h1>
        <p>var bogus = "{}"</p>
        <h1>hello I am eclipsed</h1>
      </div>
      <h1>post-div h1</h1>
    </body>
</html>
'''.format(bad)
t0 = time.time()
sel = parsel.Selector(text=bad_html)
t1 = time.time()
print('Selector.time={}'.format(t1 - t0))
for idx, h1 in enumerate(sel.xpath('//h1').extract()):
    print('h1[{} = {}'.format(idx, h1))

print('ElementTree')
t0 = time.time()
doc = ElementTree.fromstring(bad_html)
t1 = time.time()
print('ElementTree.time={}'.format(t1 - t0))
for idx, h1 in enumerate(doc.findall('.//h1')):
    print('h1[{}].txt = <<{}>>'.format(h1, h1.text))

print('html5lib')
t0 = time.time()
#: :type: xml.etree.ElementTree.Element
doc = html5lib.parse(bad_html, namespaceHTMLElements=False)
t1 = time.time()
print('html5lib.time={}'.format(t1 - t0))
for idx, h1 in enumerate(doc.findall('.//h1')):
    print('h1[{}].txt = <<{}>>'.format(h1, h1.text))

This produces the output:

Selector.time=0.3661611080169678
h1[0 = <h1>pre-div h1</h1>
h1[1 = <h1>pre-script h1</h1>
ElementTree
ElementTree.time=0.1052100658416748
h1[<Element 'h1' at 0x103029bd8>].txt = <<pre-div h1>>
h1[<Element 'h1' at 0x103029c78>].txt = <<pre-script h1>>
h1[<Element 'h1' at 0x103029d18>].txt = <<hello I am eclipsed>>
h1[<Element 'h1' at 0x103029d68>].txt = <<post-div h1>>
html5lib
html5lib.time=2.255831003189087
h1[<Element 'h1' at 0x107395098>].txt = <<pre-div h1>>
h1[<Element 'h1' at 0x1073951d8>].txt = <<pre-script h1>>
h1[<Element 'h1' at 0x107395318>].txt = <<hello I am eclipsed>>
h1[<Element 'h1' at 0x1073953b8>].txt = <<post-div h1>>

Top GitHub Comments

stranac commented, Mar 13, 2018

I dislike version checking and would prefer to just require new lxml with new parsel, but I’m not the one making the decision.

How would lower versions be handled? Do we return a selector that behaves differently depending on lxml version? Do we issue a warning?

I think for having the option to disable it, having a keyword arg for Selector.__init__ makes the most sense.

Langdi commented, Jul 10, 2018

I opened a pull request for this issue, but the question @stranac raises still stands. I think the main problems are these two scenarios:

lxml supports huge_tree but it is disabled via the argument (see source code).
lxml doesn’t support huge_tree but it is enabled (either passed explicitly or on by default).

How would you suggest handling these cases? Right now, I implemented it so that both scenarios fail and raise a ValueError.
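
To make the second scenario concrete, here is a minimal sketch of the kind of version gate being discussed. It is not the code from the pull request: the helper names version_tuple and parser_options are hypothetical, and the minimum lxml version is passed in as a parameter because the thread does not state which release added huge_tree. The first scenario is not modelled, since the thread does not say how it would be detected.

```python
import itertools


def version_tuple(text):
    """Turn a dotted version such as '4.2.1' into a comparable tuple of ints."""
    parts = []
    for piece in text.split('.'):
        # Keep only the leading digits of each component ('0b1' -> '0').
        digits = ''.join(itertools.takewhile(str.isdigit, piece))
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def parser_options(huge_tree, lxml_version, min_version):
    """Return keyword arguments for the parser, or raise ValueError.

    huge_tree    -- the value the caller passed (or the default)
    lxml_version -- version string of the installed lxml
    min_version  -- assumed first version that accepts huge_tree
    """
    supported = version_tuple(lxml_version) >= version_tuple(min_version)
    if huge_tree and not supported:
        # Second scenario from the comment: requested but unavailable.
        raise ValueError(
            'huge_tree requested but lxml {} is older than {}'.format(
                lxml_version, min_version))
    if not supported:
        # Old lxml and huge_tree off: do not pass the keyword at all.
        return {}
    return {'huge_tree': huge_tree}


# '4.2' below is an illustrative threshold, not a confirmed release number.
print(parser_options(True, '4.2.1', '4.2'))
```

Raising here is the strict choice described in the comment; issuing a warning and continuing without the option would be the softer alternative that @stranac's question hints at.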
