Selector mis-parses html when elements have very large bodies

This was discovered by a Reddit user on an Amazon page with an absurdly long <script> tag, but I was able to boil the bad outcome down into a reproducible test case.

what is expected

Selector(html).css('h1') should produce all h1 elements within the document

what actually happens

Selector(html).css('h1') produces only the h1 elements that appear before the element containing a very large body. Neither xml.etree nor html5lib suffers from this defect.


# Requires:
#   pip install html5lib==1.0.1
#   pip install parsel==1.4.0
import html5lib
import parsel
import time

try:
    from xml.etree import cElementTree as ElementTree
except ImportError:
    from xml.etree import ElementTree

bad_len = 21683148
bad = 'a' * bad_len
bad_html = '''
<html>
    <body>
      <h1>pre-div h1</h1>
      <div>
        <h1>pre-script h1</h1>
        <p>var bogus = "{}"</p>
        <h1>hello I am eclipsed</h1>
      </div>
      <h1>post-div h1</h1>
    </body>
</html>
'''.format(bad)
t0 = time.time()
sel = parsel.Selector(text=bad_html)
t1 = time.time()
print('Selector.time={}'.format(t1 - t0))
for idx, h1 in enumerate(sel.xpath('//h1').extract()):
    print('h1[{} = {}'.format(idx, h1))

print('ElementTree')
t0 = time.time()
doc = ElementTree.fromstring(bad_html)
t1 = time.time()
print('ElementTree.time={}'.format(t1 - t0))
for idx, h1 in enumerate(doc.findall('.//h1')):
    print('h1[{}].txt = <<{}>>'.format(h1, h1.text))

print('html5lib')
t0 = time.time()
#: :type: xml.etree.ElementTree.Element
doc = html5lib.parse(bad_html, namespaceHTMLElements=False)
t1 = time.time()
print('html5lib.time={}'.format(t1 - t0))
for idx, h1 in enumerate(doc.findall('.//h1')):
    print('h1[{}].txt = <<{}>>'.format(h1, h1.text))

produces the output

Selector.time=0.3661611080169678
h1[0 = <h1>pre-div h1</h1>
h1[1 = <h1>pre-script h1</h1>
ElementTree
ElementTree.time=0.1052100658416748
h1[<Element 'h1' at 0x103029bd8>].txt = <<pre-div h1>>
h1[<Element 'h1' at 0x103029c78>].txt = <<pre-script h1>>
h1[<Element 'h1' at 0x103029d18>].txt = <<hello I am eclipsed>>
h1[<Element 'h1' at 0x103029d68>].txt = <<post-div h1>>
html5lib
html5lib.time=2.255831003189087
h1[<Element 'h1' at 0x107395098>].txt = <<pre-div h1>>
h1[<Element 'h1' at 0x1073951d8>].txt = <<pre-script h1>>
h1[<Element 'h1' at 0x107395318>].txt = <<hello I am eclipsed>>
h1[<Element 'h1' at 0x1073953b8>].txt = <<post-div h1>>

Issue Analytics

  • State: open
  • Created 6 years ago
  • Comments: 7 (2 by maintainers)

Top GitHub Comments

1 reaction
stranac commented, Mar 13, 2018

I dislike version checking and would prefer to just require new lxml with new parsel, but I’m not the one making the decision.

How would lower versions be handled? Do we return a selector that behaves differently depending on lxml version? Do we issue a warning?

I think for having the option to disable it, having a keyword arg for Selector.__init__ makes the most sense.

0 reactions
Langdi commented, Jul 10, 2018

I opened a pull request for this issue, but the question @stranac raises still stands. I think the main problems are these two scenarios:

  1. lxml supports huge_tree but it is disabled via the argument (see source code)
  2. lxml doesn’t support huge_tree but it is enabled (either passed or on via default).

How would you suggest handling these cases? Right now, my implementation makes both scenarios fail with a ValueError.
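
To make the trade-off concrete, here is a minimal sketch of one possible policy, not the code from the pull request. The helper name `resolve_huge_tree`, its `strict` flag, and the 4.2 version threshold are assumptions made for illustration; the version tuple is meant to have the same shape as `lxml.etree.LXML_VERSION`. Unlike the behaviour described above, this sketch treats an explicit opt-out (scenario 1) as valid and only complains about scenario 2, either by raising or by warning and falling back.

```python
import warnings

# Assumption: lxml's HTMLParser accepts huge_tree starting with 4.2.
HUGE_TREE_MIN_LXML = (4, 2)


def resolve_huge_tree(requested, lxml_version, strict=True):
    """Decide whether huge_tree should be passed to the HTML parser.

    requested:    what the caller asked for (True or False).
    lxml_version: version tuple shaped like lxml.etree.LXML_VERSION.
    strict:       raise on an unsupported request instead of warning.
    """
    supported = tuple(lxml_version[:2]) >= HUGE_TREE_MIN_LXML
    if requested and not supported:
        message = "huge_tree=True needs lxml >= 4.2, found {}".format(
            ".".join(str(part) for part in lxml_version))
        if strict:
            # Scenario 2, strict policy: refuse to build the selector.
            raise ValueError(message)
        # Scenario 2, lenient policy: warn and fall back to the old limits.
        warnings.warn(message)
        return False
    # Scenario 1 (supported but disabled) is accepted as an explicit opt-out.
    return requested
```

With `strict=False`, older lxml installations keep working with the truncation behaviour from the report, and the warning tells the user why large elements may be cut off.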
