Selector mis-parses html when elements have very large bodies
This was discovered by a Reddit user, concerning an Amazon page with an absurdly long <script> tag, but I was able to boil the bad outcome down into a reproducible test case.
what is expected
Selector(html).css('h1') should produce all h1 elements within the document.
what actually happens
Selector(html).css('h1') produces only the h1 elements before the element containing a very large body. Neither xml.etree nor html5lib suffers from this defect.
pip install html5lib==1.0.1
pip install parsel==1.4.0
import html5lib
import parsel
import time
try:
    from xml.etree import cElementTree as ElementTree
except ImportError:
    from xml.etree import ElementTree
bad_len = 21683148
bad = 'a' * bad_len
bad_html = '''
<html>
<body>
<h1>pre-div h1</h1>
<div>
<h1>pre-script h1</h1>
<p>var bogus = "{}"</p>
<h1>hello I am eclipsed</h1>
</div>
<h1>post-div h1</h1>
</body>
</html>
'''.format(bad)
t0 = time.time()
sel = parsel.Selector(text=bad_html)
t1 = time.time()
print('Selector.time={}'.format(t1 - t0))
for idx, h1 in enumerate(sel.xpath('//h1').extract()):
    print('h1[{} = {}'.format(idx, h1))
print('ElementTree')
t0 = time.time()
doc = ElementTree.fromstring(bad_html)
t1 = time.time()
print('ElementTree.time={}'.format(t1 - t0))
for idx, h1 in enumerate(doc.findall('.//h1')):
    print('h1[{}].txt = <<{}>>'.format(h1, h1.text))
print('html5lib')
t0 = time.time()
#: :type: xml.etree.ElementTree.Element
doc = html5lib.parse(bad_html, namespaceHTMLElements=False)
t1 = time.time()
print('html5lib.time={}'.format(t1 - t0))
for idx, h1 in enumerate(doc.findall('.//h1')):
    print('h1[{}].txt = <<{}>>'.format(h1, h1.text))
produces the output
Selector.time=0.3661611080169678
h1[0 = <h1>pre-div h1</h1>
h1[1 = <h1>pre-script h1</h1>
ElementTree
ElementTree.time=0.1052100658416748
h1[<Element 'h1' at 0x103029bd8>].txt = <<pre-div h1>>
h1[<Element 'h1' at 0x103029c78>].txt = <<pre-script h1>>
h1[<Element 'h1' at 0x103029d18>].txt = <<hello I am eclipsed>>
h1[<Element 'h1' at 0x103029d68>].txt = <<post-div h1>>
html5lib
html5lib.time=2.255831003189087
h1[<Element 'h1' at 0x107395098>].txt = <<pre-div h1>>
h1[<Element 'h1' at 0x1073951d8>].txt = <<pre-script h1>>
h1[<Element 'h1' at 0x107395318>].txt = <<hello I am eclipsed>>
h1[<Element 'h1' at 0x1073953b8>].txt = <<post-div h1>>
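The output above shows Selector returning two of the four h1 elements. One way to turn that into a repeatable check is to compute a reference count with a parser that does not depend on lxml. The following is a minimal sketch using only the standard library's html.parser; the names H1Counter, build_html and count_h1 are made up for this illustration and are not part of parsel or of the original test case.

```python
from html.parser import HTMLParser


class H1Counter(HTMLParser):
    """Counts opening <h1> tags seen while parsing (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names before calling this hook.
        if tag == "h1":
            self.count += 1


def build_html(body_len):
    """Build a document shaped like the test case above, with one text
    node padded to body_len characters."""
    filler = "a" * body_len
    return (
        "<html><body>"
        "<h1>pre-div h1</h1>"
        "<div>"
        "<h1>pre-script h1</h1>"
        '<p>var bogus = "{}"</p>'
        "<h1>hello I am eclipsed</h1>"
        "</div>"
        "<h1>post-div h1</h1>"
        "</body></html>"
    ).format(filler)


def count_h1(html):
    """Return the number of <h1> start tags found in html."""
    parser = H1Counter()
    parser.feed(html)
    parser.close()
    return parser.count
```

The value of count_h1(build_html(bad_len)) could then be compared with len(sel.xpath('//h1')) from the script above; a mismatch would indicate the truncation described in this issue.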
Issue Analytics
- State:
- Created 6 years ago
- Comments:7 (2 by maintainers)
Top GitHub Comments
I dislike version checking and would prefer to just require new lxml with new parsel, but I’m not the one making the decision.
How would lower versions be handled? Do we return a selector that behaves differently depending on lxml version? Do we issue a warning?
As for the option to disable it, I think a keyword arg for Selector.__init__ makes the most sense.
I opened a pull request for this issue, but the question @stranac raises still stands. I think the main problems are these scenarios:
- huge_tree but it is disabled via the argument (see source code)
- huge_tree but it is enabled (either passed or on via default).
How would you suggest handling these cases? Right now, I implemented it so that both scenarios fail and raise a ValueError.
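To make the version question concrete, here is a minimal sketch of how a huge_tree keyword argument could be reconciled with the installed lxml version. It covers only one reading of the scenarios above: the option is requested but the lxml version does not offer it, which raises a ValueError. The function name resolve_huge_tree and the default min_version of (4, 2) are assumptions made for this illustration, not the pull request's actual code.

```python
def resolve_huge_tree(requested, lxml_version, min_version=(4, 2)):
    """Decide whether huge_tree should be passed to the HTML parser.

    requested:    value of the (hypothetical) huge_tree keyword argument.
    lxml_version: version tuple such as lxml.etree.LXML_VERSION.
    min_version:  ASSUMED first lxml release whose HTML parser accepts
                  huge_tree; adjust if that assumption is wrong.
    """
    supported = tuple(lxml_version[:2]) >= tuple(min_version)
    if requested and not supported:
        # Fail loudly instead of silently returning a selector that
        # behaves differently depending on the lxml version.
        raise ValueError(
            "huge_tree=True requires lxml >= {}.{}".format(*min_version)
        )
    return bool(requested) and supported
```

A caller would pass the installed version, for example resolve_huge_tree(True, lxml.etree.LXML_VERSION), and forward the result to the parser. Issuing a warning instead of raising is the other option raised in the comments; this sketch does not settle that choice.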