CDATA handling in HTML changed in lxml parser with libxml2 2.9.12
After upgrading the system libxml2 to 2.9.12 (or 2.9.11; 2.9.10 is the previous working version I have here), the following two tests fail with lxml built against the system library:
FAILED tests/test_extra/test_soup_contains.py::TestSoupContains::test_contains_cdata_html - AssertionError: Lists differ: ['1', '2'] != ['1']
FAILED tests/test_extra/test_soup_contains_own.py::TestSoupContainsOwn::test_contains_own_cdata_html - AssertionError: Lists differ: ['1', '2']...
The cause seems to be a different representation of CDATA:
soup = <html><body><div id="1">Testing that <span id="2"><![CDATA[that]]></span>contains works.</div></body>
</html>
(i.e. <![CDATA[... instead of <!--[CDATA[...)
Note that in order to reproduce you need to both upgrade libxml2 and build lxml against the new version. Binary wheels are statically linked to an old version of libxml2, so they do not reproduce the issue yet. For example, I have been able to reproduce it with tox after swapping the installed lxml version:
. .tox/py39/bin/activate
pip uninstall lxml
pip install lxml --no-binary lxml
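After the rebuild, it may help to confirm which libxml2 the installed lxml actually uses. The sketch below reads lxml's LIBXML_VERSION and LIBXML_COMPILED_VERSION attributes. The helper names and the 2.9.11 threshold are assumptions for illustration, taken only from the versions mentioned in this report, not from any upstream statement.

```python
def format_version(version_tuple):
    """Turn a version tuple such as (2, 9, 12) into '2.9.12'."""
    return ".".join(str(part) for part in version_tuple)


def cdata_behaviour_may_differ(version_tuple):
    """Hypothetical helper: True for libxml2 versions reported as affected.

    Assumption from this report: 2.9.10 works, 2.9.11 and 2.9.12 do not.
    """
    return tuple(version_tuple) >= (2, 9, 11)


try:
    from lxml import etree
except ImportError:
    etree = None  # lxml is not installed in this environment

if etree is not None:
    # Version of libxml2 loaded at runtime vs. the one lxml was compiled against.
    print("libxml2 in use:       ", format_version(etree.LIBXML_VERSION))
    print("libxml2 compiled with:", format_version(etree.LIBXML_COMPILED_VERSION))
    print("possibly affected:    ", cdata_behaviour_may_differ(etree.LIBXML_VERSION))
```

If the reported version is still 2.9.10 or older, the binary wheel is probably still installed and the failure should not reproduce.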
I am also not sure whether this is actually a bug in libxml2 or lxml instead.
Top GitHub Comments
Don’t worry, there’s no hurry. Worst case, we can ignore these two tests. But I’m going to try to figure out what changed first.
Hmm. I would first check whether this is expected behavior in lxml. I’m not sure how this change looks in pure lxml, or whether lxml is even handling things correctly. Can you test with pure lxml and see if it returns proper results?
If this is expected lxml behavior, I will then have to check if BeautifulSoup should be handling this differently.
Only if both of the above libraries are handling things as expected will I have to investigate what needs to change in Soup Sieve.
@gir-bot add S: more-info-needed
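A check along the lines requested above could look like the following sketch. It parses a fragment modeled on the markup quoted in the report, once with pure lxml and once with BeautifulSoup using the lxml parser, and prints what each one serializes. The function names and the exact fragment are assumptions for illustration. No expected output is shown because it depends on the libxml2 version lxml is built against.

```python
# Fragment modeled on the output quoted in the report (an assumption about
# the test input, not copied from the test suite).
MARKUP = (
    '<body><div id="1">Testing that <span id="2"><![CDATA[that]]></span>'
    "contains works.</div></body>"
)


def serialize_with_lxml(markup):
    """Parse with lxml's HTML parser and return the re-serialized tree.

    Returns None when lxml is not installed.
    """
    try:
        from lxml import etree
    except ImportError:
        return None
    root = etree.HTML(markup)  # backed by libxml2's HTML parser
    return etree.tostring(root, encoding="unicode")


def serialize_with_bs4(markup):
    """Parse with BeautifulSoup's lxml tree builder and return the result.

    Returns None when bs4 or its lxml builder is unavailable.
    """
    try:
        from bs4 import BeautifulSoup, FeatureNotFound
    except ImportError:
        return None
    try:
        return str(BeautifulSoup(markup, "lxml"))
    except FeatureNotFound:
        return None


if __name__ == "__main__":
    print("pure lxml:    ", serialize_with_lxml(MARKUP))
    print("BeautifulSoup:", serialize_with_bs4(MARKUP))
```

Running this once against lxml built with libxml2 2.9.10 and once against 2.9.11 or later should show whether the CDATA section is already represented differently before Soup Sieve is involved.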