CDATA handling in HTML changed in lxml parser with libxml2 2.9.12

After upgrading the system libxml2 to 2.9.12 (or 2.9.11; 2.9.10 is the previous working version I have here), the following two tests fail when lxml is built against the system library:

FAILED tests/test_extra/test_soup_contains.py::TestSoupContains::test_contains_cdata_html - AssertionError: Lists differ: ['1', '2'] != ['1']
FAILED tests/test_extra/test_soup_contains_own.py::TestSoupContainsOwn::test_contains_own_cdata_html - AssertionError: Lists differ: ['1', '2']...

The cause seems to be a different representation of CDATA:

        soup       = <html><body><div id="1">Testing that <span id="2">&lt;![CDATA[that]]&gt;</span>contains works.</div></body>
</html>

(i.e. &lt;![CDATA[... instead of <!--[CDATA[...)

Note that in order to reproduce you need to both upgrade libxml2 and build lxml against the new version. Binary wheels are statically linked to an old version of libxml2, so they do not reproduce the issue yet. For example, I have been able to reproduce it with tox after swapping the installed lxml version:

. .tox/py39/bin/activate
pip uninstall lxml
pip install lxml --no-binary lxml
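
After rebuilding, it can help to confirm which libxml2 the installed lxml actually uses. The sketch below reads lxml's `etree.LIBXML_VERSION` (runtime) and `etree.LIBXML_COMPILED_VERSION` (build time). `libxml2_versions` and `in_reported_range` are hypothetical helper names, and the version set only covers the releases named in this report; other releases are not covered here.

```python
def libxml2_versions():
    """Return (runtime, compiled) libxml2 version tuples, or None if lxml is missing."""
    try:
        from lxml import etree  # third-party; may be a wheel or a source build
    except ImportError:
        return None
    return etree.LIBXML_VERSION, etree.LIBXML_COMPILED_VERSION


def in_reported_range(version):
    """True only for the libxml2 versions this report names as failing."""
    return tuple(version[:3]) in {(2, 9, 11), (2, 9, 12)}


if __name__ == "__main__":
    versions = libxml2_versions()
    if versions is None:
        print("lxml is not installed")
    else:
        runtime, compiled = versions
        print("runtime libxml2:", runtime, "reported as failing:", in_reported_range(runtime))
        print("compiled against:", compiled)
```

If the runtime version still shows an older release after the reinstall, lxml is probably still coming from a binary wheel with its own bundled libxml2.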

I am also not sure whether this is really a bug in libxml2 or lxml instead.

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 21 (16 by maintainers)

Top GitHub Comments

mgorny commented, May 30, 2021

Don’t worry, there’s no hurry. Worst case, we can ignore these two tests. But I’m going to try to figure out what changed first.

facelessuser commented, May 29, 2021

Hmm. I would first check whether this is expected behavior in lxml. I’m not sure how this change looks in pure lxml, or whether lxml is even handling things correctly. Can you test pure lxml and see if it returns proper results?

If this is expected lxml behavior, I will then have to check if BeautifulSoup should be handling this differently.

Only if both of the above libraries are handling things as expected will I have to investigate what needs to change in Soup Sieve.

@gir-bot add S: more-info-needed
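
A rough sketch of the pure-lxml check requested above, bypassing BeautifulSoup and Soup Sieve. The `MARKUP` string is an assumption inferred from the printed soup in the report, and `serialize_with_lxml` and `classify` are hypothetical helpers. Which branch you get depends on the libxml2 that lxml is linked against.

```python
# Assumed input, reconstructed from the soup printed in the report.
MARKUP = (
    '<body><div id="1">Testing that '
    '<span id="2"><![CDATA[that]]></span>contains works.</div></body>'
)


def serialize_with_lxml(markup):
    """Parse with lxml's HTML parser and serialize back to a string."""
    from lxml import etree  # third-party; imported lazily

    root = etree.fromstring(markup, etree.HTMLParser())
    return etree.tostring(root, encoding="unicode", method="html")


def classify(serialized):
    """Name the CDATA representation found in serialized HTML."""
    if "<!--[CDATA[" in serialized:
        return "comment"
    if "&lt;![CDATA[" in serialized:
        return "escaped text"
    return "other"


if __name__ == "__main__":
    try:
        output = serialize_with_lxml(MARKUP)
    except ImportError:
        print("lxml is not installed")
    else:
        print(output)
        print("CDATA representation:", classify(output))
```

If this already reports escaped text with no BeautifulSoup involved, the change originates in lxml or libxml2 rather than in the layers above.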
