HTML code extraction from a node is not working
I recently installed Scrapy into a new environment, and now, when trying to get the HTML source of a node, the selector returns the node plus all of the markup that follows it in the source.
Note: I installed parsel with Scrapy into conda environments using the conda-forge channel.
Current behavior:
scrapy shell https://picsart.com/design-templates
In [1]: response.css('h1').get()
Out[1]: '<h1 class="banner__title___2T69O" data-test="primary-title">Anything-But-Basic Design Templates</h1><div class="banner__note___2usDR" data-test="primary-description"><p>Picsart’s free templates are easy to use and fitting for any special occasion.</p></div></div><div class="banner__buttonsContainer___1rJTU"><div class="actionButton__blueAction___2HAym actionButton__actionContainer___26zdu"><a class="actionButton__action___1-L0i root-0-2-30 primary-0-2-31 responsive-0-2-33" data-test="primary-button" href="/create/editor">Try Templates</a></div></div></div></div><div class="banner__imageBlock___3OBoa"><div class="banner__imageHolder___2QGdJ"><picture class="root-0-2-42 banner__image___LKEfe"><source type="image/webp" media="(min-width: 1365px)" srcset="https://cdn130.picsart.com/46322149886354257770.png?type=webp&to=min&r=1365"></source><source type="image/webp" media="(min-width: 1023px)" srcset="https://cdn130.picsart.com/46322149886354257770.png?type=webp&to=min&r=1023"></source>
... I CROPPED THE OUTPUT ...'
This environment consists of:
- Scrapy 2.5.0
- Versions: lxml 4.6.3.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.8.10
Previous behavior:
scrapy shell https://picsart.com/design-templates
In [1]: response.css('h1').get()
Out[1]: '<h1 class="banner__title___2T69O" data-test="primary-title">Anything-But-Basic Design Templates</h1>'
The previous environment consisted of:
- Scrapy 2.5.0
- Versions: lxml 4.5.2.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.5

The previous behavior is preferred. Is this an issue, or is it the standard way it should behave now?
I tried downgrading lxml to 4.5.2, since parsel has only a few dependencies and lxml is the only one that differs between these two environments, but nothing changed.
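To see which libraries an environment is actually using, lxml exposes its own version and the libxml2 version it is running against. A quick check (a generic sketch, not specific to this issue):

```python
# Print the lxml version plus the libxml2 it was compiled against and
# the one loaded at runtime, to verify which library is actually in use.
from lxml import etree

print(etree.LXML_VERSION)             # e.g. (4, 6, 3, 0)
print(etree.LIBXML_COMPILED_VERSION)  # libxml2 version lxml was built against
print(etree.LIBXML_VERSION)           # libxml2 version loaded at runtime
```

If `LIBXML_VERSION` reports (2, 9, 12), downgrading lxml alone will not help, since the problem is in libxml2 itself.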
Issue Analytics
- Created: 2 years ago
- Reactions: 1
- Comments: 7 (2 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Could be linked to this: https://gitlab.gnome.org/GNOME/libxml2/-/issues/255
Edit: definitely linked to that issue; downgrading to libxml2 2.9.10 fixes the problem.
AFAICS this is fixed in newer libxml2, so I don't think this should stay open.
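For a conda-forge environment like the one described above, the downgrade/pin can be expressed as a package spec (a sketch; the exact spec syntax depends on your conda version):

```shell
# Avoid the broken libxml2 2.9.12 by excluding it from the environment.
conda install -c conda-forge "libxml2!=2.9.12"
```

Alternatively, pinning libxml2 to 2.9.10 (as in the working environment) achieves the same result.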