HTML code extraction from a node is not working
I recently installed Scrapy into a new environment, and now, when trying to get the HTML source of a node, the selector returns the node plus all of the markup that follows it in the source.
Note: I installed parsel with Scrapy into conda environments using the conda-forge channel.
Current behavior:
scrapy shell https://picsart.com/design-templates
In [1]: response.css('h1').get()
Out[1]: '<h1 class="banner__title___2T69O" data-test="primary-title">Anything-But-Basic Design Templates</h1><div class="banner__note___2usDR" data-test="primary-description"><p>Picsart’s free templates are easy to use and fitting for any special occasion.</p></div></div><div class="banner__buttonsContainer___1rJTU"><div class="actionButton__blueAction___2HAym actionButton__actionContainer___26zdu"><a class="actionButton__action___1-L0i root-0-2-30 primary-0-2-31 responsive-0-2-33" data-test="primary-button" href="/create/editor">Try Templates</a></div></div></div></div><div class="banner__imageBlock___3OBoa"><div class="banner__imageHolder___2QGdJ"><picture class="root-0-2-42 banner__image___LKEfe"><source type="image/webp" media="(min-width: 1365px)" srcset="https://cdn130.picsart.com/46322149886354257770.png?type=webp&to=min&r=1365"></source><source type="image/webp" media="(min-width: 1023px)" srcset="https://cdn130.picsart.com/46322149886354257770.png?type=webp&to=min&r=1023"></source>
... I CROPPED THE OUTPUT ...'
This environment consists of:
- Scrapy 2.5.0
- Versions: lxml 4.6.3.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.8.10
Previous behavior:
scrapy shell https://picsart.com/design-templates
In [1]: response.css('h1').get()
Out[1]: '<h1 class="banner__title___2T69O" data-test="primary-title">Anything-But-Basic Design Templates</h1>'
The previous environment consisted of:
- Scrapy 2.5.0
- Versions: lxml 4.5.2.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.5

The previous behavior is preferred. Is this an issue, or is it the standard way it should behave now?
I tried downgrading lxml to 4.5.2, since parsel has only a few dependencies and lxml is the only one that differs between these two environments, but nothing changed.
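To see which libraries an environment is actually using, lxml exposes its own version and the libxml2 version it is running against. A quick check (a generic sketch, not specific to this issue):

```python
# Print the lxml version plus the libxml2 it was compiled against and
# the one loaded at runtime, to verify which library is actually in use.
from lxml import etree

print(etree.LXML_VERSION)             # e.g. (4, 6, 3, 0)
print(etree.LIBXML_COMPILED_VERSION)  # libxml2 version lxml was built against
print(etree.LIBXML_VERSION)           # libxml2 version loaded at runtime
```

If `LIBXML_VERSION` reports (2, 9, 12), downgrading lxml alone will not help, since the problem is in libxml2 itself.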
Issue Analytics
- Created: 2 years ago
- Reactions: 1
- Comments: 7 (2 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Could be linked to this: https://gitlab.gnome.org/GNOME/libxml2/-/issues/255
Edit: definitely linked to that issue; downgrading to libxml2 2.9.10 fixes the problem.
AFAICS this is fixed in newer libxml2, so I don't think this should stay open.
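For a conda-forge environment like the one described above, the downgrade/pin can be expressed as a package spec (a sketch; the exact spec syntax depends on your conda version):

```shell
# Avoid the broken libxml2 2.9.12 by excluding it from the environment.
conda install -c conda-forge "libxml2!=2.9.12"
```

Alternatively, pinning libxml2 to 2.9.10 (as in the working environment) achieves the same result.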