HTML code extraction from node is not working

I recently installed Scrapy into a new environment, and now, when I try to get the HTML source of a node, the selector returns the node plus all of the markup that follows it in the page source.

Note: I installed parsel with Scrapy into conda environments using the conda-forge channel.

Current behavior:

scrapy shell https://picsart.com/design-templates
In [1]: response.css('h1').get()
Out[1]: '<h1 class="banner__title___2T69O" data-test="primary-title">Anything-But-Basic Design Templates</h1><div class="banner__note___2usDR" data-test="primary-description"><p>Picsart’s free templates are easy to use and fitting for any special occasion.</p></div></div><div class="banner__buttonsContainer___1rJTU"><div class="actionButton__blueAction___2HAym actionButton__actionContainer___26zdu"><a class="actionButton__action___1-L0i root-0-2-30 primary-0-2-31 responsive-0-2-33" data-test="primary-button" href="/create/editor">Try Templates</a></div></div></div></div><div class="banner__imageBlock___3OBoa"><div class="banner__imageHolder___2QGdJ"><picture class="root-0-2-42 banner__image___LKEfe"><source type="image/webp" media="(min-width: 1365px)" srcset="https://cdn130.picsart.com/46322149886354257770.png?type=webp&amp;to=min&amp;r=1365"></source><source type="image/webp" media="(min-width: 1023px)" srcset="https://cdn130.picsart.com/46322149886354257770.png?type=webp&amp;to=min&amp;r=1023"></source>
... I CROPPED THE OUTPUT ...'

This environment consists of:

  • Scrapy 2.5.0
  • Versions: lxml 4.6.3.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.8.10

Previous behavior:

scrapy shell https://picsart.com/design-templates
In [1]: response.css('h1').get()
Out[1]: '<h1 class="banner__title___2T69O" data-test="primary-title">Anything-But-Basic Design Templates</h1>'

This environment consists of:

  • Scrapy 2.5.0
  • Versions: lxml 4.5.2.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.5

The previous behavior is preferred. Is this a bug, or is this how it is supposed to behave now?

I tried downgrading lxml to 4.5.2, since parsel has only a few dependencies and lxml is the only one that differs between the two environments, but nothing changed.
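
To tell the two behaviors apart in a script rather than by eye, a small check along these lines might help. It is only a sketch: the helper name `extends_past_element` is made up for illustration, it works on the string returned by `response.css('h1').get()`, and it assumes the tag does not nest inside itself (reasonable for `h1`, not for `div`).

```python
def extends_past_element(outer_html: str, tag: str) -> bool:
    """Return True if anything other than whitespace follows the first
    closing tag for `tag` in a serialized element.

    Hypothetical helper, not part of Scrapy or parsel. Assumes `tag`
    is not nested inside itself.
    """
    closing = f"</{tag}>"
    end = outer_html.find(closing)
    if end == -1:
        # No closing tag at all, so this is not a cleanly serialized element.
        return True
    return bool(outer_html[end + len(closing):].strip())


# The "previous behavior" output from the report: a single element.
expected = (
    '<h1 class="banner__title___2T69O" data-test="primary-title">'
    "Anything-But-Basic Design Templates</h1>"
)
# The start of the "current behavior" output: the element plus trailing markup.
observed = expected + '<div class="banner__note___2usDR" data-test="primary-description">'

print(extends_past_element(expected, "h1"))
print(extends_past_element(observed, "h1"))
```

In the Scrapy shell, the same function could be called as `extends_past_element(response.css('h1').get(), 'h1')`.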

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Reactions: 1
  • Comments: 7 (2 by maintainers)

Top GitHub Comments

4 reactions
borealbusiness commented, Sep 6, 2021

Could be linked to this: https://gitlab.gnome.org/GNOME/libxml2/-/issues/255

Edit: definitely linked to that issue; downgrading to libxml2 2.9.10 fixes the problem.

1 reaction
wRAR commented, Oct 29, 2022

AFAICS this is fixed in newer libxml2 so I don’t think this should stay open.
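
Since the comments above tie the behavior to the libxml2 version rather than to lxml itself, it may be worth checking which libxml2 the installed lxml is actually using. The sketch below reads `lxml.etree.LIBXML_VERSION` (a version tuple exposed by lxml) and compares it only against the two versions named in this thread. The `classify_libxml2` helper and its labels are illustrative, and versions not mentioned here are deliberately left unclassified.

```python
# Versions taken from the report above; anything else is left unclassified.
REPORTED_BROKEN = {(2, 9, 12)}
REPORTED_WORKING = {(2, 9, 10)}


def classify_libxml2(version):
    """Label a libxml2 version tuple using only what this thread reports.

    Hypothetical helper for illustration, not part of lxml.
    """
    version = tuple(version[:3])
    if version in REPORTED_BROKEN:
        return "reported broken"
    if version in REPORTED_WORKING:
        return "reported working"
    return "not mentioned in this thread"


if __name__ == "__main__":
    try:
        from lxml import etree
    except ImportError:
        print("lxml is not installed in this environment")
    else:
        # LIBXML_VERSION is the libxml2 version lxml is running against.
        runtime = etree.LIBXML_VERSION
        print("libxml2 in use:", runtime, "->", classify_libxml2(runtime))
```

Running `scrapy version -v` prints the same libxml2 version alongside the other dependencies, as in the environment listings in the report.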
