Guarantee a Cheerio.load(dom) overload
Since there is no built-in stream-reading method in Cheerio (see the discussion), I have built my own:
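// Assumption: `htmlparser` refers to the htmlparser2 package.
const cheerio = require('cheerio');
const htmlparser = require('htmlparser2');

function fromStream(stream) {
  return new Promise((resolve, reject) => {
    // DomHandler invokes this callback once the whole document has been parsed.
    const parser = new htmlparser.Parser(new htmlparser.DomHandler((err, dom) => {
      if (err) {
        reject(err);
      } else {
        resolve(cheerio.load(dom)); // <-- Not public API!
      }
    }));
    // pipe() does not forward errors, so listen on both source and parser.
    stream.on('error', reject)
      .pipe(parser)
      .on('error', reject);
  });
}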
Even though the call cheerio.load(dom) works*, it does not conform to Cheerio’s public API, which states that load only accepts a string (cf. README, code).
Could the public API be extended to include a Cheerio.load(dom) overload, where dom is a DOM tree compatible with the output produced by htmlparser.DomHandler?
*) see https://github.com/IonicaBizau/scrape-it/issues/83#issuecomment-353850115.
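The promise plumbing in fromStream can also be written with Node's built-in stream.pipeline, which reports an error from any stage through a single callback. Below is a minimal sketch of that pattern, not the issue author's code: collectingSink and textFromStream are hypothetical names, and a plain Writable stands in for the htmlparser parser so the example runs without third-party packages.

```javascript
const { Readable, Writable, pipeline } = require('stream');

// Hypothetical stand-in for the parser: collects chunks and hands the joined
// text to `onDone` when the input ends, similar to DomHandler's callback.
function collectingSink(onDone) {
  const chunks = [];
  return new Writable({
    write(chunk, _encoding, callback) {
      chunks.push(chunk.toString());
      callback();
    },
    final(callback) {
      onDone(chunks.join(''));
      callback();
    },
  });
}

// Same shape as fromStream: resolve with the result, reject on any stream error.
function textFromStream(stream) {
  return new Promise((resolve, reject) => {
    let result;
    const sink = collectingSink((text) => { result = text; });
    // pipeline() calls back once, with an error from either stream or with none.
    pipeline(stream, sink, (err) => {
      if (err) {
        reject(err);
      } else {
        resolve(result);
      }
    });
  });
}

// Usage: logs the concatenated chunks.
textFromStream(Readable.from(['<p>', 'hi', '</p>'])).then(console.log);
```

With a real parser in place of the sink, the resolve branch is where cheerio.load(dom) would be called.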
Issue Analytics
- State:
- Created 6 years ago
- Reactions:5
- Comments:8 (2 by maintainers)
Top GitHub Comments
@coryarmbrecht The streams I mentioned above and (afaik) parse5’s ParserStream only solve the problem that, without such streaming approaches, you would need to hold all of the HTML in memory. Why would you need to store all the HTML in memory if you were feeding it into the parser chunk by chunk anyway?
What you are describing is called SAX parsing in the case of XML, for example. With a quick search I found sax-js, but I have no idea how up to date it is.
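To illustrate the SAX idea (events are emitted as input arrives, and only the unfinished tail of the input is kept), here is a toy sketch. It is not sax-js, htmlparser2, or parse5: TagScanner and its handler names are hypothetical, and it deliberately ignores comments, scripts, and '>' inside attribute values.

```javascript
// Toy SAX-style scanner. NOT a real HTML parser; all names are hypothetical.
class TagScanner {
  constructor(handlers) {
    this.handlers = handlers;
    this.pending = ''; // only the unprocessed tail is kept in memory
  }

  write(chunk) {
    this.pending += chunk;
    let lt;
    while ((lt = this.pending.indexOf('<')) !== -1) {
      const gt = this.pending.indexOf('>', lt);
      if (gt === -1) break; // tag is split across chunks; wait for more input
      if (lt > 0) this.handlers.onText(this.pending.slice(0, lt));
      const tag = this.pending.slice(lt + 1, gt).trim();
      if (tag.startsWith('/')) {
        this.handlers.onCloseTag(tag.slice(1).trim());
      } else {
        // Tag name ends at the first whitespace or '/' (e.g. "br/").
        this.handlers.onOpenTag(tag.split(/[\s/]/)[0]);
      }
      this.pending = this.pending.slice(gt + 1);
    }
    // Text with no '<' in it cannot be part of a tag, so emit it right away.
    if (lt === -1 && this.pending) {
      this.handlers.onText(this.pending);
      this.pending = '';
    }
  }

  end() {
    // Whatever is left (including an unterminated tag) is reported as text.
    if (this.pending) this.handlers.onText(this.pending);
    this.pending = '';
  }
}

// Usage: collect the text of every <li>, feeding the input in arbitrary chunks.
const items = [];
let insideLi = false;
const scanner = new TagScanner({
  onOpenTag: (name) => { if (name === 'li') insideLi = true; },
  onText: (text) => { if (insideLi) items.push(text); },
  onCloseTag: (name) => { if (name === 'li') insideLi = false; },
});
for (const chunk of ['<ul><li cla', 'ss="x">one</l', 'i><li>two</li></ul>']) {
  scanner.write(chunk);
}
scanner.end();
console.log(items);
```

The trade-off is that handlers only see one event at a time, so selector-style queries such as those Cheerio offers need either a full DOM or extra state tracked by the handlers.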
Glad to see there’s development here! I just hit this snag while converting my synchronous Node script to use streams. @ComFreek I have looked at your nested links, but they are beyond my knowledge.
Is fragment support a requirement for streaming to Cheerio selectors, like $('a.new-link').each? I guess it comes down to how chunks are separated, and it makes sense that you need to wait for certain tags (large containers) to be closed.
If I wanted to start going in your direction and try to get Cheerio to work with streams (I was thinking of a through stream), where should I start? It sounds like without fragment support, I can’t just do something like: