Guarantee a Cheerio.load(dom) overload

See original GitHub issue

Since there is no built-in stream-reading method in Cheerio (see the discussion), I have built my own:

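// Assumption: `htmlparser` refers to the htmlparser2 package.
const htmlparser = require('htmlparser2');
const cheerio = require('cheerio');

// Parses a readable stream of HTML and resolves with a Cheerio instance.
function fromStream(stream) {
	return new Promise((resolve, reject) => {
		const parser = new htmlparser.Parser(new htmlparser.DomHandler((err, dom) => {
			if (err) {
				reject(err);
			} else {
				resolve(cheerio.load(dom)); // <-- Not public API!
			}
		}));

		stream.on('error', reject)
			.pipe(parser)
			.on('error', reject);
	});
}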

Even though the call cheerio.load(dom) works*, it actually does not conform to Cheerio’s public API, which states that load only accepts a string (cf. README, code).

Could the public API be extended to include a Cheerio.load(dom) overload, where dom is a DOM tree compatible with the output produced by htmlparser.DomHandler?

*) see https://github.com/IonicaBizau/scrape-it/issues/83#issuecomment-353850115.

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Reactions: 5
  • Comments: 8 (2 by maintainers)

Top GitHub Comments

3 reactions
ComFreek commented, Feb 8, 2018

@coryarmbrecht The streams I mentioned above and (afaik) parse5’s ParserStream only address the problem that, without such a streaming approach, you would need to hold all of the HTML in memory. Why would you need to store all the HTML in memory if you were to feed it into the parser chunk-by-chunk anyway?

What you are describing is called SAX parsing in the case of XML, for example. With a quick search I found sax-js, but I have no idea how up-to-date it is.
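
To illustrate the SAX idea mentioned above, here is a minimal sketch of an event-based scanner that reports each opening tag as soon as it is complete, without building a DOM. Everything in it is hypothetical: `createTagScanner` is not part of Cheerio, htmlparser2, or sax-js, and it deliberately ignores real HTML edge cases (a `>` inside attribute values, comments, or script content). It only shows why a chunk boundary in the middle of a tag forces the parser to keep a small amount of pending input.

```javascript
'use strict';

// Hypothetical helper, not a real library API: a tiny SAX-style scanner.
// Simplification: assumes '>' never occurs inside attribute values,
// comments, or script content.
function createTagScanner(onOpenTag) {
  let pending = ''; // input that has been received but not yet consumed

  return {
    write(chunk) {
      pending += chunk;
      let start;
      while ((start = pending.indexOf('<')) !== -1) {
        const end = pending.indexOf('>', start);
        if (end === -1) {
          // The tag is split across chunks: keep it and wait for more input.
          pending = pending.slice(start);
          return;
        }
        const tag = pending.slice(start + 1, end);
        // Only opening tags start with a letter; this skips closing tags,
        // comments, and doctype declarations.
        const name = /^[a-zA-Z][^\s\/>]*/.exec(tag);
        if (name) {
          onOpenTag(name[0].toLowerCase(), tag);
        }
        pending = pending.slice(end + 1);
      }
      // No '<' left, so the remaining plain text can be discarded.
      pending = '';
    },
  };
}
```

Each call to `write` corresponds to one `'data'` event of a stream. The callback fires per tag, so memory use stays bounded by the longest single tag rather than by the size of the document.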

2 reactions
coryarmbrecht commented, Feb 7, 2018

Glad to see there’s development here! I just hit this snag while changing my sync node script to streams. @ComFreek I have looked at your nested links, but they are beyond my knowledge.

Is fragment support a requirement for streaming to Cheerio selectors, like $('a.new-link').each? I guess it comes down to how chunks are separated, and it makes sense that you need to wait for certain tags (large containers) to be closed.

If I wanted to start going in your direction and try to get Cheerio to work with streams (I was thinking of a through stream), where should I start? It sounds like without fragment support, I can’t just do something like:

const links = []
let readStream = fs.createReadStream(htmlFile);
let chunks = []

// Listen for data
readStream.on('data', chunk => {
    //chunks.push(chunk)
    $('a.new-link').each(function(i, elem) {
        links[i] = elem
    })
});
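
For comparison, the simplest fallback is to collect every chunk first and parse once at the end. The sketch below assumes that holding the whole document in memory is acceptable, which gives up the memory benefit of streaming. `readAll` is a hypothetical helper name, not a Cheerio or Node.js API.

```javascript
'use strict';

// Hypothetical helper: reads a readable stream to completion and returns
// its contents as one UTF-8 string. The whole document is held in memory.
async function readAll(stream) {
  const chunks = [];
  for await (const chunk of stream) {
    // Streams may yield strings or Buffers depending on their encoding.
    chunks.push(Buffer.isBuffer(chunk) ? chunk : Buffer.from(chunk));
  }
  return Buffer.concat(chunks).toString('utf8');
}
```

The resulting string could then be passed to cheerio.load(html) and queried with $('a.new-link') as usual, because the selector runs against the complete document rather than against a single chunk.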