Guarantee a Cheerio.load(dom) overload

Since there is no built-in stream-reading method in Cheerio (see the discussion), I have built my own:

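// Assumption: `htmlparser` refers to the htmlparser2 package, which provides DomHandler.
const htmlparser = require('htmlparser2');
const cheerio = require('cheerio');

// Parse an HTML stream and resolve with a Cheerio instance built from the DOM.
function fromStream(stream) {
	return new Promise((resolve, reject) => {
		// DomHandler invokes the callback once the parser has seen the whole document.
		const parser = new htmlparser.Parser(new htmlparser.DomHandler((err, dom) => {
			if (err) {
				reject(err);
			} else {
				resolve(cheerio.load(dom)); // <-- Not public API!
			}
		}));

		// Feed the stream into the parser; reject on errors from either side.
		stream.on('error', reject)
			.pipe(parser)
			.on('error', reject);
	});
}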

Even though the call cheerio.load(dom) works*, it actually does not conform to Cheerio’s public API, which states that load only accepts a string (cf. README, code).

Could the public API be extended to include a Cheerio.load(dom) overload, where dom is a DOM tree compatible with the output produced by htmlparser.DomHandler?

*) see https://github.com/IonicaBizau/scrape-it/issues/83#issuecomment-353850115.
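
For comparison, a workaround that stays within the documented string-only signature is to buffer the whole stream first and pass the resulting string to load. The sketch below uses only Node's standard library; the helper name streamToString is hypothetical, and the commented usage line assumes cheerio is installed. It gives up the memory benefit that the DOM-based approach above is after, which is the trade-off motivating this request.

```javascript
const { Readable } = require('node:stream');

// Hypothetical helper: collect every chunk of a readable stream into one string.
// Trade-off: the whole document is held in memory before parsing can start.
function streamToString(stream) {
  return new Promise((resolve, reject) => {
    const chunks = [];
    // Chunks may arrive as strings or Buffers; normalize to Buffers.
    stream.on('data', (chunk) => chunks.push(Buffer.from(chunk)));
    stream.on('error', reject);
    // Decode once at the end so multi-byte characters split across
    // chunk boundaries are reassembled correctly.
    stream.on('end', () => resolve(Buffer.concat(chunks).toString('utf8')));
  });
}

// Hypothetical usage (assumes cheerio is installed and 'page.html' exists):
//   const html = await streamToString(fs.createReadStream('page.html'));
//   const $ = cheerio.load(html); // string input, i.e. the documented API
```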

Issue Analytics

Created 6 years ago
Reactions: 5
Comments: 8 (2 by maintainers)

Top GitHub Comments

3 reactions

ComFreek commented, Feb 8, 2018

@coryarmbrecht The streams I mentioned above and (afaik) parse5’s ParserStream only solve one problem: without a streaming approach, you would have to hold all the HTML in memory. Why would you need to store all the HTML in memory if you were feeding it into the parser chunk by chunk anyway?

What you are describing is called SAX parsing in the case of XML, for example. A quick search turned up sax-js, but I have no idea how up-to-date it is.
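
To make the SAX idea concrete, here is a deliberately naive sketch of event-based scanning using only Node's standard library. It is not sax-js and not a spec-compliant HTML parser: the class TinyTagScanner, the helper parseAttributes, and the 'opentag' event name are all made up for illustration, and it ignores comments, scripts, self-closing tags, and attribute values containing '>'. What it does show is the core technique: emit an event per start tag as input arrives, and keep only the unfinished tail between writes so a tag split across two chunks is still recognized.

```javascript
const { EventEmitter } = require('node:events');

// Hypothetical helper: extract name="value" pairs from the text inside a tag.
function parseAttributes(source) {
  const attributes = {};
  const attrPattern = /([^\s=]+)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"']+))/g;
  let match;
  while ((match = attrPattern.exec(source)) !== null) {
    attributes[match[1].toLowerCase()] = match[2] ?? match[3] ?? match[4];
  }
  return attributes;
}

// Naive, illustrative scanner: emits 'opentag' for each complete start tag.
class TinyTagScanner extends EventEmitter {
  constructor() {
    super();
    this.pending = ''; // unfinished tail carried over to the next write
  }

  write(chunk) {
    this.pending += String(chunk);
    const tagPattern = /<([a-zA-Z][\w-]*)((?:\s[^<>]*)?)>/g;
    let match;
    while ((match = tagPattern.exec(this.pending)) !== null) {
      this.emit('opentag', {
        name: match[1].toLowerCase(),
        attributes: parseAttributes(match[2]),
      });
    }
    // If the last '<' has no closing '>', the tag is incomplete: keep it
    // for the next chunk. Everything else has been handled and is dropped.
    const lastOpen = this.pending.lastIndexOf('<');
    const incomplete = lastOpen !== -1 && this.pending.indexOf('>', lastOpen) === -1;
    this.pending = incomplete ? this.pending.slice(lastOpen) : '';
  }
}

// Usage: collect hrefs of <a class="new-link"> even when a tag spans two chunks.
const links = [];
const scanner = new TinyTagScanner();
scanner.on('opentag', (tag) => {
  const classes = (tag.attributes.class || '').split(/\s+/);
  if (tag.name === 'a' && classes.includes('new-link')) {
    links.push(tag.attributes.href);
  }
});
scanner.write('<p><a class="new-link" hr'); // chunk ends mid-tag
scanner.write('ef="/x">hi</a>');            // tag completes in the next chunk
```

Memory use stays proportional to the longest unfinished tag rather than to the whole document, but there is no tree to run selectors against; that is the trade-off between SAX-style events and a DOM.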

2 reactions

coryarmbrecht commented, Feb 7, 2018

Glad to see there’s development here! I just hit this snag while converting my synchronous Node script to streams. @ComFreek I have looked at your nested links, but they are beyond my knowledge.

Is fragment support a requirement for streaming to Cheerio selectors, like $('a.new-link').each? I guess it comes down to how chunks are separated, and it makes sense that you need to wait for certain tags (large containers) to be closed.

If I wanted to start going in your direction and try to get Cheerio to work with streams (I was thinking of a through stream), where should I start? It sounds like without fragment support, I can’t just do something like:

const fs = require('fs');

const links = [];
const readStream = fs.createReadStream(htmlFile);
const chunks = [];

// Listen for data
readStream.on('data', chunk => {
    //chunks.push(chunk)
    // `$` is not bound to any loaded document here, which is the problem
    $('a.new-link').each(function(i, elem) {
        links[i] = elem;
    });
});