cheerio+v8 "leaks" memory from original HTML
This is actually a bug in v8 (and probably any other JS engine), but it is particularly damaging in cheerio.
The v8 bug: https://code.google.com/p/v8/issues/detail?id=2869
In brief: let’s say I have code like this:
var cheerio = require('cheerio');
var hugeHtmlProducer = require('./hugeHtmlProducer');
var strings = [];
function handlePage(hugeHtml) {
  var $ = cheerio.load(hugeHtml);
  strings.push($('#tiny-string').text());
}
// hugeHtmlProducer.forEachAsync() loops like this:
// 1. fetch a huge HTML string
// 2. call the first callback on that HTML string
// 3. loop, dropping all references to the huge HTML string
// 4. call the second callback
hugeHtmlProducer.forEachAsync(handlePage, function() { process.exit(); });
Then the strings array is going to hold a tiny substring of each huge HTML string. Unfortunately, v8 will use that tiny substring as a reason to keep the entire HTML string in memory. As a result, the process runs out of memory.
You can see this in action using memwatch, at https://github.com/lloyd/node-memwatch, and dropping code like this at the top of your script:
var lastHeap = null;
var memwatch = require('memwatch');
memwatch.on('stats', function(info) {
  if (lastHeap) {
    var hd = lastHeap.end();
    console.log(JSON.stringify(hd, null, ' '));
  }
  console.log("base:" + (info.current_base / 1024 / 1024).toFixed(1) + "M fullGCs:" + info.num_full_gc + " incrGCs:" + info.num_inc_gc);
  lastHeap = new memwatch.HeapDiff();
});
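If memwatch is not an option, a rougher way to watch the same trend is to log heap usage after each page with Node's built-in process.memoryUsage(). This is only a sketch: the helper names toMB and logHeap are hypothetical, and the numbers are coarser than a memwatch heap diff.

```javascript
// Format a byte count as megabytes with one decimal place.
function toMB(bytes) {
  return (bytes / 1024 / 1024).toFixed(1) + 'M';
}

// Hypothetical helper: log the current heap usage with a label.
// process.memoryUsage() is built into Node; no extra module is needed.
function logHeap(label) {
  var mem = process.memoryUsage();
  var line = label + ' heapUsed:' + toMB(mem.heapUsed) +
    ' heapTotal:' + toMB(mem.heapTotal);
  console.log(line);
  return line;
}
```

Calling logHeap('page') at the end of handlePage should show heapUsed climbing steadily if the huge HTML strings are being retained.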
This is obviously a frustrating bug, as it isn’t cheerio’s place to second-guess v8. However, as it stands, huge memory leaks are the norm in cheerio.
A workaround: create an unleak function (such as the one at https://code.google.com/p/v8/issues/detail?id=2869) and use it by default in .text(), .attr() and similar methods. (To maintain speed, cheerio could use and provide a .leakyText() method or some-such which does what the current .text() method does.)
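A minimal sketch of what that workaround could look like in user code. It assumes the string-copy trick quoted in the comments on this issue. The names unleak and safeText are hypothetical and are not part of cheerio, and whether a real copy is made depends on the engine's string internals.

```javascript
// Hypothetical helper, not part of cheerio: concatenate and then cut the
// string so the result is built from a new, small string rather than
// pointing into the huge parent string.
function unleak(s) {
  return (' ' + s).substring(1);
}

// Hypothetical wrapper: `$` is whatever cheerio.load(html) returned.
// It reads the text as usual, then passes it through unleak().
function safeText($, selector) {
  return unleak($(selector).text());
}
```

In the handlePage example above, the push would become strings.push(safeText($, '#tiny-string')). The extra copy costs a little time per call, which is the trade-off described here.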
A more comprehensive workaround is to rebuild the DOM, so even if people access children[0].attribs.data they won’t leak memory. I imagine that would slow down the parser considerably.
Whatever the solution, I think the normal examples (the stuff people would write in howto guides) should not leak memory. To me, that’s more important than saving a few milliseconds.
Issue Analytics
- Created 10 years ago
- Reactions:2
- Comments:32 (7 by maintainers)
Top GitHub Comments
(' ' + string).substr(1) is shorter, easier to read, and faster. (And if you make a typo and write .substring(), it’ll work just as well.)
I believe it is still an issue in V8, so yeah, it affects Cheerio as well.