cheerio+v8 "leaks" memory from original HTML

This is actually a bug in v8 (and probably any other JS engine), but it is particularly damaging in cheerio.

The v8 bug: https://code.google.com/p/v8/issues/detail?id=2869

In brief: let’s say I have code like this:

var cheerio = require('cheerio');
var hugeHtmlProducer = require('./hugeHtmlProducer');

var strings = [];

function handlePage(hugeHtml) {
  var $ = cheerio.load(hugeHtml);
  strings.push($('#tiny-string').text());
}

// hugeHtmlProducer.forEachAsync() works like this:
// 1. fetch a huge HTML string
// 2. call the first callback on that HTML string
// 3. drop all references to the huge HTML string and loop
// 4. once the loop finishes, call the second callback
hugeHtmlProducer.forEachAsync(handlePage, function() { process.exit(); });

Then the strings array holds one tiny substring of each huge HTML string. Unfortunately, v8 represents such a substring as a slice that points into the original string, so each tiny substring keeps its entire HTML string alive. As a result, the process runs out of memory.

You can see this in action with memwatch (https://github.com/lloyd/node-memwatch) by dropping code like this at the top of your script:

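var lastHeap = null;
var memwatch = require('memwatch');

// memwatch emits 'stats' after each full garbage collection.
memwatch.on('stats', function(info) {
  if (lastHeap) {
    // Print what changed on the heap since the previous 'stats' event.
    var hd = lastHeap.end();
    console.log(JSON.stringify(hd, null, '  '));
  }
  console.log("base:" + (info.current_base / 1024 / 1024).toFixed(1) + "M fullGCs:" + info.num_full_gc + " incrGCs:" + info.num_inc_gc);
  // Start a new diff that the next 'stats' event will close.
  lastHeap = new memwatch.HeapDiff();
});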
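
For a rough check without a native module, Node's built-in process.memoryUsage() reports heap usage. The following is only a sketch and gives far less detail than the heap diff above; heapUsedMb and heapReport are hypothetical helper names, not part of memwatch or cheerio.

```javascript
// Hypothetical helper: current V8 heap usage in megabytes.
function heapUsedMb() {
  // process.memoryUsage() is built into Node; heapUsed is reported in bytes.
  return process.memoryUsage().heapUsed / 1024 / 1024;
}

// Hypothetical helper: one-line report, e.g. to log after every N pages.
function heapReport(label) {
  return label + ': heapUsed=' + heapUsedMb().toFixed(1) + 'M';
}

console.log(heapReport('startup'));
```

Calling heapReport() from inside handlePage every few hundred pages should show whether heap usage keeps climbing even though the strings array only holds tiny values.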

This is obviously a frustrating bug, as it isn’t cheerio’s place to second-guess v8. However, as it stands, huge memory leaks are the norm in cheerio.

A workaround: create an unleak function (such as the one at https://code.google.com/p/v8/issues/detail?id=2869) and use it by default in .text(), .attr() and similar methods. (To maintain speed, cheerio could also provide a .leakyText() method, or some such, that does what the current .text() method does.)

A more comprehensive workaround is to rebuild the DOM, so even if people access children[0].attribs.data they won’t leak memory. I imagine that would slow down the parser considerably.
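
As a rough illustration of what "rebuilding the DOM" could mean, the sketch below copies a tree and forces a fresh copy of every string on the way. The node shape ({ type, name, data, attribs, children }) is a simplified assumption, and copyString and rebuildNode are hypothetical names; real parser nodes typically also carry parent and sibling links, which this sketch omits. Whether the copy really detaches from the original string depends on the engine.

```javascript
// Assumption: concatenating and then slicing makes the engine allocate
// a new string instead of keeping a slice of the original.
function copyString(s) {
  return (' ' + s).substring(1);
}

// Hypothetical simplified node shape: { type, name, data, attribs, children }.
function rebuildNode(node) {
  var copy = { type: copyString(node.type) };
  if (typeof node.name === 'string') copy.name = copyString(node.name);
  if (typeof node.data === 'string') copy.data = copyString(node.data);
  if (node.attribs) {
    copy.attribs = {};
    Object.keys(node.attribs).forEach(function (key) {
      copy.attribs[copyString(key)] = copyString(node.attribs[key]);
    });
  }
  if (node.children) {
    // Recurse so that no string anywhere in the subtree is shared.
    copy.children = node.children.map(function (child) {
      return rebuildNode(child);
    });
  }
  return copy;
}

var original = {
  type: 'tag',
  name: 'div',
  attribs: { id: 'tiny-string' },
  children: [{ type: 'text', data: 'hello' }]
};
var rebuilt = rebuildNode(original);
console.log(rebuilt.children[0].data); // prints "hello"
```

Every string in the tree is copied once, which is the extra cost the paragraph above expects to slow the parser down.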

Whatever the solution, I think the normal examples (the stuff people would write in howto guides) should not leak memory. To me, that’s more important than saving a few milliseconds.

Issue Analytics

  • State: closed
  • Created 10 years ago
  • Reactions: 2
  • Comments:32 (7 by maintainers)

Top GitHub Comments

adamhooper commented, Feb 24, 2015 (2 reactions)

(' ' + string).substr(1) is shorter, easier to read, and faster. (And if you make a typo and write .substring(), it’ll work just as well.)
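
The expression above could be wrapped in a small helper along these lines. This is a sketch, not cheerio's API: the name unleak comes from the workaround proposed in the issue, the sample page is made up, and whether the engine really allocates an independent copy is an assumption that may vary by version.

```javascript
// Hypothetical helper: return a copy of `string` that (by assumption)
// no longer shares storage with the larger string it was sliced from.
function unleak(string) {
  // .substring() is used here; the comment above notes it works as well as .substr().
  return (' ' + string).substring(1);
}

// Made-up stand-in for a huge HTML page.
var needle = 'a tiny string worth keeping';
var page = '<html>' + new Array(1001).join('<p>filler</p>') +
  '<span id="tiny-string">' + needle + '</span></html>';

var start = page.indexOf(needle);
var tiny = unleak(page.substring(start, start + needle.length));
console.log(tiny); // prints "a tiny string worth keeping"
```

In the handlePage example from the issue, the equivalent call would be strings.push(unleak($('#tiny-string').text())).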

5saviahv commented, Mar 29, 2021 (1 reaction)

I believe it is still an issue in V8, so yes, it affects Cheerio as well.
