Crawler instances are not disposed
Issue description
With the addition of the “Pausing” feature and its corresponding new warning messages (I now have to press CTRL+C twice each time I rebuild my crawler during development), it occurred to me that crawler instances don’t seem to be disposed unless you stop the Node process.
In the example below, the old crawler instances should, by my understanding of garbage collection in JS, have been garbage collected, unless the library somehow keeps references across instantiations. But I’m happy to learn if my understanding or the example is flawed.
Usage: install and run the example, then after a few cycles stop the process in the terminal with CTRL+C.
You will notice that all crawler instances, with their dynamically assigned names, are still there. The logging uses an instance method, not a static one (https://github.com/apify/crawlee/blob/49e270c71a07a821ee1e3aaabdff4f4d64d9fb6f/packages/basic-crawler/src/internals/basic-crawler.ts#L603), so each of those messages can only come from a crawler instance that is still alive.
This is a (potential) memory leak; the impact of course depends on how much data you store on your crawler instances and how long your process runs.
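Not part of the original report, but one way to verify such a claim independently of the pause warnings is Node’s FinalizationRegistry, whose callback only fires once an object has actually been collected. A minimal sketch, with a plain object standing in for the crawler (the real instance from the repro below can be registered the same way):

// Run with: node --expose-gc gc-check.mjs
const registry = new FinalizationRegistry((label) => {
    console.log(`${label} was garbage collected`)
})

let crawler = { placeholder: "stands in for a PlaywrightCrawler" }
registry.register(crawler, "PlaywrightCrawler#0")

crawler = null   // drop the last reference this script holds
globalThis.gc()  // force a collection (requires --expose-gc)

// Finalization callbacks run asynchronously, so keep the loop alive briefly;
// if the message never appears for a real crawler, something else is still
// holding a reference to it.
setTimeout(() => {}, 100)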
Ideas?
Code sample
{
  "name": "crawlee-reproduction",
  "type": "module",
  "dependencies": {
    "crawlee": "^3.1.1",
    "playwright": "^1.27.1"
  }
}
import Crawlee from "crawlee"

// Simulate a long-running task, like a worker process
let i = 0
while (true) {
    await crawlerDisposalTest(i)
    i++
}

async function crawlerDisposalTest(i) {
    const navigationQueue = await Crawlee.RequestQueue.open()
    await navigationQueue.addRequest({ url: "https://crawlee.dev/" })

    // For illustrative purposes only: rename the class so every instance
    // shows up in the logs with a distinguishable name
    Object.defineProperty(Crawlee.PlaywrightCrawler, "name", {
        writable: true,
        value: `PlaywrightCrawler#${i}`
    })

    let crawler = new Crawlee.PlaywrightCrawler({
        requestQueue: navigationQueue,
        postNavigationHooks: [
            ctx => {
                console.log(`Visit #${i}`)
            }
        ],
        requestHandler: async ctx => { },
    })

    await crawler.run()
    await crawler.teardown()

    // After this point, the program itself holds no references to the crawler
    crawler = null
    console.log(`No more references to PlaywrightCrawler#${i} in my program!`)

    await navigationQueue.drop()
}
Package version
3.1.1
Node.js version
v16.13.1
Operating system
Ubuntu 18.04.6 LTS
Priority this issue should have
High
Comments
So it seems that removing the listeners fixes the issue:
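The actual patch isn’t reproduced here; the following is only a sketch of the idea, with illustrative names (the event manager stand-in and pauseOnMigration are my placeholders, not Crawlee’s real API):

import { EventEmitter } from "node:events"

const events = new EventEmitter() // stands in for Crawlee's event manager

class BasicCrawlerSketch {
    async run() {
        // Bind once and keep the reference: a second .bind(this) would create
        // a different function object that .off() could not match later
        this.boundPauseOnMigration = this.pauseOnMigration.bind(this)
        events.on("migrating", this.boundPauseOnMigration)
        // ... crawl ...
    }

    pauseOnMigration() { /* persist state, pause the queue, ... */ }

    async teardown() {
        // Without this, the emitter's listener map keeps a reference to
        // `this` and the crawler instance can never be garbage collected
        events.off("migrating", this.boundPauseOnMigration)
        // ... existing teardown logic ...
    }
}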
Here are the heap profiles for the example code below:
On master: [heap profile screenshot] Note the blue bars; they represent the retained memory of the crawler instances, which cannot be garbage collected due to the references in the two event emitters’ listener maps.
With the above fix: [heap profile screenshot]
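To make the retention chain concrete (my illustration, not taken from the original comment): any handler bound to the instance keeps the whole instance reachable through the emitter’s internal listener list.

import { EventEmitter } from "node:events"

const events = new EventEmitter() // stands in for Crawlee's event manager
events.setMaxListeners(0)         // silence the max-listeners warning

class Crawler {
    constructor(name) {
        this.name = name
        this.payload = Buffer.alloc(10 * 1024 * 1024) // simulated footprint
        // The bound handler closes over `this`, so every Crawler stays
        // reachable via: events -> listener list -> handler -> this
        events.on("migrating", this.pause.bind(this))
    }
    pause() { console.log(`${this.name} pausing`) }
}

for (let i = 0; i < 100; i++) new Crawler(`Crawler#${i}`)
// All 100 instances (and their 10 MB payloads) are still retained:
console.log(events.listenerCount("migrating")) // -> 100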
I’m happy to send a pull request, but I don’t know whether this conflicts with anything else, so feedback is welcome.
There still seem to be “smaller” leaks, but these are not connected to this issue.
So the code that was intended to store images never runs, as Chrome/Chromium does not assign the resource type the way I thought it would (and we probably shouldn’t crawl unsplash.com without good reason), but none of that is important here.
The relevant takeaway from my flawed attempt to increase memory consumption is that the crawler instances have quite a sizeable memory footprint even if you don’t store any data on them. The above (updated!) example leaks 1.117 GB on my machine over 100 cycles, i.e. roughly 11 MB per crawler instance.
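The updated example itself is not included in this mirror; a hypothetical reconstruction of the image-storing attempt described above (the hook body, the resourceType check and the storedImages array are my guesses, not the author’s code) might look like:

// Hypothetical: replaces the crawler construction inside crawlerDisposalTest()
const storedImages = []

let crawler = new Crawlee.PlaywrightCrawler({
    requestQueue: navigationQueue,
    preNavigationHooks: [
        async ({ page }) => {
            page.on("response", async (response) => {
                // Chromium does not classify these responses as "image"
                // as expected, so this branch never executes
                if (response.request().resourceType() === "image") {
                    storedImages.push(await response.body())
                }
            })
        }
    ],
    requestHandler: async () => {},
})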