
Crawler instances are not disposed


Which package is this bug report for? If unsure which one to select, leave blank

@crawlee/core

Issue description

With the addition of the “Pausing” feature and its corresponding new warning messages, I noticed that I now have to press CTRL+C twice each time I rebuild my crawler during development. This suggests that crawler instances are not disposed unless you stop the node process.

In the example below, the old crawler instances should, by my understanding of garbage collection in JS, have been garbage collected, unless the lib somehow retains references across instantiations. But I’m happy to learn if my understanding or the example is flawed.

Usage: Install the dependencies and run the example, then after some cycles stop the process in the terminal with CTRL+C.

You will notice that all crawler instances, with their dynamically assigned names, are still there. The logging uses an instance method, not a static method (https://github.com/apify/crawlee/blob/49e270c71a07a821ee1e3aaabdff4f4d64d9fb6f/packages/basic-crawler/src/internals/basic-crawler.ts#L603).

This is a (potential) memory leak; the impact of course depends on how much data you store on crawler instances and how long your process runs.
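To make the suspected mechanism concrete, here is a minimal, self-contained sketch (not Crawlee code; `FakeCrawler` and the `migrating` event name are made up for illustration) of how a long-lived emitter keeps otherwise unreferenced instances alive:

```javascript
import { EventEmitter } from "node:events";

// A long-lived emitter, standing in for a process-wide event manager.
const events = new EventEmitter();
events.setMaxListeners(0); // silence the MaxListenersExceededWarning

class FakeCrawler {
  constructor(id) {
    this.id = id;
    // The arrow function closes over `this`, so the emitter's internal
    // listener list keeps every FakeCrawler instance reachable.
    this.onMigrating = () => console.log(`crawler #${this.id} pausing`);
    events.on("migrating", this.onMigrating);
  }
  teardown() {
    // Removing the listener is what makes the instance collectable again.
    events.off("migrating", this.onMigrating);
  }
}

for (let i = 0; i < 5; i++) new FakeCrawler(i); // no teardown: 5 retained
console.log(events.listenerCount("migrating")); // 5

const c = new FakeCrawler(99);
c.teardown();
console.log(events.listenerCount("migrating")); // back to 5, #99 is collectable
```

Even after the loop variables go out of scope, the five untorn-down instances stay reachable through the emitter's listener list, which is exactly the shape of leak suspected here.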

Ideas?

Code sample

package.json:

{
  "name": "crawlee-reproduction",
  "type": "module",
  "dependencies": {
    "crawlee": "^3.1.1",
    "playwright": "^1.27.1"
  }
}
The crawler script (ESM):

import Crawlee from "crawlee"
// Simulate a long running task, like a worker process
let i = 0
while(true) {
  await crawlerDisposalTest(i)
  i++
}
async function crawlerDisposalTest(i) {
  const navigationQueue = await Crawlee.RequestQueue.open()
  await navigationQueue.addRequest({ url: "https://crawlee.dev/" })
  // For illustrative purposes only
  Object.defineProperty(Crawlee.PlaywrightCrawler, "name", {
    writable: true,
    value: `PlaywrightCrawler#${i}`
  })
  let crawler = new Crawlee.PlaywrightCrawler({
    requestQueue: navigationQueue,
    postNavigationHooks: [
      ctx => { 
        console.log(`Visit #${i}`)
      }
    ],
    requestHandler: async ctx => { },
  })
  await crawler.run()
  await crawler.teardown()
  crawler = null
  console.log(`No more references to PlaywrightCrawler#${i} in my program!`)
  await navigationQueue.drop()
}

Package version

3.1.1

Node.js version

v16.13.1

Operating system

Ubuntu 18.04.6 LTS

Priority this issue should have

High

Issue Analytics

  • State: closed
  • Created 10 months ago
  • Comments:8 (1 by maintainers)

Top GitHub Comments

matjaeck commented, Nov 15, 2022 (4 reactions)

So it seems that removing the listeners fixes the issue:

diff --git a/packages/basic-crawler/src/internals/basic-crawler.ts b/packages/basic-crawler/src/internals/basic-crawler.ts
index 0bd87c44..592f15bf 100644
--- a/packages/basic-crawler/src/internals/basic-crawler.ts
+++ b/packages/basic-crawler/src/internals/basic-crawler.ts
@@ -1174,6 +1174,8 @@ export class BasicCrawler<Context extends CrawlingContext = BasicCrawlingContext
         }

         await this.autoscaledPool?.abort();
+        this.events.removeAllListeners();
+        process.removeAllListeners('SIGINT');
     }

     protected _handlePropertyNameChange<New, Old>({
diff --git a/packages/core/src/events/event_manager.ts b/packages/core/src/events/event_manager.ts
index 184abf1a..28027e95 100644
--- a/packages/core/src/events/event_manager.ts
+++ b/packages/core/src/events/event_manager.ts
@@ -105,4 +105,8 @@ export abstract class EventManager {
     waitForAllListenersToComplete() {
         return this.events.waitForAllListenersToComplete();
     }
+
+    removeAllListeners() {
+        return this.events.removeAllListeners();
+    }
 }
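One caveat with `process.removeAllListeners('SIGINT')` is that it also drops handlers registered by unrelated code in the same process. A more surgical alternative (a hypothetical sketch, not Crawlee's actual implementation; `PausableTask` is invented for illustration) is for each instance to keep a reference to its own handler and remove exactly that one in teardown:

```javascript
// Hypothetical sketch: each instance tracks its own SIGINT handler so
// teardown() removes exactly that listener, leaving handlers registered
// by other code in the process untouched.
class PausableTask {
  start() {
    this.sigintHandler = () => this.pause(); // closes over `this`
    process.on("SIGINT", this.sigintHandler);
  }
  pause() {
    console.log("pausing gracefully...");
  }
  teardown() {
    process.off("SIGINT", this.sigintHandler);
    this.sigintHandler = undefined; // drop the closure so GC can reclaim us
  }
}

const task = new PausableTask();
task.start();
console.log(process.listenerCount("SIGINT") >= 1); // our handler is registered
task.teardown();
```

This avoids the "conflicts with anything else" concern raised below, at the cost of bookkeeping one handler reference per instance.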

Here are the heap profiles for the example code below:

import os from "os"
import Crawlee from "crawlee"

// delay before the cycles start (e.g. to attach a profiler)
await new Promise(resolve => setTimeout(resolve, 5000))

for (let i = 0; i < 3; i++)
  await crawlerDisposalTest(i)

async function crawlerDisposalTest(i) {
  const navigationQueue = await Crawlee.RequestQueue.open()
  await navigationQueue.addRequest({ url: "https://google.de" })
  // For illustrative purposes only
  Object.defineProperty(Crawlee.PlaywrightCrawler, "name", {
    writable: true,
    value: `PlaywrightCrawler#${i}`
  })
  let crawler = new Crawlee.PlaywrightCrawler({
    requestQueue: navigationQueue,
    requestHandler: async ctx => { }
  })
  await crawler.run()
  await crawler.teardown()
  crawler = null
  console.log(`Available system memory: ${os.freemem()} bytes.`)
  await navigationQueue.drop()
}

On master: [heap profile screenshot] Note the blue bars; they represent the retained memory of the crawler instances, which cannot be garbage collected due to the references in the two event emitters' events maps.

With the above fix: [heap profile screenshot]

I’m happy to send a pull request, but I don’t know if this conflicts with anything else, so feedback is welcome.

There still seem to be “smaller” leaks, but these are not connected to this issue.

matjaeck commented, Nov 14, 2022 (1 reaction)

So the code that was intended to store images never runs, as Chrom(ium) does not assign the resource type the way I thought it would, and we probably shouldn’t crawl unsplash.com without good reason, but none of that is important here.

The relevant takeaway from my flawed attempt to increase memory consumption is that the crawler instances have quite a sizeable memory footprint even if you don’t store any data on them. The above (updated!) example leaks 1.117 GB on my machine over 100 cycles.
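For a rough cross-check without a heap profiler, per-cycle heap growth can also be sampled with `process.memoryUsage()`. A sketch (the `runOneCycle` placeholder stands in for one `crawlerDisposalTest` iteration, which is not included here):

```javascript
// Samples heapUsed after each cycle. The numbers are noisy without a
// forced GC, but a steady upward trend across many cycles is a strong
// hint that something retains the per-cycle instances.
function heapUsedMB() {
  return process.memoryUsage().heapUsed / 1024 / 1024;
}

function runOneCycle(i) {
  // placeholder for one crawlerDisposalTest(i) iteration
}

const samples = [];
for (let i = 0; i < 3; i++) {
  runOneCycle(i);
  samples.push(heapUsedMB());
}
console.log(samples.map(m => m.toFixed(1) + " MB").join(", "));
```

A flat series of samples after the fix, versus a climbing one on master, would corroborate the heap snapshots above without DevTools.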
