
Duplicated URLs are crawled twice

See original GitHub issue

What is the current behavior?

Duplicate URLs are not skipped; the same URL is crawled twice.

If the current behavior is a bug, please provide the steps to reproduce

const HCCrawler = require('./lib/hccrawler');

(async () => {
  const crawler = await HCCrawler.launch({
    evaluatePage: () => ({
      title: document.title,
    }),
    onSuccess: (result) => {
      console.log(result);
    },
    skipDuplicates: true,
    jQuery: false,
    maxDepth: 3,
    args: ['--no-sandbox'],
  });

  // The same URL is queued twice; with skipDuplicates: true the second
  // entry should be skipped, but both get crawled.
  await crawler.queue([{
    url: 'https://www.example.com/',
  }, {
    url: 'https://www.example.com/',
  }]);

  await crawler.onIdle();
  await crawler.close();
})();

What is the expected behavior?

Already-crawled URLs should be skipped even when the duplicates come from the queue.

Please tell us about your environment:

  • Version: latest
  • Platform / OS version: CentOS 7.1
  • Node.js version: v8.4.0

Issue Analytics

  • State: open
  • Created 5 years ago
  • Comments: 6 (1 by maintainers)

Top GitHub Comments

2 reactions
anton-k-git commented, Oct 17, 2020

Is anyone considering creating a PR?

0 reactions
iamprageeth commented, Jun 19, 2022

Just posting here hoping this helps someone. It's true that duplicate URLs get crawled when concurrency > 1. Here is what I did (a code sketch follows the list).

  1. First, create a SQLite database.
  2. In the RequestStarted event, insert the current URL.
  3. In the preRequest function (you can pass this function along with the options object), check whether a record for the current URL exists. If it does, the URL has already been crawled or is still being crawled, so return false and the URL will be skipped.
  4. In the RequestRetried and RequestFailed events, delete the URL, which allows the crawler to try it again.
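
Below is a minimal sketch of that approach. It assumes the better-sqlite3 package (synchronous, so preRequest can return a plain boolean); the database file, table schema, and the urlOf helper are illustrative, and the lowercase event names ('requeststarted', 'requestretried', 'requestfailed') follow headless-chrome-crawler's documented events. The exact shape of the event payloads is an assumption, so the helper accepts both forms.

const HCCrawler = require('headless-chrome-crawler');
const Database = require('better-sqlite3'); // assumed dependency

const db = new Database('crawled.db');
db.exec('CREATE TABLE IF NOT EXISTS seen (url TEXT PRIMARY KEY)');

const insertUrl = db.prepare('INSERT OR IGNORE INTO seen (url) VALUES (?)');
const findUrl = db.prepare('SELECT 1 FROM seen WHERE url = ?');
const deleteUrl = db.prepare('DELETE FROM seen WHERE url = ?');

// Events may pass the queue options directly or wrapped in an error object;
// handle both, since the payload shape is an assumption here.
const urlOf = (payload) => (payload.options ? payload.options.url : payload.url);

(async () => {
  const crawler = await HCCrawler.launch({
    // Step 3: skip any URL we have already started or finished crawling.
    preRequest: (options) => !findUrl.get(options.url),
    onSuccess: (result) => console.log(result.options.url),
  });

  // Step 2: record the URL as soon as its request starts.
  crawler.on('requeststarted', (options) => insertUrl.run(options.url));
  // Step 4: forget the URL on retry/failure so it can be attempted again.
  crawler.on('requestretried', (payload) => deleteUrl.run(urlOf(payload)));
  crawler.on('requestfailed', (payload) => deleteUrl.run(urlOf(payload)));

  await crawler.queue(['https://www.example.com/', 'https://www.example.com/']);
  await crawler.onIdle();
  await crawler.close();
  db.close();
})();

The INSERT OR IGNORE together with the PRIMARY KEY on url keeps concurrent inserts of the same URL from failing, which is what matters once concurrency > 1.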

Top Results From Across the Web

Crawl logs showing Site URL crawled twice successfully ...
4. On the Build Your Query page, go to the "SETTINGS" page and select "Remove duplicates" checkbox.

How to Fix the Dreaded Duplicate URL in Google Analytics
Solve for duplicate URLs with/without trailing slashes in your Google ... want to exclude "size" from being crawled in your Google Search Console...

Avoiding duplicate results
Pages are only crawled once. First: The same URL can never be indexed twice. If two results look alike, you'll see that their...

Duplicate Content: Why does it happen and how to fix issues
Duplicate content is content that appears on the Internet in more than one place. That "one place" is defined as a location with...

URL Canonicalization and the Canonical Tag | Documentation
Google will choose one URL as the canonical version and crawl that, and all other URLs will be considered duplicate URLs and crawled...
