
Duplicated URLs are crawled twice

See original GitHub issue

What is the current behavior?

Duplicate URLs are not skipped; the same URL is crawled twice.

If the current behavior is a bug, please provide the steps to reproduce

const HCCrawler = require('./lib/hccrawler');

(async () => {
  const crawler = await HCCrawler.launch({
    evaluatePage: () => ({
      title: document.title,
    }),
    onSuccess: (result) => {
      console.log(result);
    },
    skipDuplicates: true,
    jQuery: false,
    maxDepth: 3,
    args: ['--no-sandbox'],
  });

  // The same URL is queued twice; with skipDuplicates: true the second
  // entry should be skipped, but both get crawled.
  await crawler.queue([{
    url: 'https://www.example.com/',
  }, {
    url: 'https://www.example.com/',
  }]);

  await crawler.onIdle();
  await crawler.close();
})();

What is the expected behavior?

Already-crawled URLs should be skipped even when the duplicates come from the queue.

Please tell us about your environment:

  • Version: latest
  • Platform / OS version: CentOS 7.1
  • Node.js version: v8.4.0

Issue Analytics

  • State: open
  • Created 5 years ago
  • Comments: 6 (1 by maintainers)

Top GitHub Comments

2 reactions
anton-k-git commented, Oct 17, 2020

Is anyone considering creating a PR?

0 reactions
iamprageeth commented, Jun 19, 2022

Just posting here hoping this helps someone. It's true that duplicate URLs get crawled when concurrency > 1. Here is what I did (a code sketch follows the list).

  1. First, create a SQLite database.
  2. In the RequestStarted event, insert the current URL.
  3. In the preRequest function (you can pass this function along with the options object), check whether a record for the current URL exists. If it does, the URL has already been crawled or is still being crawled, so return false and the URL will be skipped.
  4. In the RequestRetried and RequestFailed events, delete the URL, which allows the crawler to try it again.
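
Below is a minimal sketch of that approach. It assumes the better-sqlite3 package (synchronous, so preRequest can return a plain boolean); the database file, table schema, and the urlOf helper are illustrative, and the lowercase event names ('requeststarted', 'requestretried', 'requestfailed') follow headless-chrome-crawler's documented events. The exact shape of the event payloads is an assumption, so the helper accepts both forms.

const HCCrawler = require('headless-chrome-crawler');
const Database = require('better-sqlite3'); // assumed dependency

const db = new Database('crawled.db');
db.exec('CREATE TABLE IF NOT EXISTS seen (url TEXT PRIMARY KEY)');

const insertUrl = db.prepare('INSERT OR IGNORE INTO seen (url) VALUES (?)');
const findUrl = db.prepare('SELECT 1 FROM seen WHERE url = ?');
const deleteUrl = db.prepare('DELETE FROM seen WHERE url = ?');

// Events may pass the queue options directly or wrapped in an error object;
// handle both, since the payload shape is an assumption here.
const urlOf = (payload) => (payload.options ? payload.options.url : payload.url);

(async () => {
  const crawler = await HCCrawler.launch({
    // Step 3: skip any URL we have already started or finished crawling.
    preRequest: (options) => !findUrl.get(options.url),
    onSuccess: (result) => console.log(result.options.url),
  });

  // Step 2: record the URL as soon as its request starts.
  crawler.on('requeststarted', (options) => insertUrl.run(options.url));
  // Step 4: forget the URL on retry/failure so it can be attempted again.
  crawler.on('requestretried', (payload) => deleteUrl.run(urlOf(payload)));
  crawler.on('requestfailed', (payload) => deleteUrl.run(urlOf(payload)));

  await crawler.queue(['https://www.example.com/', 'https://www.example.com/']);
  await crawler.onIdle();
  await crawler.close();
  db.close();
})();

The INSERT OR IGNORE together with the PRIMARY KEY on url keeps concurrent inserts of the same URL from failing, which is what matters once concurrency > 1.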

Top Results From Across the Web

Crawl logs showing Site URL crawled twice successfully ...
4. On the Build Your Query page, go to the "SETTINGS" page and select "Remove duplicates" checkbox.

How to Fix the Dreaded Duplicate URL in Google Analytics
Solve for duplicate URLs with/without trailing slashes in your Google ... want to exclude "size" from being crawled in your Google Search Console...

Avoiding duplicate results
Pages are only crawled once. First: The same URL can never be indexed twice. If two results look alike, you'll see that their...

Duplicate Content: Why does it happen and how to fix issues
Duplicate content is content that appears on the Internet in more than one place. That "one place" is defined as a location with...

URL Canonicalization and the Canonical Tag | Documentation
Google will choose one URL as the canonical version and crawl that, and all other URLs will be considered duplicate URLs and crawled...
