Duplicated URLs are crawled twice
What is the current behavior?
Duplicated urls are not skipped. The same url is crawled twice.
If the current behavior is a bug, please provide the steps to reproduce
const HCCrawler = require('./lib/hccrawler');

(async () => {
  const crawler = await HCCrawler.launch({
    evaluatePage: () => ({
      title: document.title,
    }),
    // Log each result; a duplicate shows up as a second result for the same URL.
    onSuccess: (result) => {
      console.log(result);
    },
    skipDuplicates: true,
    jQuery: false,
    maxDepth: 3,
    args: ['--no-sandbox'],
  });
  // Queue the same URL twice to reproduce the problem.
  await crawler.queue([{
    url: 'https://www.example.com/',
  }, {
    url: 'https://www.example.com/',
  }]);
  await crawler.onIdle();
  await crawler.close();
})();
What is the expected behavior?
Crawled URLs should be skipped even if they come from the queue.
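Until the crawler skips these itself, one possible workaround is to drop repeated URLs before they reach `crawler.queue()`. The sketch below is only an illustration: `dedupeRequests` is a hypothetical helper and not part of the crawler library, and it assumes each queue entry is either a URL string or an object with a `url` property, as in the reproduction above.

```javascript
// `URL` is required explicitly so the sketch also runs on older Node.js versions.
const { URL } = require('url');

// Hypothetical helper (not part of the crawler library): keeps the first
// occurrence of each URL and drops later repeats.
function dedupeRequests(requests) {
  const seen = new Set();
  const unique = [];
  for (const request of requests) {
    // Accept both plain URL strings and { url, ... } option objects.
    const url = typeof request === 'string' ? request : request.url;
    // Normalizing through URL makes trivially different spellings compare equal
    // (the host is lowercased and a bare origin gets a trailing slash).
    const key = new URL(url).href;
    if (!seen.has(key)) {
      seen.add(key);
      unique.push(request);
    }
  }
  return unique;
}

const requests = dedupeRequests([
  { url: 'https://www.example.com/' },
  { url: 'https://www.example.com/' },
  { url: 'https://WWW.example.com' },
]);
console.log(requests.length);
```

The filtered array can then be passed to `crawler.queue(requests)`. This only covers URLs handed over in a single call; it does not address links discovered while crawling.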
Please tell us about your environment:
- Version: latest
- Platform / OS version: CentOS 7.1
- Node.js version: v8.4.0
Issue Analytics
- State:
- Created 5 years ago
- Comments:6 (1 by maintainers)
Is anyone considering creating a PR?
Just posting here in the hope that it helps someone. It is true that duplicate URLs are crawled when concurrency > 1. So here is what I did.