Duplicates crawled data when limit is set
Subject of the issue
I crawl a fixed number of pages, e.g. 67. Each page contains 25 articles, except the last (67th) page, which contains 3, so the total number of articles is 1653. I want x-ray to crawl the articles from all pages by itself, without me setting the limit to the number of pages, so I set the limit to 'n':
function generateURLs() {
  x(url, '.h3 a', [{
    link: '@href'
  }])
  .paginate('.page a@href')
  .limit('n')
  .write('results.json');
}
In that case x-ray keeps crawling the pages indefinitely, producing duplicate data in results.json. If I instead set the limit to 67:
function generateURLs() {
  x(url, '.h3 a', [{
    link: '@href'
  }])
  .paginate('.page a@href')
  .limit(67)
  .write('results.json');
}
it generates 1675 articles instead of 1653. Is there any way to crawl all the pages without setting a limit? And if a limit is set, is there any way to get the crawled data without duplicates?
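For what it's worth, x-ray's .limit(n) is documented to take a number of requests, so the string 'n' presumably never satisfies the cap, which would explain the endless crawl. As for the duplicates, they can be stripped after the crawl finishes. A minimal sketch, assuming results.json holds a flat JSON array of { link } objects as produced by the snippets above (the dedupe step is plain Node, not part of x-ray's API, and is written in ES5 so it runs on the Node 4 version from the report):

// dedupe.js - drop entries whose link has already been seen.
// Assumes results.json is a JSON array like [{ "link": "..." }, ...],
// matching the selector used in the snippets above.
var fs = require('fs');

var articles = JSON.parse(fs.readFileSync('results.json', 'utf8'));
var seen = {};
var unique = articles.filter(function (article) {
  if (seen[article.link]) return false;
  seen[article.link] = true;
  return true;
});

fs.writeFileSync('results.json', JSON.stringify(unique, null, 2));
console.log('kept ' + unique.length + ' of ' + articles.length + ' articles');

Running node dedupe.js once after the crawl would leave one entry per unique link, regardless of how many times a page was revisited.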
Your environment
- version of node: v4.4.7
- version of npm: 3.3.9
Issue Analytics
- State:
- Created 7 years ago
- Comments: 5 (2 by maintainers)
Top Results From Across the Web
How to Quickly Identify Duplicate Content With a Site Crawl
This article is a simple breakdown of how to go about using an SEO site crawler to quickly identify duplicate content.

Detecting Near-Duplicates for Web Crawling - Google Research
c) Data extraction: Given a moderate-sized collection of similar pages, say reviews at www.imdb.com, the goal is to identify the schema/DTD underlying the ...

gocolly: How to Prevent duplicate crawling, restrict to unique ...
I suspected the 'Parallelism: 2' was causing the duplicates, however, some of the crawl message urls repeated more than 10 times each.

Duplicate Content: 5 Myths and 5 Facts About How It Impacts ...
Technically, there's no set limit for how much duplicate content you can have. However, it's still worth minimizing the amount of duplicate content...

How To Check For Duplicate Content - Screaming Frog
'Near Duplicates' require calculation at the end of the crawl via post 'Crawl Analysis' for it to be populated with data.
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
I was having the same issue when I was using the following pattern:
But if I don’t use the function it works as expected.
I believe this is a matter of selector handling…
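The pattern referenced in that comment did not survive extraction, but for illustration, a hedged guess at the workaround being described, assuming "not using the function" means invoking x-ray once at the top level instead of inside a wrapper like generateURLs() (the url value here is a placeholder):

// Hypothetical top-level call, per the comment's workaround:
// the same chain as in the issue, just not wrapped in a function.
var Xray = require('x-ray');
var x = Xray();

var url = 'http://example.com/articles'; // placeholder start page

x(url, '.h3 a', [{
  link: '@href'
}])
.paginate('.page a@href')
.limit(67)
.write('results.json');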