Duplicates crawled data when limit is set

See original GitHub issue

Subject of the issue

I crawl a fixed number of pages, e.g. 67. Each page contains 25 articles, except the last (67th) page, which contains 3, for a total of 1653 articles. I want x-ray to crawl the articles from all pages by itself, without me setting the limit to the number of pages, so I set the limit to 'n':

var Xray = require('x-ray');
var x = Xray();

function generateURLs(){
  // url is the listing page to crawl (defined elsewhere)
  x(url, '.h3 a', [{
    link: '@href'
  }])
    .paginate('.page a@href')
    .limit('n')          // note: 'n' here is the literal string, not a number
    .write('results.json');
}

In that case x-ray keeps crawling pages indefinitely, producing duplicate data in results.json. If instead I set the limit to 67:

function generateURLs(){
  x(url, '.h3 a', [{
    link: '@href'
  }])
    .paginate('.page a@href')
    .limit(67)           // same setup as above; the only change is the numeric limit
    .write('results.json');
}

it produces 1675 articles instead of 1653. Is there a way to crawl all pages without setting a limit? And if a limit is set, is there a way to get the crawled data without duplicates?
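
If the numeric limit overshoots like this, one workaround for the second question is to de-duplicate results.json by link once the crawl has finished. A minimal sketch in plain Node (ES5, so it runs on the Node 4 environment below), assuming each entry carries the link field used in the snippets above:

var fs = require('fs');

// Load the articles x-ray wrote out
var articles = JSON.parse(fs.readFileSync('results.json', 'utf8'));

// Keep only the first occurrence of each link
var seen = Object.create(null);
var unique = articles.filter(function (article) {
  if (seen[article.link]) return false;
  seen[article.link] = true;
  return true;
});

fs.writeFileSync('results.json', JSON.stringify(unique, null, 2));
console.log('kept %d of %d articles', unique.length, articles.length);

This does not prevent the extra requests; it only removes the duplicate entries from the output afterwards.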

Your environment

  • version of node: v4.4.7
  • version of npm: 3.3.9

Issue Analytics

  • State: closed
  • Created: 7 years ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

1 reaction
hielfx commented, Sep 19, 2016

I was having the same issue when I was using the following pattern:

xray(...)
(function(err,res){
    if(err){
        //stuff here
    }else{
        //more stuff there
    }
})
.paginate(...)
.limit(...)
.write(...);

But if I don’t use the function it works as expected.
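
Reading that comment, the difference seems to be when the crawl is started: invoking the instance with a callback kicks it off before .paginate() and .limit() are chained on, whereas .write() (or a trailing callback) only runs once pagination is configured. A minimal sketch of the two orderings, reusing the url and selectors from the issue:

// Order that produced duplicates for the commenter: the callback starts
// the crawl before .paginate()/.limit() are applied.
x(url, '.h3 a', [{ link: '@href' }])(function (err, res) {
  // handle err / res here
})
  .paginate('.page a@href')
  .limit(67);

// Order that works as expected: configure pagination first, then start
// the crawl with .write().
x(url, '.h3 a', [{ link: '@href' }])
  .paginate('.page a@href')
  .limit(67)
  .write('results.json');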

0 reactions
lathropd commented, Mar 27, 2019

I believe this is a matter of selector handling…


Top Results From Across the Web

How to Quickly Identify Duplicate Content With a Site Crawl
This article is a simple breakdown of how to go about using an SEO site crawler to quickly identify duplicate content.

Detecting Near-Duplicates for Web Crawling - Google Research
Data extraction: Given a moderate-sized collection of similar pages, say reviews at www.imdb.com, the goal is to identify the schema/DTD underlying the…

gocolly: How to Prevent duplicate crawling, restrict to unique ...
I suspected the 'Parallelism: 2' was causing the duplicates; however, some of the crawl message urls repeated more than 10 times each.

Duplicate Content: 5 Myths and 5 Facts About How It Impacts ...
Technically, there's no set limit for how much duplicate content you can have. However, it's still worth minimizing the amount of duplicate content…

How To Check For Duplicate Content - Screaming Frog
'Near Duplicates' require calculation at the end of the crawl via post 'Crawl Analysis' for it to be populated with data.
