Duplicates crawled data when limit is set

See original GitHub issue

Subject of the issue

I crawl a fixed number of pages, e.g. 67. Each page contains 25 articles, except the last (67th) page, which contains 3, for a total of 1653 articles. I want x-ray to crawl the articles from all pages by itself, without me setting the limit to the number of pages, so I set the limit to 'n':

var Xray = require('x-ray');
var x = Xray();

function generateURLs(){
  // url is the listing page to crawl (defined elsewhere)
  x(url, '.h3 a', [{
    link: '@href'
  }])
    .paginate('.page a@href')
    .limit('n')          // note: 'n' here is the literal string, not a number
    .write('results.json');
}

In that case x-ray keeps crawling pages indefinitely, producing duplicate data in results.json. If instead I set the limit to 67:

function generateURLs(){
  x(url, '.h3 a', [{
    link: '@href'
  }])
    .paginate('.page a@href')
    .limit(67)           // same setup as above; the only change is the numeric limit
    .write('results.json');
}

it produces 1675 articles instead of 1653. Is there a way to crawl all pages without setting a limit? And if a limit is set, is there a way to get the crawled data without duplicates?
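
If the numeric limit overshoots like this, one workaround for the second question is to de-duplicate results.json by link once the crawl has finished. A minimal sketch in plain Node (ES5, so it runs on the Node 4 environment below), assuming each entry carries the link field used in the snippets above:

var fs = require('fs');

// Load the articles x-ray wrote out
var articles = JSON.parse(fs.readFileSync('results.json', 'utf8'));

// Keep only the first occurrence of each link
var seen = Object.create(null);
var unique = articles.filter(function (article) {
  if (seen[article.link]) return false;
  seen[article.link] = true;
  return true;
});

fs.writeFileSync('results.json', JSON.stringify(unique, null, 2));
console.log('kept %d of %d articles', unique.length, articles.length);

This does not prevent the extra requests; it only removes the duplicate entries from the output afterwards.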

Your environment

  • version of node: v4.4.7
  • version of npm: 3.3.9

Issue Analytics

  • State: closed
  • Created: 7 years ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

1 reaction
hielfx commented, Sep 19, 2016

I was having the same issue when I was using the following pattern:

xray(...)
(function(err,res){
    if(err){
        //stuff here
    }else{
        //more stuff there
    }
})
.paginate(...)
.limit(...)
.write(...);

But if I don’t use the function it works as expected.
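
Reading that comment, the difference seems to be when the crawl is started: invoking the instance with a callback kicks it off before .paginate() and .limit() are chained on, whereas .write() (or a trailing callback) only runs once pagination is configured. A minimal sketch of the two orderings, reusing the url and selectors from the issue:

// Order that produced duplicates for the commenter: the callback starts
// the crawl before .paginate()/.limit() are applied.
x(url, '.h3 a', [{ link: '@href' }])(function (err, res) {
  // handle err / res here
})
  .paginate('.page a@href')
  .limit(67);

// Order that works as expected: configure pagination first, then start
// the crawl with .write().
x(url, '.h3 a', [{ link: '@href' }])
  .paginate('.page a@href')
  .limit(67)
  .write('results.json');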

0 reactions
lathropd commented, Mar 27, 2019

I believe this is a matter of selector handling…


Top Results From Across the Web

How to Quickly Identify Duplicate Content With a Site Crawl
This article is a simple breakdown of how to go about using an SEO site crawler to quickly identify duplicate content.

Detecting Near-Duplicates for Web Crawling - Google Research
Data extraction: Given a moderate-sized collection of similar pages, say reviews at www.imdb.com, the goal is to identify the schema/DTD underlying the…

gocolly: How to Prevent duplicate crawling, restrict to unique ...
I suspected the 'Parallelism: 2' was causing the duplicates; however, some of the crawl message urls repeated more than 10 times each.

Duplicate Content: 5 Myths and 5 Facts About How It Impacts ...
Technically, there's no set limit for how much duplicate content you can have. However, it's still worth minimizing the amount of duplicate content…

How To Check For Duplicate Content - Screaming Frog
'Near Duplicates' require calculation at the end of the crawl via post 'Crawl Analysis' for it to be populated with data.
