De-Duplicate documents in CrawlDB (Solr)
Since we now have a more sophisticated definition of the `id` field (with a timestamp included), we have to think about de-duplication of documents. I am opening a discussion here to define de-duplication. Some of the suggestions are:
- Compare the SHA-256 hash of the `raw_content`, i.e. the `signature` field (but this forces fetching the duplicate document even though we are not storing it)
- Compare the `url` field
We can refer here for the implementation.
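For the first suggestion, here is a minimal sketch of what signature-based de-duplication could look like. It assumes a Solr core named `crawldb` on localhost and uses the `pysolr` client; the core name, URL, and client choice are assumptions for illustration only (Sparkler itself runs on the JVM).

```python
import hashlib

import pysolr

# Assumed CrawlDB location; adjust to the real Solr core/collection.
solr = pysolr.Solr("http://localhost:8983/solr/crawldb", timeout=10)


def content_signature(raw_content: bytes) -> str:
    """SHA-256 digest of the fetched raw_content, stored in the `signature` field."""
    return hashlib.sha256(raw_content).hexdigest()


def is_duplicate_content(raw_content: bytes) -> bool:
    """Return True if a document with the same signature is already indexed.

    Note the drawback mentioned above: the page has to be fetched before the
    signature can be computed, even if the duplicate is then discarded.
    """
    sig = content_signature(raw_content)
    # The hex digest is [0-9a-f] only, so it is safe to embed in the query string.
    results = solr.search("signature:%s" % sig, rows=0)
    return results.hits > 0
```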
Issue Analytics
- Created 7 years ago
- Comments: 16 (16 by maintainers)
Agreed, you can dedup by `crawlid+url`.
We need to flesh out more details on how this will be implemented. I am open to starting a discussion on this if it is on the timeline right now; otherwise we can defer it until it comes under development.
Better for the `dedupe_id`.
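A sketch of the `crawlid+url` suggestion, assuming the de-dup id is simply a stable hash over those two values (the parameter names and the separator are illustrative choices, not Sparkler's actual schema):

```python
import hashlib


def dedupe_id(crawl_id: str, url: str) -> str:
    """Stable id derived from crawlid+url, so re-discovering the same URL
    within the same crawl maps onto the same CrawlDB document."""
    # "|" is an arbitrary, unambiguous separator between the two parts.
    key = "%s|%s" % (crawl_id, url)
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


# e.g. dedupe_id("crawl-2016-08", "http://example.com/") yields the same id
# every time that URL is rediscovered within the same crawl.
```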
Let me elaborate on my point. There are two types of de-duplication.
De-duplication of Outlinks: This is what we are discussing here. The objective is to de-duplicate outlinks so that we don't end up crawling the same page again and again. For example, if every page of a website points back to its home page, we would like to remove the home page URL from the outlinks so that Sparkler doesn't fetch it again.
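As an illustration of this first kind, here is a sketch that drops outlinks already known to the CrawlDB before they are queued for fetching. The Solr core name, the `url` field lookup, and the naive quoting are assumptions; a real implementation would also normalize URLs first.

```python
from typing import List

import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/crawldb", timeout=10)


def filter_new_outlinks(outlinks: List[str]) -> List[str]:
    """Keep only outlinks whose URL is not yet present in the CrawlDB."""
    fresh = []
    seen_in_batch = set()
    for url in outlinks:
        if url in seen_in_batch:          # duplicate within the same page
            continue
        seen_in_batch.add(url)
        # Naive existence check; assumes the URL contains no double quotes.
        if solr.search('url:"%s"' % url, rows=0).hits == 0:
            fresh.append(url)
    return fresh
```

With something like this in place, the home-page URL from the example above would be filtered out on every page after it is first discovered.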
De-duplication of Content: This is, I think, what you are talking about. This applies when we are refreshing the crawl, or when we want the crawler to fetch a page again based on the `retry_interval_seconds` property. This is not implemented yet; when it is, we will add the newly fetched document to our index and it will have the same `dedupe_id`. We can handle this with different Solr handlers.
I was thinking along the lines of generalization and giving control to the user, i.e. letting them define which combination of schema fields makes up the de-duplication id. Let's push this back, because it was just a random thought and is not helping the issue.
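Purely to illustrate the generalization idea before it is deferred: a minimal sketch where the user configures which schema fields make up the de-dup id. The `dedupe.fields` config key and the example field names are hypothetical.

```python
import hashlib
from typing import Dict, List


def dedupe_id_from_fields(doc: Dict[str, object], fields: List[str]) -> str:
    """Build the de-dup id from a user-configured combination of schema fields."""
    key = "|".join(str(doc.get(field, "")) for field in fields)
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


# e.g. with a configuration such as dedupe.fields=crawl_id,url
doc = {"crawl_id": "crawl-2016-08", "url": "http://example.com/"}
print(dedupe_id_from_fields(doc, ["crawl_id", "url"]))
```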