De-Duplicate documents in CrawlDB (Solr)
Since we now have a more sophisticated definition of the `id` field (with a timestamp included), we have to think about de-duplication of documents. I am opening a discussion here to define de-duplication. Some of the suggestions are:
- Compare the SHA-256 hash of the `raw_content`, i.e. the `signature` field (but this forces fetching the duplicate document even though we are not storing it)
- Compare the `url` field
We can refer here for the implementation.
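For the first suggestion, here is a minimal sketch of what signature-based de-duplication could look like. It assumes a Solr core named `crawldb` on localhost and uses the `pysolr` client; the core name, URL, and client choice are assumptions for illustration only (Sparkler itself runs on the JVM).

```python
import hashlib

import pysolr

# Assumed CrawlDB location; adjust to the real Solr core/collection.
solr = pysolr.Solr("http://localhost:8983/solr/crawldb", timeout=10)


def content_signature(raw_content: bytes) -> str:
    """SHA-256 digest of the fetched raw_content, stored in the `signature` field."""
    return hashlib.sha256(raw_content).hexdigest()


def is_duplicate_content(raw_content: bytes) -> bool:
    """Return True if a document with the same signature is already indexed.

    Note the drawback mentioned above: the page has to be fetched before the
    signature can be computed, even if the duplicate is then discarded.
    """
    sig = content_signature(raw_content)
    # The hex digest is [0-9a-f] only, so it is safe to embed in the query string.
    results = solr.search("signature:%s" % sig, rows=0)
    return results.hits > 0
```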
Issue Analytics
- Created 7 years ago
- Comments: 16 (16 by maintainers)
Agreed, you can dedup by `crawlid+url`.
We need to flesh out more details on how this will be implemented. I am open to starting a discussion on this if it is on the timeline right now; otherwise we can defer it until it comes under development.
Better for the `dedupe_id`.
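A sketch of the `crawlid+url` suggestion, assuming the de-dup id is simply a stable hash over those two values (the parameter names and the separator are illustrative choices, not Sparkler's actual schema):

```python
import hashlib


def dedupe_id(crawl_id: str, url: str) -> str:
    """Stable id derived from crawlid+url, so re-discovering the same URL
    within the same crawl maps onto the same CrawlDB document."""
    # "|" is an arbitrary, unambiguous separator between the two parts.
    key = "%s|%s" % (crawl_id, url)
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


# e.g. dedupe_id("crawl-2016-08", "http://example.com/") yields the same id
# every time that URL is rediscovered within the same crawl.
```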
Let me elaborate on my point. There are two types of de-duplication.
De-duplication of Outlinks: This is what we are discussing here. The objective is to de-duplicate outlinks so that we don't end up crawling the same page again and again. For example, if every page of a website points back to its home page, we would like to remove the home page URL from the outlinks so that Sparkler doesn't fetch it again.
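As an illustration of this first kind, here is a sketch that drops outlinks already known to the CrawlDB before they are queued for fetching. The Solr core name, the `url` field lookup, and the naive quoting are assumptions; a real implementation would also normalize URLs first.

```python
from typing import List

import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/crawldb", timeout=10)


def filter_new_outlinks(outlinks: List[str]) -> List[str]:
    """Keep only outlinks whose URL is not yet present in the CrawlDB."""
    fresh = []
    seen_in_batch = set()
    for url in outlinks:
        if url in seen_in_batch:          # duplicate within the same page
            continue
        seen_in_batch.add(url)
        # Naive existence check; assumes the URL contains no double quotes.
        if solr.search('url:"%s"' % url, rows=0).hits == 0:
            fresh.append(url)
    return fresh
```

With something like this in place, the home-page URL from the example above would be filtered out on every page after it is first discovered.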
De-duplication of Content: This is, I think, what you are talking about. This applies when we are refreshing the crawl, or when we want the crawler to fetch a page again based on the `retry_interval_seconds` property. This is not implemented yet; when it is, we will add the newly fetched document to our index and it will have the same `dedupe_id`. We can handle this with different Solr handlers.
I was thinking along the lines of generalization and giving control to the user, i.e. letting them define which combination of schema fields makes up the de-duplication id. Let's push this back, because it was just a random thought and is not helping the issue.
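Purely to illustrate the generalization idea before it is deferred: a minimal sketch where the user configures which schema fields make up the de-dup id. The `dedupe.fields` config key and the example field names are hypothetical.

```python
import hashlib
from typing import Dict, List


def dedupe_id_from_fields(doc: Dict[str, object], fields: List[str]) -> str:
    """Build the de-dup id from a user-configured combination of schema fields."""
    key = "|".join(str(doc.get(field, "")) for field in fields)
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


# e.g. with a configuration such as dedupe.fields=crawl_id,url
doc = {"crawl_id": "crawl-2016-08", "url": "http://example.com/"}
print(dedupe_id_from_fields(doc, ["crawl_id", "url"]))
```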