StatusUpdaterBolt to use provided domain name for routing
StatusUpdaterBolt, if configured with routing byDomain, should use the routing key from metadata (if provided in the field defined by es.status.routing.fieldname). Updates of the public suffix list (included in the crawler-commons dependency) may change the domain name and hence the routing key, which can cause duplicate status records in the index and needless refetches of the same URL (cf. commoncrawl/news-crawl#28).
The simplest solution is to use the provided routing key (as is already done for routing byIP). This would require changes only in URLPartitioner. Alternatively, StatusUpdaterBolt could check whether the routing key has changed and, if so, send a deletion request using the original routing key and then update the status document with the new routing key.
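A minimal sketch of the first option, for illustration only: prefer a routing key already present in the metadata and derive one from the URL only when none is provided. This is not StormCrawler's actual URLPartitioner code. The class name, the plain `Map` standing in for the metadata object, the field name `"key"` and the host-based fallback are all assumptions made to keep the example self-contained.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

class RoutingKeySketch {

    // Hypothetical name of the metadata field holding the routing key
    // (in a real setup this would come from es.status.routing.fieldname).
    static final String ROUTING_FIELD = "key";

    // Naive fallback: uses the full host name. A real crawler would derive
    // the domain from the public suffix list, which is exactly the value
    // that can change between list versions.
    static String deriveFromUrl(String url) {
        String host = URI.create(url).getHost();
        return host == null ? url : host.toLowerCase();
    }

    // Prefer the key stored with the URL so that routing stays stable;
    // recompute only if no key was provided.
    static String routingKey(String url, Map<String, String> metadata) {
        String provided = metadata.get(ROUTING_FIELD);
        if (provided != null && !provided.isEmpty()) {
            return provided;
        }
        return deriveFromUrl(url);
    }

    public static void main(String[] args) {
        Map<String, String> md = new HashMap<>();
        md.put(ROUTING_FIELD, "example.co.uk");
        // Key provided in metadata: used as-is.
        System.out.println(routingKey("https://news.example.co.uk/a", md));
        // No key provided: falls back to the derived value.
        System.out.println(routingKey("https://news.example.co.uk/a", new HashMap<>()));
    }
}
```

Because the stored key wins whenever it is present, a later change in how the domain is derived would not move an existing status document to a different shard in this sketch.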
- Created 5 years ago
- Comments: 9 (6 by maintainers)
Top GitHub Comments
The way I see it, it would copy to a new index. Aliasing could be used to preserve a generic name, e.g. status, if needed. Reindexing could also be useful, e.g. for changing the sharding logic or the number of shards.
Thanks, I’ll test it over the next few days. Picking the _routing value from ES is of course the most reliable solution, maybe better than using