StatusUpdaterBolt to use provided domain name for routing
`StatusUpdaterBolt`, if configured with routing `byDomain`, should use the routing key from the metadata (if provided in the field defined by `es.status.routing.fieldname`). Updates of the public suffix list (included in the crawler-commons dependency) may change the domain name and routing key, and may cause duplicate status records in the index and needless refetches of the same URL (cf. commoncrawl/news-crawl#28).
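To illustrate why a changed routing key can lead to duplicates, here is a minimal sketch of routing-based shard selection. It assumes a simplified rule (hash of the routing key modulo the number of shards). Elasticsearch uses a Murmur3 hash rather than `String.hashCode()`, and the class name, host names and shard count are hypothetical.

```java
public class ShardRoutingSketch {

    /**
     * Simplified stand-in for routing-based shard selection.
     * Assumption: Elasticsearch hashes the routing value with Murmur3;
     * String.hashCode() is used here only to keep the sketch self-contained.
     */
    public static int shardFor(String routingKey, int numShards) {
        return Math.floorMod(routingKey.hashCode(), numShards);
    }

    public static void main(String[] args) {
        int numShards = 10; // hypothetical shard count

        // Hypothetical example: a public suffix list update changes the
        // domain name derived for the same URL.
        String keyBeforeUpdate = "github.io";
        String keyAfterUpdate = "example.github.io";

        // If the two keys map to different shards, an update sent with the
        // new key does not find the document stored under the old key, and
        // a second status record for the same URL is created.
        System.out.println("before: shard " + shardFor(keyBeforeUpdate, numShards));
        System.out.println("after:  shard " + shardFor(keyAfterUpdate, numShards));
    }
}
```

The same routing key always maps to the same shard, which is why reusing the key already stored with the status record avoids the problem.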
The simplest solution is just to use the provided routing key (similar to how it is done for routing `byIP`). This would require changes only in `URLPartitioner`. Alternatively, `StatusUpdaterBolt` could check whether the routing key has changed and, if so, send a deletion request using the original routing key and update the status document with the new routing key.
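Both options can be sketched in plain Java. This is only an illustration under stated assumptions, not the StormCrawler implementation: metadata is modelled as a simple `Map<String, String>`, and the class `RoutingKeySketch`, its methods and the operation strings are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RoutingKeySketch {

    /**
     * Option 1: prefer the routing key provided in the metadata and fall
     * back to a freshly derived key only if none is present.
     */
    public static String resolveRoutingKey(
            Map<String, String> metadata, String routingField, String derivedKey) {
        String provided = metadata.get(routingField);
        return (provided == null || provided.isEmpty()) ? derivedKey : provided;
    }

    /**
     * Option 2: if the stored key differs from the new one, delete the
     * document under the old key before indexing it under the new key.
     * The operations are returned as plain strings for illustration.
     */
    public static List<String> planOperations(
            String docId, String storedKey, String newKey) {
        List<String> ops = new ArrayList<>();
        if (storedKey != null && !storedKey.equals(newKey)) {
            ops.add("DELETE " + docId + " routing=" + storedKey);
        }
        ops.add("INDEX " + docId + " routing=" + newKey);
        return ops;
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        metadata.put("key", "github.io"); // hypothetical field name and value

        // Option 1: the provided key wins over the newly derived one.
        System.out.println(resolveRoutingKey(metadata, "key", "example.github.io"));

        // Option 2: the key changed, so a delete precedes the index request.
        System.out.println(planOperations("doc-1", "github.io", "example.github.io"));
    }
}
```

The first option keeps existing records where they are, while the second migrates them to the new key at the cost of an extra deletion request per changed record.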
Issue Analytics
- Created 5 years ago
- Comments: 9 (6 by maintainers)
Top GitHub Comments
The way I see it, it would copy to a new index. Aliasing could be used to preserve a generic name, e.g. `status`, if needed. Reindexing could also be useful, e.g. for changing the sharding logic or the number of shards.
Thanks, I'll test it over the next few days. Picking the `_routing` value from ES is of course the most reliable solution, maybe better than using `metadata.hostname` (see `es.status.bucket.field` and `es.status.routing.fieldname`, respectively).