Elasticsearch IndexerBolt: tuples with canonical URL may not get acked
This issue was seen in a topology fed by WARCSpout. The failure of unacked tuples triggered #825. The failure was reproducible: when the topology was run again with the same input, a mostly overlapping (but not identical) set of URLs was logged as failed. In addition, the failed URLs are missing from the status index.
A closer analysis showed that pages with a canonical URL were involved. One example:
2020-10-06 19:41:44.202 c.d.s.w.WARCSpout Thread-4-spout-executor[8 8] [INFO] Fetched https://www.obozrevatel.com/ukr/dnipro/city/u-dnipri-ta-oblasti-ogolosili-shtormove-poperedzhennya.htm with status 200
2020-10-06 19:41:44.351 c.d.s.w.WARCSpout Thread-4-spout-executor[8 8] [INFO] Fetched https://www.obozrevatel.com/ukr/dnipro/city/u-dnipri-ta-oblasti-ogolosili-shtormove-poperedzhennya/amp.htm with status 200
2020-10-06 19:41:45.608 c.d.s.b.JSoupParserBolt Thread-14-parse-executor[6 6] [INFO] Parsing : starting https://www.obozrevatel.com/ukr/dnipro/city/u-dnipri-ta-oblasti-ogolosili-shtormove-poperedzhennya.htm
2020-10-06 19:41:45.636 c.d.s.b.JSoupParserBolt Thread-14-parse-executor[6 6] [INFO] Parsed https://www.obozrevatel.com/ukr/dnipro/city/u-dnipri-ta-oblasti-ogolosili-shtormove-poperedzhennya.htm in 23 msec
2020-10-06 19:41:45.674 c.d.s.b.JSoupParserBolt Thread-14-parse-executor[6 6] [INFO] Total for https://www.obozrevatel.com/ukr/dnipro/city/u-dnipri-ta-oblasti-ogolosili-shtormove-poperedzhennya.htm - 61 msec
2020-10-06 19:41:45.675 c.d.s.e.b.IndexerBolt Thread-10-index-executor[3 3] [INFO] Indexing https://www.obozrevatel.com/ukr/dnipro/city/u-dnipri-ta-oblasti-ogolosili-shtormove-poperedzhennya.htm as https://www.obozrevatel.com/ukr/dnipro/city/u-dnipri-ta-oblasti-ogolosili-shtormove-poperedzhennya.htm
2020-10-06 19:41:46.191 c.d.s.b.JSoupParserBolt Thread-14-parse-executor[6 6] [INFO] Parsing : starting https://www.obozrevatel.com/ukr/dnipro/city/u-dnipri-ta-oblasti-ogolosili-shtormove-poperedzhennya/amp.htm
2020-10-06 19:41:46.202 c.d.s.b.JSoupParserBolt Thread-14-parse-executor[6 6] [INFO] Parsed https://www.obozrevatel.com/ukr/dnipro/city/u-dnipri-ta-oblasti-ogolosili-shtormove-poperedzhennya/amp.htm in 7 msec
2020-10-06 19:41:46.212 c.d.s.b.JSoupParserBolt Thread-14-parse-executor[6 6] [INFO] Total for https://www.obozrevatel.com/ukr/dnipro/city/u-dnipri-ta-oblasti-ogolosili-shtormove-poperedzhennya/amp.htm - 17 msec
2020-10-06 19:41:46.215 c.d.s.e.b.IndexerBolt Thread-10-index-executor[3 3] [INFO] Indexing https://www.obozrevatel.com/ukr/dnipro/city/u-dnipri-ta-oblasti-ogolosili-shtormove-poperedzhennya/amp.htm as https://www.obozrevatel.com/ukr/dnipro/city/u-dnipri-ta-oblasti-ogolosili-shtormove-poperedzhennya.htm
2020-10-06 19:41:46.754 c.d.s.e.b.IndexerBolt I/O dispatcher 12 [WARN] Could not find unacked tuple for 50571c0ffec7d295bb754b4847bdf2edace07885895ca09e5d459eeddd03c6f7
2020-10-06 19:51:40.108 c.d.s.e.b.IndexerBolt I/O dispatcher 12 [INFO] Bulk response [246] : items 100, waitAck 42, acked 99, failed 0
2020-10-06 19:51:43.985 c.d.s.w.WARCSpout Thread-4-spout-executor[8 8] [ERROR] Failed - unable to replay WARC record of: https://www.obozrevatel.com/ukr/dnipro/city/u-dnipri-ta-oblasti-ogolosili-shtormove-poperedzhennya.htm
Reduced to be more readable (A is the canonical page URL, B is its AMP variant):
19:41:44.202 [INFO] Fetched A with status 200
19:41:44.351 [INFO] Fetched B with status 200
19:41:45.608 [INFO] Parsing : starting A
19:41:45.636 [INFO] Parsed A in 23 msec
19:41:45.674 [INFO] Total for A - 61 msec
19:41:45.675 [INFO] Indexing A as A
19:41:46.191 [INFO] Parsing : starting B
19:41:46.202 [INFO] Parsed B in 7 msec
19:41:46.212 [INFO] Total for B - 17 msec
19:41:46.215 [INFO] Indexing B as A
19:41:46.754 [WARN] Could not find unacked tuple for sha256sum(A)
19:51:40.108 [INFO] Bulk response [246] : items 100, waitAck 42, acked 99, failed 0
19:51:43.985 [ERROR] Failed - unable to replay WARC record of: A
Note: there is no prior "Bulk response" log message, which means both pages/URLs were processed in the first bulk request. The hash was verified to be the SHA-256 hash of A:
$> echo -n 'https://www.obozrevatel.com/ukr/dnipro/city/u-dnipri-ta-oblasti-ogolosili-shtormove-poperedzhennya.htm' \
| sha256sum
50571c0ffec7d295bb754b4847bdf2edace07885895ca09e5d459eeddd03c6f7 -
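One plausible reading of the log above can be sketched in Java. This is a minimal, hypothetical model and not the IndexerBolt source: it assumes that pending tuples are tracked in a map keyed by the document ID, and that the document ID is the hex-encoded SHA-256 of the URL the page is indexed as. The class name, method names, and placeholder URLs are all illustrative assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of the suspected collision; not the actual IndexerBolt code. */
final class DocIdCollisionSketch {

    /** Hex-encoded SHA-256 of a string, equivalent to `echo -n "$s" | sha256sum`. */
    static String sha256Hex(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    /** Assumed bookkeeping: at most one pending tuple per document ID. */
    private final Map<String, String> waitAck = new HashMap<>();

    /**
     * Registers a tuple (represented here by its source URL) under the ID of
     * the URL it is indexed as. Returns the tuple it displaced, or null.
     */
    String index(String sourceUrl, String indexedAsUrl) {
        return waitAck.put(sha256Hex(indexedAsUrl), sourceUrl);
    }

    /**
     * Handles one item of a bulk response. Returns the tuple to ack, or null
     * when none is pending (the "Could not find unacked tuple" case).
     */
    String onBulkItem(String docId) {
        return waitAck.remove(docId);
    }

    public static void main(String[] args) {
        String a = "https://example.org/page.htm";      // canonical URL (placeholder)
        String b = "https://example.org/page/amp.htm";  // variant that declares A as canonical
        DocIdCollisionSketch bolt = new DocIdCollisionSketch();

        bolt.index(a, a);                     // "Indexing A as A"
        String displaced = bolt.index(b, a);  // "Indexing B as A" replaces A's entry
        String docId = sha256Hex(a);

        System.out.println("displaced tuple: " + displaced);
        System.out.println("first bulk item acks: " + bolt.onBulkItem(docId));
        System.out.println("second bulk item acks: " + bolt.onBulkItem(docId));
    }
}
```

Under these assumptions the tuple for A is displaced when B is registered under the same ID, so only B is acked and the second bulk item finds nothing to ack. That would be consistent with A being the URL that later fails in the log.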
Top GitHub Comments
Btw., docID is calculated twice (the second time as sha256hex).

To put this issue into context: the worst thing that can happen is that the second URL (whichever it is) eventually fails with a timeout, but it should get replayed by the spout and will eventually make it into the status and content indices. Of course, if the spout can't replay the inputs, you'd lose it. But then similar content for the same URL would have been indexed anyway.