Elasticsearch IndexerBolt: tuples with canonical URL may not get acked

This issue was seen in a topology fed by WARCSpout. The failure of unacked tuples triggered #825. The failure was reproducible: when the topology was run again with the same input, a mostly overlapping (but not identical) set of URLs was logged as failed. In addition, the failed URLs are missing from the status index.

A closer analysis showed that pages with a canonical URL were involved. One example:

2020-10-06 19:41:44.202 c.d.s.w.WARCSpout Thread-4-spout-executor[8 8] [INFO] Fetched https://www.obozrevatel.com/ukr/dnipro/city/u-dnipri-ta-oblasti-ogolosili-shtormove-poperedzhennya.htm with status 200
2020-10-06 19:41:44.351 c.d.s.w.WARCSpout Thread-4-spout-executor[8 8] [INFO] Fetched https://www.obozrevatel.com/ukr/dnipro/city/u-dnipri-ta-oblasti-ogolosili-shtormove-poperedzhennya/amp.htm with status 200
2020-10-06 19:41:45.608 c.d.s.b.JSoupParserBolt Thread-14-parse-executor[6 6] [INFO] Parsing : starting https://www.obozrevatel.com/ukr/dnipro/city/u-dnipri-ta-oblasti-ogolosili-shtormove-poperedzhennya.htm
2020-10-06 19:41:45.636 c.d.s.b.JSoupParserBolt Thread-14-parse-executor[6 6] [INFO] Parsed https://www.obozrevatel.com/ukr/dnipro/city/u-dnipri-ta-oblasti-ogolosili-shtormove-poperedzhennya.htm in 23 msec
2020-10-06 19:41:45.674 c.d.s.b.JSoupParserBolt Thread-14-parse-executor[6 6] [INFO] Total for https://www.obozrevatel.com/ukr/dnipro/city/u-dnipri-ta-oblasti-ogolosili-shtormove-poperedzhennya.htm - 61 msec
2020-10-06 19:41:45.675 c.d.s.e.b.IndexerBolt Thread-10-index-executor[3 3] [INFO] Indexing https://www.obozrevatel.com/ukr/dnipro/city/u-dnipri-ta-oblasti-ogolosili-shtormove-poperedzhennya.htm as https://www.obozrevatel.com/ukr/dnipro/city/u-dnipri-ta-oblasti-ogolosili-shtormove-poperedzhennya.htm
2020-10-06 19:41:46.191 c.d.s.b.JSoupParserBolt Thread-14-parse-executor[6 6] [INFO] Parsing : starting https://www.obozrevatel.com/ukr/dnipro/city/u-dnipri-ta-oblasti-ogolosili-shtormove-poperedzhennya/amp.htm
2020-10-06 19:41:46.202 c.d.s.b.JSoupParserBolt Thread-14-parse-executor[6 6] [INFO] Parsed https://www.obozrevatel.com/ukr/dnipro/city/u-dnipri-ta-oblasti-ogolosili-shtormove-poperedzhennya/amp.htm in 7 msec
2020-10-06 19:41:46.212 c.d.s.b.JSoupParserBolt Thread-14-parse-executor[6 6] [INFO] Total for https://www.obozrevatel.com/ukr/dnipro/city/u-dnipri-ta-oblasti-ogolosili-shtormove-poperedzhennya/amp.htm - 17 msec
2020-10-06 19:41:46.215 c.d.s.e.b.IndexerBolt Thread-10-index-executor[3 3] [INFO] Indexing https://www.obozrevatel.com/ukr/dnipro/city/u-dnipri-ta-oblasti-ogolosili-shtormove-poperedzhennya/amp.htm as https://www.obozrevatel.com/ukr/dnipro/city/u-dnipri-ta-oblasti-ogolosili-shtormove-poperedzhennya.htm
2020-10-06 19:41:46.754 c.d.s.e.b.IndexerBolt I/O dispatcher 12 [WARN] Could not find unacked tuple for 50571c0ffec7d295bb754b4847bdf2edace07885895ca09e5d459eeddd03c6f7
2020-10-06 19:51:40.108 c.d.s.e.b.IndexerBolt I/O dispatcher 12 [INFO] Bulk response [246] : items 100, waitAck 42, acked 99, failed 0
2020-10-06 19:51:43.985 c.d.s.w.WARCSpout Thread-4-spout-executor[8 8] [ERROR] Failed - unable to replay WARC record of: https://www.obozrevatel.com/ukr/dnipro/city/u-dnipri-ta-oblasti-ogolosili-shtormove-poperedzhennya.htm

The same log, reduced for readability (A is the canonical URL, B is its AMP variant):

19:41:44.202 [INFO] Fetched A with status 200
19:41:44.351 [INFO] Fetched B with status 200
19:41:45.608 [INFO] Parsing : starting A
19:41:45.636 [INFO] Parsed A in 23 msec
19:41:45.674 [INFO] Total for A - 61 msec
19:41:45.675 [INFO] Indexing A as A
19:41:46.191 [INFO] Parsing : starting B
19:41:46.202 [INFO] Parsed B in 7 msec
19:41:46.212 [INFO] Total for B - 17 msec
19:41:46.215 [INFO] Indexing B as A
19:41:46.754 [WARN] Could not find unacked tuple for sha256sum(A)
19:51:40.108 [INFO] Bulk response [246] : items 100, waitAck 42, acked 99, failed 0
19:51:43.985 [ERROR] Failed - unable to replay WARC record of: A
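
One way to read the reduced log is that both tuples end up waiting for an ack under the same document ID, because B is indexed under the canonical URL A. The following is a minimal toy model of that reading, not StormCrawler's actual code: the class `ToyIndexer`, its `wait_ack` dictionary, and the `example.com` URLs are all hypothetical, and the assumption that pending tuples are tracked in a map keyed by document ID is ours.

```python
import hashlib


def doc_id(url: str) -> str:
    """Document ID as the SHA-256 hex digest of the URL used for indexing."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


class ToyIndexer:
    """Hypothetical model: one pending tuple per document ID (an assumption)."""

    def __init__(self) -> None:
        self.wait_ack: dict[str, str] = {}  # doc ID -> URL of the pending tuple
        self.acked: list[str] = []
        self.warnings: list[str] = []

    def index(self, tuple_url: str, indexed_as: str) -> None:
        # A second tuple with the same doc ID silently replaces the first one.
        self.wait_ack[doc_id(indexed_as)] = tuple_url

    def on_bulk_response(self, item_ids: list[str]) -> None:
        for item_id in item_ids:
            pending = self.wait_ack.pop(item_id, None)
            if pending is None:
                self.warnings.append(f"Could not find unacked tuple for {item_id}")
            else:
                self.acked.append(pending)


# Placeholder URLs standing in for A (canonical) and B (AMP variant).
A = "https://example.com/page.htm"
B = "https://example.com/page/amp.htm"

indexer = ToyIndexer()
indexer.index(A, indexed_as=A)  # "Indexing A as A"
indexer.index(B, indexed_as=A)  # "Indexing B as A"

# The bulk response carries one item per request, both with the ID of A.
indexer.on_bulk_response([doc_id(A), doc_id(A)])

print("acked:", indexer.acked)
print("warnings:", len(indexer.warnings))
print("A never acked:", A not in indexer.acked)
```

In this model only the tuple for B is acked, one warning is recorded, and the tuple for A stays unacked until it times out, which would be consistent with the `Failed - unable to replay WARC record of: A` line in the log.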

Note: there is no Bulk response log message before the warning, so both pages/URLs must have been processed in the first bulk. The hash in the warning was verified to be the SHA-256 hash of A:

$> echo -n 'https://www.obozrevatel.com/ukr/dnipro/city/u-dnipri-ta-oblasti-ogolosili-shtormove-poperedzhennya.htm' \
    | sha256sum 
50571c0ffec7d295bb754b4847bdf2edace07885895ca09e5d459eeddd03c6f7  -
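
The same check can be scripted, for example when many logged IDs need to be mapped back to URLs. This is a small sketch using Python's standard `hashlib`; the variable names `URL` and `LOGGED_ID` are only illustrative, and the values are copied from the log above.

```python
import hashlib

# URL A and the ID from the "Could not find unacked tuple" warning above.
URL = (
    "https://www.obozrevatel.com/ukr/dnipro/city/"
    "u-dnipri-ta-oblasti-ogolosili-shtormove-poperedzhennya.htm"
)
LOGGED_ID = "50571c0ffec7d295bb754b4847bdf2edace07885895ca09e5d459eeddd03c6f7"

# Same as `echo -n URL | sha256sum`: hash the URL bytes without a trailing newline.
digest = hashlib.sha256(URL.encode("utf-8")).hexdigest()

print(digest)
print("matches logged ID:", digest == LOGGED_ID)
```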

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 7 (4 by maintainers)

Top GitHub Comments

sebastian-nagel commented, Oct 7, 2020 (1 reaction)

Btw., docID is calculated twice (second time as sha256hex).

jnioche commented, Oct 13, 2020 (0 reactions)

To put this issue into context: the worst that can happen is that the second URL (whichever it is) eventually fails with a timeout. It should then get replayed by the spout and eventually make it into the status and content indices. Of course, if the spout can't replay its inputs, you'd lose it. But in that case similar content for the same URL would have been indexed anyway.
