[URLFrontier] URLFrontier extension not returning ID preventing Status-ACK making crawling impossible

Hello @jnioche,

We switched our URL and status handling from a custom bolt to URLFrontier, but I noticed that the status updater bolt is not acking any tuples. After going through the code, adding some log events, cleaning up the async code, and improving the state management, my unit test produces the following log:

12:35:05.187 [Time-limited test] INFO  c.d.s.u.StatusUpdaterBolt - Initialisation of connection to URLFrontier service on localhost:53770
12:35:05.187 [Time-limited test] INFO  c.d.s.u.StatusUpdaterBolt - Allowing up to 100000 message in flight
12:35:05.194 [Time-limited test] ERROR c.d.s.u.PartitionUtil - Unknown partition mode : null - forcing to byHost
12:35:05.194 [Time-limited test] INFO  c.d.s.u.URLPartitioner - Using partition mode : QUEUE_MODE_HOST
12:35:05.263 [Time-limited test] TRACE c.d.s.u.StatusUpdaterBolt - Added to waitAck https://www.url.net/something with ID https://www.url.net/something total 1 - sent to localhost:53770
12:35:05.751 [grpc-default-executor-1] WARN  c.d.s.u.StatusUpdaterBolt - Could not find unacked tuple for blank id ``. (Ack: )
12:35:05.752 [grpc-default-executor-1] TRACE c.d.s.u.StatusUpdaterBolt - Trace for unpacked tuple for blank id: 
12:35:10.787 [Time-limited test] INFO  c.d.s.u.ChannelManager - Shutting down channel ManagedChannelOrphanWrapper{delegate=ManagedChannelImpl{logId=1, target=localhost:53770}}

It looks like URLFrontier does not return an ID when responding to a put. This error cannot be fixed on the StormCrawler side: without the ID, the status updater bolt cannot ack a single tuple, which makes crawling basically impossible.
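
To illustrate why a blank ID blocks acking, here is a minimal sketch of the bookkeeping as I understand it from the log above. It is not the actual StatusUpdaterBolt code: the class `WaitAckSketch` and its methods are hypothetical names, and tuples are represented as plain strings. The only things taken from the log are that pending tuples are held until the frontier answers and that the ID sent with the put is the URL.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the bolt's bookkeeping, not the StormCrawler class.
final class WaitAckSketch {
    // Pending tuples keyed by the ID sent with each put (the URL, as in the log).
    private final Map<String, String> waitAck = new ConcurrentHashMap<>();

    // Remember a tuple until the frontier confirms the put.
    void sent(String id, String tuple) {
        waitAck.put(id, tuple);
    }

    // Called when the frontier answers; returns the tuple that can now be acked.
    Optional<String> onAck(String idFromFrontier) {
        if (idFromFrontier == null || idFromFrontier.isEmpty()) {
            // A blank ID cannot be matched to any pending tuple.
            return Optional.empty();
        }
        return Optional.ofNullable(waitAck.remove(idFromFrontier));
    }

    // Number of tuples still waiting for an ack.
    int pending() {
        return waitAck.size();
    }

    public static void main(String[] args) {
        WaitAckSketch sketch = new WaitAckSketch();
        String url = "https://www.url.net/something";
        sketch.sent(url, "tuple-1");

        // Response without an ID: the tuple stays pending.
        System.out.println("blank id matched: " + sketch.onAck("").isPresent());
        System.out.println("pending: " + sketch.pending());

        // Response carrying the ID: the tuple is found and can be acked.
        System.out.println("url id matched: " + sketch.onAck(url).isPresent());
        System.out.println("pending: " + sketch.pending());
    }
}
```

With a blank ID in the response the lookup has nothing to match against, so the tuple stays in the map and is never acked, which is what the WARN line in the log reports.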

I added the unit tests etc. that I used in this PR: https://github.com/DigitalPebble/storm-crawler/pull/980

Best Regards

Felix

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 6 (6 by maintainers)

Top GitHub Comments

jnioche commented, Jul 6, 2022

Will close this issue as the underlying problem has been fixed already

jnioche commented, Jul 4, 2022

I think the underlying issue has already been fixed in URLFrontier

https://github.com/crawler-commons/url-frontier/commit/ced0150d3a516ba8c8ad94b362fb8960ab2b35d6

I will release a new version of URLFrontier shortly.
