[URLFrontier] URLFrontier not returning an ID prevents status ACKs, making crawling impossible
Hello @jnioche,
we switched our URL and status handling from a custom bolt to URLFrontier. However, I noticed that the status bolt does not ack any tuples. After going through the code, adding some log events, cleaning up the async code and improving the state management, my unit test shows the following log:
12:35:05.187 [Time-limited test] INFO c.d.s.u.StatusUpdaterBolt - Initialisation of connection to URLFrontier service on localhost:53770
12:35:05.187 [Time-limited test] INFO c.d.s.u.StatusUpdaterBolt - Allowing up to 100000 message in flight
12:35:05.194 [Time-limited test] ERROR c.d.s.u.PartitionUtil - Unknown partition mode : null - forcing to byHost
12:35:05.194 [Time-limited test] INFO c.d.s.u.URLPartitioner - Using partition mode : QUEUE_MODE_HOST
12:35:05.263 [Time-limited test] TRACE c.d.s.u.StatusUpdaterBolt - Added to waitAck https://www.url.net/something with ID https://www.url.net/something total 1 - sent to localhost:53770
12:35:05.751 [grpc-default-executor-1] WARN c.d.s.u.StatusUpdaterBolt - Could not find unacked tuple for blank id ``. (Ack: )
12:35:05.752 [grpc-default-executor-1] TRACE c.d.s.u.StatusUpdaterBolt - Trace for unpacked tuple for blank id:
12:35:10.787 [Time-limited test] INFO c.d.s.u.ChannelManager - Shutting down channel ManagedChannelOrphanWrapper{delegate=ManagedChannelImpl{logId=1, target=localhost:53770}}
It looks like URLFrontier does not provide an ID when responding to a put; this is an error that cannot be fixed on the StormCrawler side. Without the ID, the status bolt is unable to ACK a single tuple, making crawling basically impossible.
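For context, this is roughly how the ack matching works: the bolt keeps its in-flight tuples in a map keyed by the ID it sent to the frontier (here, the URL), and each ack coming back is looked up by that ID. A minimal sketch of why a blank ID breaks this (simplified types, not the actual StormCrawler code):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified sketch of the ack bookkeeping in a status updater bolt.
// "Tuple" is stood in for by a plain Object; the real bolt uses Storm tuples.
public class AckBookkeeping {

    // in-flight tuples, keyed by the ID sent to the frontier (here: the URL)
    private final Map<String, Object> waitAck = new ConcurrentHashMap<>();

    // called when a URL is sent to the frontier
    public void sent(String id, Object tuple) {
        waitAck.put(id, tuple);
    }

    // called from the gRPC response observer when the frontier acks a put
    public void onAck(String id) {
        Object tuple = waitAck.remove(id);
        if (tuple == null) {
            // this is exactly what happens when the frontier returns a blank ID:
            // the lookup key "" matches nothing, so the tuple is never acked
            System.err.printf("Could not find unacked tuple for blank id `%s`%n", id);
            return;
        }
        // in the real bolt: collector.ack(tuple)
        System.out.printf("acked %s%n", id);
    }

    public static void main(String[] args) {
        AckBookkeeping b = new AckBookkeeping();
        b.sent("https://www.url.net/something", new Object());
        b.onAck(""); // blank ID from the frontier -> tuple stays unacked forever
    }
}
```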
I added the unit tests etc. in this PR: https://github.com/DigitalPebble/storm-crawler/pull/980
Best Regards
Felix
Top GitHub Comments
Will close this issue as the underlying problem has been fixed already
I think the underlying issue has already been fixed in URLFrontier
https://github.com/crawler-commons/url-frontier/commit/ced0150d3a516ba8c8ad94b362fb8960ab2b35d6
I will release a new version of URLFrontier shortly.
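For readers hitting the same symptom, here is a rough before/after sketch of what such a fix changes, under the assumption that it amounts to the frontier echoing the URL's ID back in the ack (hypothetical simplified types standing in for the generated gRPC/protobuf classes, not the actual URLFrontier code):

```java
// Hypothetical simplified stand-ins for the generated protobuf classes;
// the real URLFrontier service streams ack messages back over gRPC.
record UrlItem(String id, String url) {}
record Ack(String id, String status) {}

public class AckFix {

    // before the fix: the ack is built without the item's ID, so the client
    // receives a blank ID and cannot match it to any in-flight tuple
    static Ack ackBefore(UrlItem item) {
        return new Ack("", "OK");
    }

    // after the fix: the incoming item's ID is echoed back, letting the
    // client remove the matching entry from its waitAck map and ack the tuple
    static Ack ackAfter(UrlItem item) {
        return new Ack(item.id(), "OK");
    }

    public static void main(String[] args) {
        UrlItem item = new UrlItem("https://www.url.net/something",
                                   "https://www.url.net/something");
        System.out.println(ackBefore(item)); // Ack[id=, status=OK]
        System.out.println(ackAfter(item));  // Ack[id=https://www.url.net/something, status=OK]
    }
}
```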