Default crawler does not run all Web Crawlers
See original GitHub issue

I have around 5K web crawlers configured with max_access = 32. When I start the default crawler, it only seems to process a portion of these (maybe 100) and then stops. The only suspicious thing I see in the logs is "Future got interrupted". Otherwise everything looks OK, but it doesn't even touch most of the sites.
Issue Analytics
- State:
- Created: 5 years ago
- Comments: 58 (18 by maintainers)
Top Results From Across the Web
Setting crawler configuration options - AWS Glue
Learn about how to configure what a crawler does when it encounters schema changes and partition changes in your data store.

AWS Glue Crawler Not Creating Table - Stack Overflow
In my case, the problem was in the setting Crawler source type > Repeat crawls of S3 data stores, which I've set...

How to configure your first crawler - Algolia
Running your first crawl takes a couple of minutes. The default configuration extracts some common attributes, such as a title, a description, headers, ...

Crawler options - Pope Tech Help Center
Crawl Start Page – The website's base URL is the default crawl start page. Sometimes, the crawler finds more pages by changing where...

Introduction to Siteimprove's crawler: an FAQ
By default, our servers crawl your website with a crawl frequency of 5 days. This means that 5 days after the scan has...
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Change the following value in fess_config.properties:
More:
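The exact property and value were not captured above. As a hedged sketch only: the symptom (roughly 100 of 5K configs being crawled) is consistent with a per-crawl fetch limit on web crawler configs in `fess_config.properties`. The property name and default below are assumptions, not confirmed by this page; verify against the `fess_config.properties` shipped with your Fess version before changing anything.

```
# fess_config.properties (hypothetical sketch; property name/default are assumptions)
# Raise the maximum number of web crawler configs the default crawler
# fetches per run (assumed default: 100) to cover all ~5K configs.
page.web.config.max.fetch.size=5000
```

After editing the file, restart Fess so the crawler picks up the new limit.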
Doesn’t look like it has anything to do with network settings. My crawler config is: 20 simultaneous crawler configs, with each crawler configured to use 3 threads. So it doesn’t look like the network max-connections issue.