
Default crawler does not run all Web Crawlers

See original GitHub issue

I have around 5K web crawlers configured with max_access = 32. When I start the default crawler, it only seems to do a portion of these, maybe 100, and then stops. The only suspicious thing I see in the logs is "Future got interrupted". Otherwise everything looks OK, but it doesn't even touch most of the sites.
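
A quick way to gauge how often that interrupt fires is to count its occurrences in the crawler log, e.g. (the log path below is illustrative only and depends on the install):

grep -c 'Future got interrupted' /path/to/fess/logs/fess-crawler.log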

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 58 (18 by maintainers)

Top GitHub Comments

1 reaction
marevol commented, Jan 28, 2019

Change the following value in fess_config.properties:

page.web.config.max.fetch.size=100
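
Presumably this property caps how many web crawler configs the default crawler loads per run, and the 100 shown above is the default, which would line up with only ~100 of the 5K configured sites being crawled. Under that reading, the fix is to raise it above the number of configured crawlers, e.g.:

page.web.config.max.fetch.size=6000

(6000 is an illustrative value, not from the thread; anything at or above the number of configured web crawlers should do.)
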
0 reactions
abolotnov commented, Feb 5, 2019

More:

ubuntu@ip-172-31-20-132:~$ ps -u fess
   PID TTY          TIME CMD
 24406 ?        00:02:28 java
 30204 ?        00:07:15 java
ubuntu@ip-172-31-20-132:~$ ps huH p 24406|wc -l
161
ubuntu@ip-172-31-20-132:~$ ps huH p 30204|wc -l
232

Doesn't look like it has anything to do with network settings. My crawler config is 20 simultaneous crawler configs, each running 3 threads, so this doesn't look like a network max-connections issue.
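
Presumably 20 configs × 3 threads bounds the crawler itself at about 60 threads, so most of the 161 and 232 threads counted above would be other JVM threads. For reference, the same ps trick can be looped to watch those counts over time (a sketch; the PIDs are the two Fess JVMs from ps -u fess above, and ps huH prints one line per thread):

while true; do
  for pid in 24406 30204; do
    printf '%s pid=%s threads=%s\n' "$(date +%T)" "$pid" "$(ps huH p "$pid" | wc -l)"
  done
  sleep 10
done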

Read more comments on GitHub.

Top Results From Across the Web

  • Setting crawler configuration options - AWS Glue: Learn about how to configure what a crawler does when it encounters schema changes and partition changes in your data store.
  • AWS Glue Crawler Not Creating Table - Stack Overflow: In my case, the problem was in the setting Crawler source type > Repeat crawls of S3 data stores, which I’ve set...
  • How to configure your first crawler - Algolia: Running your first crawl takes a couple of minutes. The default configuration extracts some common attributes, such as a title, a description, headers, ...
  • Crawler options - Pope Tech Help Center: Crawl Start Page – The website’s base URL is the default crawl start page. Sometimes, the crawler finds more pages by changing where...
  • Introduction to Siteimprove’s crawler: an FAQ: By default, our servers crawl your website with a crawl frequency of 5 days. This means that 5 days after the scan has...
