
Indexed documents vs crawled documents

See original GitHub issue

Hi,

Any idea why I have only a small number of indexed documents compared to crawled documents? Fess is showing I have 200k+ documents, but in reality I only have 12k+ in my index that I can search. I cannot search for all 200k documents.

The crawling job has finished and there is nothing else happening. How can I check what happened to the other documents? I have tried the logs already. At this point this is the biggest issue I'm facing; how can I get that number closer to the number of crawled documents?

[Screenshot: fess_dashboard]

Issue Analytics

  • State: closed
  • Created: 7 years ago
  • Comments: 8 (4 by maintainers)

Top GitHub Comments

1 reaction
marevol commented, Sep 9, 2016

JVM options for the Crawler are in fess_config.properties:

jvm.crawler.options=\
-Djava.awt.headless=true\n\
-server\n\
-Xmx512m\n\
-XX:MaxMetaspaceSize=128m\n\
-XX:CompressedClassSpaceSize=32m\n\
-XX:-UseGCOverheadLimit\n\
-XX:+UseConcMarkSweepGC\n\
-XX:CMSInitiatingOccupancyFraction=75\n\
-XX:+UseParNewGC\n\
-XX:+UseTLAB\n\
-XX:+DisableExplicitGC\n\
-XX:-OmitStackTraceInFastThrow\n\
-Djcifs.smb.client.connTimeout=60000\n\
-Djcifs.smb.client.soTimeout=35000\n\
-Djcifs.smb.client.responseTimeout=30000\n\
-Dgroovy.use.classvalue=true\n\
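
Since the maintainer points at the crawler's JVM options here, one plausible reading is that the crawler process is hitting its 512m heap cap and dropping documents before they reach the index. Below is a sketch of the same property with a larger heap; the 2g figure is an illustrative assumption, not a Fess recommendation, so size it to the host.

```properties
# fess_config.properties -- same property as above, with only -Xmx raised.
jvm.crawler.options=\
-Djava.awt.headless=true\n\
-server\n\
-Xmx2g\n\
-XX:MaxMetaspaceSize=128m\n\
-XX:CompressedClassSpaceSize=32m\n\
-XX:-UseGCOverheadLimit\n\
-XX:+UseConcMarkSweepGC\n\
-XX:CMSInitiatingOccupancyFraction=75\n\
-XX:+UseParNewGC\n\
-XX:+UseTLAB\n\
-XX:+DisableExplicitGC\n\
-XX:-OmitStackTraceInFastThrow\n\
-Djcifs.smb.client.connTimeout=60000\n\
-Djcifs.smb.client.soTimeout=35000\n\
-Djcifs.smb.client.responseTimeout=30000\n\
-Dgroovy.use.classvalue=true\n\
```

The new options only take effect when the crawler process is next launched; restarting Fess after editing the file is the safe way to make sure the properties are re-read.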

1 reaction
marevol commented, Sep 8, 2016

To enable remote debugging, go to Admin > System > Scheduler > Default Crawler and change the script to

return container.getComponent("crawlJob").logLevel("info").remoteDebug().execute(executor);

and also change the log level to “debug”.


Top Results From Across the Web

Crawling and Indexing - Google
As the search appliance crawls public content sources, it indexes documents that it finds. To find more documents, the crawler follows links within...

What is Crawling and Indexing?
Crawling is the discovery of pages and links that lead to more pages. Indexing is storing, analyzing, and organizing the content and connections...

In-Depth Guide to How Google Search Works | Documentation
Get an in-depth understanding of how Google Search works and improve your site for Google's crawling, indexing, and ranking processes.

Crawling and Indexing Your Search Collection - IBM
The Watson™ Explorer Engine crawls and indexes the documents in a search collection in order to be able to quickly and flexibly search...

Crawling and Indexing Content Sources
Crawling is straightforward; it's indexing that demands engineering skills and lots of innovation and creativity. Consider these four crawled documents.
