
Indexed documents vs crawled documents

See original GitHub issue

Hi,

Any idea why I have only a small number of indexed documents compared to crawled documents? Fess is showing I have 200k+ documents, but in reality I only have 12k+ in my index that I can search. I cannot search for all 200k documents.

The crawling job has finished and there is nothing else happening. How can I check what happened to the other documents? I have tried the logs already. At this point this is the biggest issue I'm facing; how can I get that number closer to the number of crawled documents?

[Screenshot: fess_dashboard]

Issue Analytics

  • State: closed
  • Created: 7 years ago
  • Comments: 8 (4 by maintainers)

Top GitHub Comments

1 reaction
marevol commented, Sep 9, 2016

JVM options for the Crawler are in fess_config.properties:

jvm.crawler.options=\
-Djava.awt.headless=true\n\
-server\n\
-Xmx512m\n\
-XX:MaxMetaspaceSize=128m\n\
-XX:CompressedClassSpaceSize=32m\n\
-XX:-UseGCOverheadLimit\n\
-XX:+UseConcMarkSweepGC\n\
-XX:CMSInitiatingOccupancyFraction=75\n\
-XX:+UseParNewGC\n\
-XX:+UseTLAB\n\
-XX:+DisableExplicitGC\n\
-XX:-OmitStackTraceInFastThrow\n\
-Djcifs.smb.client.connTimeout=60000\n\
-Djcifs.smb.client.soTimeout=35000\n\
-Djcifs.smb.client.responseTimeout=30000\n\
-Dgroovy.use.classvalue=true\n\
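
Since the maintainer points at the crawler's JVM options here, one plausible reading is that the crawler process is hitting its 512m heap cap and dropping documents before they reach the index. Below is a sketch of the same property with a larger heap; the 2g figure is an illustrative assumption, not a Fess recommendation, so size it to the host.

```properties
# fess_config.properties -- same property as above, with only -Xmx raised.
jvm.crawler.options=\
-Djava.awt.headless=true\n\
-server\n\
-Xmx2g\n\
-XX:MaxMetaspaceSize=128m\n\
-XX:CompressedClassSpaceSize=32m\n\
-XX:-UseGCOverheadLimit\n\
-XX:+UseConcMarkSweepGC\n\
-XX:CMSInitiatingOccupancyFraction=75\n\
-XX:+UseParNewGC\n\
-XX:+UseTLAB\n\
-XX:+DisableExplicitGC\n\
-XX:-OmitStackTraceInFastThrow\n\
-Djcifs.smb.client.connTimeout=60000\n\
-Djcifs.smb.client.soTimeout=35000\n\
-Djcifs.smb.client.responseTimeout=30000\n\
-Dgroovy.use.classvalue=true\n\
```

The new options only take effect when the crawler process is next launched; restarting Fess after editing the file is the safe way to make sure the properties are re-read.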

1 reaction
marevol commented, Sep 8, 2016

To enable remote debugging, go to Admin > System > Scheduler > Default Crawler and change the script to

return container.getComponent("crawlJob").logLevel("info").remoteDebug().execute(executor);

and also change the log level to “debug”.


Top Results From Across the Web

Crawling and Indexing - Google
As the search appliance crawls public content sources, it indexes documents that it finds. To find more documents, the crawler follows links within...

What is Crawling and Indexing?
Crawling is the discovery of pages and links that lead to more pages. Indexing is storing, analyzing, and organizing the content and connections...

In-Depth Guide to How Google Search Works | Documentation
Get an in-depth understanding of how Google Search works and improve your site for Google's crawling, indexing, and ranking processes.

Crawling and Indexing Your Search Collection - IBM
The Watson™ Explorer Engine crawls and indexes the documents in a search collection in order to be able to quickly and flexibly search...

Crawling and Indexing Content Sources
Crawling is straightforward; it's indexing that demands engineering skills and lots of innovation and creativity. Consider these four crawled documents.
