Performance of reindexing

Hello,

First off, very nice tool 😄 I played around with this crawler a bit today (combined with Tika and Tesseract OCR).

The initial indexing of 1.5 GB (8,000 files) took a while, which is fine of course.

My main problem currently is that the “reindexing” takes more time than I expected. For those 8,000 files it took about 2 minutes.

Is there any possibility to speed up that part? Configuration or similar?

Does it currently compare the file modification timestamp with the lastrun timestamp? Or is it another approach?
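To make the question concrete, here is a minimal sketch of the timestamp-comparison approach I have in mind. It is not taken from the crawler's code; the function name `changed_since` and the idea of storing the last run as a Unix timestamp are assumptions for illustration only.

```python
import os
from pathlib import Path
from typing import Iterator


def changed_since(root: Path, last_run: float) -> Iterator[Path]:
    """Yield files under root modified after last_run (a Unix timestamp).

    Hypothetical helper, not the crawler's actual implementation.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                mtime = path.stat().st_mtime
            except OSError:
                # File vanished or is unreadable; skip it.
                continue
            if mtime > last_run:
                yield path
```

Even with this approach every file still needs one `stat` call per run, so a rescan is never free, but only the files that actually changed would go through extraction and OCR again.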

Thanks in advance for any information 👍

Issue Analytics

State:
Created 7 years ago
Comments:8 (4 by maintainers)

Top GitHub Comments

ThaDafinser commented, Nov 22, 2016

Some ideas related to this topic:

Split the metadata + content extraction, similar to tika-server: just provide a REST/HTTP API for it.
This would have the benefit that you can create a custom crawler (or switch), but still use the same index and indexing mechanism.

With this split, it’s possible to create a general crawler for all systems and specialized ones for unix/windows.
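A rough sketch of what that split could look like, assuming the extraction step is hidden behind a single callable. The names `Extractor` and `index_documents` are made up for illustration; in a real setup `extract` would presumably wrap an HTTP call to the extraction service.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, Iterator

# Hypothetical interface: raw file bytes in, extracted fields out.
Extractor = Callable[[bytes], Dict[str, str]]


def index_documents(paths: Iterable[Path], extract: Extractor) -> Iterator[Dict[str, str]]:
    """Turn paths from any crawler into documents ready for indexing."""
    for path in paths:
        doc = extract(path.read_bytes())
        doc["path"] = str(path)
        yield doc
```

Any crawler (general, unix-specific, windows-specific) only has to produce the `paths`; extraction and indexing stay the same.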

dadoonet commented, Feb 6, 2017

Not for now. So basically you would like to be able to read any parameter either from settings or from the command line.

That makes sense to me. Can you open an issue for that?
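For the issue, the requested behaviour could be sketched roughly like this: values from the settings file act as defaults, and anything passed on the command line wins. The `--set KEY=VALUE` flag and the setting names are hypothetical and do not exist in the tool today.

```python
import argparse
from typing import Dict, List


def resolve_settings(file_settings: Dict[str, str], argv: List[str]) -> Dict[str, str]:
    """Merge settings-file values with command-line overrides.

    The --set flag is an assumption for illustration, not an existing option.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("--set", dest="overrides", action="append",
                        default=[], metavar="KEY=VALUE")
    args = parser.parse_args(argv)

    merged = dict(file_settings)  # start from the settings file
    for item in args.overrides:
        key, sep, value = item.partition("=")
        if not sep or not key:
            parser.error(f"expected KEY=VALUE, got {item!r}")
        merged[key] = value  # command line takes precedence
    return merged
```

For example, `resolve_settings({"update_rate": "15m"}, ["--set", "update_rate=1m"])` would return the settings with `update_rate` overridden to `"1m"`.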