Performance of reindexing

See original GitHub issue

Hello,

First off, very nice tool 😄 I played around with this crawler a bit today (combined with Tika and Tesseract OCR).

The initial indexing of 1.5 GB (8,000 files) took a while, which is fine of course.

My main problem currently is that the “reindexing” takes more time than I expected. For those 8,000 files it took about 2 minutes.

Is there any possibility to speed up that part? Configuration or similar?

Does it currently compare each file's modification timestamp with the last-run timestamp? Or does it use another approach?

Thanks in advance for any information 👍
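For readers unfamiliar with the approach the question refers to, here is a minimal sketch of timestamp-based change detection. It is not taken from the crawler's source and does not claim to show how the crawler works; the function name `changed_since` and the `last_run` parameter are hypothetical.

```python
import os
from pathlib import Path
from typing import Iterator


def changed_since(root: Path, last_run: float) -> Iterator[Path]:
    """Yield files under `root` modified after `last_run` (epoch seconds).

    Hypothetical helper for illustration only.
    """
    stack = [root]
    while stack:
        current = stack.pop()
        # os.scandir returns directory entries without a separate listing pass
        with os.scandir(current) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(Path(entry.path))
                elif entry.is_file(follow_symlinks=False):
                    # Compare the file's modification time with the last run
                    if entry.stat(follow_symlinks=False).st_mtime > last_run:
                        yield Path(entry.path)
```

Even with a check like this, every file still has to be visited and its metadata read, so the cost of a rescan grows with the number of files rather than with the number of changes.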

Issue Analytics

  • State: closed
  • Created 7 years ago
  • Comments: 8 (4 by maintainers)

Top GitHub Comments

1 reaction
ThaDafinser commented, Nov 22, 2016

Some ideas related to this topic:

  • Split the metadata and content extraction out into its own service, similar to tika-server: just provide a REST/HTTP API for it.
  • This would have the benefit that you can create a custom crawler (or switch crawlers) but still use the same index and indexing mechanism.

With this split, it’s possible to create a general crawler for all systems and specialized ones for Unix/Windows.
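The split proposed above might look roughly like the following sketch. All names here (`generic_crawler`, `run_pipeline`, and the `extract` and `index` callables) are hypothetical and only illustrate the separation of concerns; in the proposal, `extract` would be a call to an HTTP extraction service, which is stubbed out here as a plain function.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, Iterator

# Hypothetical type aliases: a crawler only discovers files, nothing else.
Crawler = Callable[[Path], Iterable[Path]]
Extractor = Callable[[Path], Dict[str, str]]
Indexer = Callable[[str, Dict[str, str]], None]


def generic_crawler(root: Path) -> Iterator[Path]:
    """Portable crawler: walk the tree and yield every regular file."""
    for path in sorted(root.rglob("*")):
        if path.is_file():
            yield path


def run_pipeline(crawler: Crawler, root: Path,
                 extract: Extractor, index: Indexer) -> int:
    """Feed whatever the crawler finds through shared extraction and indexing."""
    count = 0
    for path in crawler(root):
        index(str(path), extract(path))
        count += 1
    return count
```

Because `run_pipeline` only depends on the crawler's output, a platform-specific crawler could be swapped in without touching extraction or indexing.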

0 reactions
dadoonet commented, Feb 6, 2017

Not for now. So basically you would like to be able to read any parameter either from settings or from the command line.

That makes sense to me. Can you open an issue for that?


