Lunr with a large index (800,000 items)
I’m trying to use lunr to index CCEDICT in the browser. This takes over 15 seconds. I’ve attached a CPU profile; it looks like most of the time is spent in lunr.SortedSet.add, with a fair amount in lunr.TokenStore.add and a big chunk of garbage collection. Is there anything I can do to speed up my indexing? Thanks!
My indexing code:
```javascript
var lineRegex = /(\S+)\s+(\S+)\s+\[([^\]]*)\]\s+\/(.*)\/\s*$/;

function parseCCEDICT(data) {
  dictionaryStore = {};
  dictionaryIndex = lunr(function () {
    this.field('simplified');
    this.field('traditional');
    this.field('pronunciation');
    this.field('definitions');
    this.ref('id');
  });
  var lines = data.split('\n');
  for (var i = 0; i < lines.length; i++) {
    var line = lines[i];
    if (line.startsWith('#') || line === '') {
      continue; // skip comments and blanks
    }
    var match = lineRegex.exec(line);
    if (match !== null) {
      var entry = {
        simplified: match[1],
        traditional: match[2],
        pronunciation: match[3],
        definitions: match[4].split('/'),
        id: i
      };
      dictionaryIndex.add(entry);
      dictionaryStore[i] = entry;
    } else {
      console.log('Invalid line format for line: ' + line);
    }
  }
}
```
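For reference, here is how that regex captures the pieces of a CC-CEDICT-style line (the sample entry below is made up for illustration):

```javascript
// Sanity check of the parsing regex against a sample dictionary line.
var lineRegex = /(\S+)\s+(\S+)\s+\[([^\]]*)\]\s+\/(.*)\/\s*$/;

var sample = '你好 你好 [ni3 hao3] /hello/hi/';
var match = lineRegex.exec(sample);

console.log(match[1]);            // first headword: 你好
console.log(match[3]);            // pronunciation: ni3 hao3
console.log(match[4].split('/')); // definitions: [ 'hello', 'hi' ]
```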
CCEDICT: http://www.mdbg.net/chindict/chindict.php?page=cedict
Issue Analytics
- Created 7 years ago
- Reactions: 1
- Comments: 7 (3 by maintainers)
Top GitHub Comments
Thanks for the detailed issue!
I’m sure 15 seconds waiting to index in the browser seems like a lifetime, especially if this is done on the UI thread. That said, if the profile you included in the issue is for the full 800K documents, that works out at 22K ‘inserts’ per second (the CPU profile seemed to indicate indexing took ~35s). I don’t think that is too bad, even if it is all in memory.
The simplest thing you can do is just not do the indexing in the browser: assuming the data is fairly static, you can pre-build the index and then just load the serialised result. This doesn’t make the indexing any quicker, it just caches the result. If you can’t pre-build the index, then you can (and probably should) perform the indexing in a worker thread if possible.
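A minimal sketch of the pre-build approach, assuming lunr is installed via npm and run in Node (serialising to JSON and rehydrating with lunr.Index.load is lunr’s documented serialisation mechanism; the file names here are just examples):

```javascript
// build-index.js: run once in Node, not in the browser.
var fs = require('fs');
var lunr = require('lunr');

var idx = lunr(function () {
  this.field('definitions');
  this.ref('id');
});

// ... add all parsed CCEDICT entries here, as in parseCCEDICT above ...

// A lunr index serialises to plain JSON.
fs.writeFileSync('index.json', JSON.stringify(idx));

// In the browser, fetch index.json and rehydrate it instead of re-indexing:
//   var idx = lunr.Index.load(JSON.parse(serialisedIndex));
```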
Okay, with the two ‘easy’ options out of the way that leaves us with doing some real work. The CPU profile indicates that lunr.TokenStore#add is dominating the time. If you look at the source it is relatively simple. The token store is just a trie, for each token the characters are iterated and inserted into the tree structure.
I can think of a couple of things to possibly make this method quicker:
- remove the recursion and iterate over the token’s characters in a loop
- cache the lookup of root[key]; it is accessed twice on every function call, and the object might end up containing a lot of keys

I would imagine that removing the recursion would be the biggest gain here. That said, I’m doubtful (though hopeful) that there is much scope for improvement; it looks like each individual call to the function is taking 0.1ms.
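As an illustrative sketch only (this is not lunr’s actual code), both suggestions combined would look something like this: a flat loop over the token’s characters with a single property lookup per step.

```javascript
// Iterative (non-recursive) trie insert: one root[key]-style lookup
// per character instead of two, and no function-call overhead per level.
function trieAdd(root, token, docRef) {
  var node = root;
  for (var i = 0; i < token.length; i++) {
    var key = token.charAt(i);
    var child = node[key]; // single lookup per character
    if (child === undefined) {
      child = {};
      node[key] = child;
    }
    node = child;
  }
  node.docs = node.docs || [];
  node.docs.push(docRef);
}

var root = {};
trieAdd(root, 'cat', 1);
trieAdd(root, 'car', 2);
// root.c.a.t.docs is now [1], root.c.a.r.docs is [2]
```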
Let me know how those suggestions above work out, and if you do see some improvements then please submit a PR.
As an aside, thanks for introducing me to an interesting data set that I can use for stress testing lunr!
Thanks for the suggestions! Yep, the profile was for building the entire index, plus a few seconds overhead for loading the original dict file into memory.
For others following along in the future, I used node.js to create and dump my index; see the code at the end of this comment. This resulted in a 61 MB index.json and an 18 MB store.json, compressed to 9.5 MB and 4.5 MB, respectively, with gzip. The original CCEDICT is 8.4 MB uncompressed, 3.3 MB compressed.
Reading back the index.json and setting up a Lunr object on my MacBook 12 (1.2 GHz, 2016) takes 5.6 seconds. That’s vastly better than the 35 seconds, though I would love to find a way to further reduce the time.
dump.js
load.js (first you need to gzip the dumped index file)