question-mark
Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Lunr with a large index (800,000 items)

See original GitHub issue

I’m trying to use lunr to index CCEDICT in the browser. This takes over 15 seconds. I’ve attached a CPU Profile, it looks like most of the time is spent in lunr.SortedSet.add with a bunch in lunr.TokenStore.add and a big chunk of garbage collection. Is there anything I can do to speed up my indexing? Thanks!

My indexing code:

  var lineRegex = /(\S+)\s+(\S+)\s+\[([^\]]*)\]\s+\/(.*)\/\s*$/;
  function parseCCEDICT(data) {
    dictionaryStore = {};
    dictionaryIndex = lunr(function () {
      this.field('simplified')
      this.field('traditional')
      this.field('pronunciation')
      this.field('definitions')
      this.ref('id')
    })

    var lines = data.split('\n');
    for (var i=0; i<lines.length; i++l) {
      var line = lines[i];
      if (line.startsWith('#') || line === '') {
        continue; // skip comments and blanks
      }
      var match = lineRegex.exec(line);
      if (match !== null) {
        entry = {
            simplified: match[1],
            traditional: match[2],
            pronunciation: match[3],
            definitions: match[4].split('/'),
            id: i
        }
        dictionaryIndex.add(entry);
        dictionaryStore[i] = entry;
      } else { console.log("Invalid line format for line" + line); }
    }
  }

CCEDICT: http://www.mdbg.net/chindict/chindict.php?page=cedict

CPU-20160727T211933.cpuprofile.txt

Issue Analytics

  • State:open
  • Created 7 years ago
  • Reactions:1
  • Comments:7 (3 by maintainers)

github_iconTop GitHub Comments

5reactions
olivernncommented, Jul 27, 2016

Thanks for the detailed issue!

I’m sure 15 seconds waiting to index in the browser seems like a lifetime, especially if this is done on the UI thread. That said, if the profile you included in the issue is for the full 800K documents, that works out at 22K ‘inserts’ per second (the CPU profile seemed to indicate indexing took ~35s). I don’t think that is too bad, even if it is all in memory.

The simplest thing you can do is just not do the indexing in the browser, you can pre-build the index, assuming it is fairly static data, and then just load the serialised index. This doesn’t make the indexing any quicker, it just caches the result. If you can’t pre-build the index, then you can (and probably should) perform the indexing in a worker thread if possible.

Okay, with the two ‘easy’ options out of the way that leaves us with doing some real work. The CPU profile indicates that lunr.TokenStore#add is dominating the time. If you look at the source it is relatively simple. The token store is just a trie, for each token the characters are iterated and inserted into the tree structure.

I can think of a couple of things to possibly make this method quicker:

  1. Remove the recursion, there is a non-trivial overhead in calling functions, this would be the first thing I would try
  2. Stop keeping track of the number of tokens stored in the tree
  3. Cache the lookup of root[key], its accessed twice on every function call and the object might end up containing a lot of keys

I would imagine that removing the recursion would be the biggest gain here, that said, I’m doubtful (though hopeful) that there is much scope for improvement, it looks like each individual call to the function is taking 0.1ms.

Let me know how those suggestions above work out, and if you do see some improvements then please submit a PR.

As an aside, thanks for introducing me to an interesting data set that I can use for stress testing lunr!

2reactions
astrommecommented, Jul 28, 2016

Thanks for the suggestions! Yep, the profile was for building the entire index, plus a few seconds overhead for loading the original dict file into memory.

For others following along in the future, I used node.js to create and dump my index, see code at the end of this comment. This resulted in a 61 MB index.json and a 18 MB store.json, compressed to 9.5 MB and 4.5 MB, respectively, with gzip. The original CCEDICT is 8.4 MB uncompressed, 3.3 MB compressed.

Reading back in the index.json and setting up a Lunr object on my Macbook 12 (1.2Ghz, 2016) takes 5.6 seconds. Vastly better than the 35 seconds, though I would love to find a way to further reduce the time.

dump.js

var fs = require('fs');
var lunr = require('./lunr.js');
var path = require('path');

var filePath = path.join(__dirname, 'dicts/cedict_1_0_ts_utf-8_mdbg.txt');

fs.readFile(filePath, {encoding: 'utf-8'}, function(err,data){
    if (!err){
      var dict = parseCCEDICT(data);
      console.log("dumping dictionary");
      fs.writeFile("index.json", JSON.stringify(dict.index), function(err) {
          if(err) {
              return console.log(err);
          }
          console.log("wrote index.json");
      });
      fs.writeFile("store.json", JSON.stringify(dict.store), function(err) {
          if(err) {
              return console.log(err);
          }
          console.log("wrote store.json");
      });
    }else{
      console.log(err);
    }

});

var lineRegex = /(\S+)\s+(\S+)\s+\[([^\]]*)\]\s+\/(.*)\/\s*$/;

function parseCCEDICT(data) {
  dictionaryStore = {};
  dictionaryIndex = lunr(function () {
    this.field('simplified')
    this.field('traditional')
    this.field('pronunciation')
    this.field('definitions')
    this.ref('id')
  })

  var lines = data.split('\n');
  for (var i=0; i<lines.length; i++) {
    if (i % 10000 == 0) {
      console.log("finished with " + i);
    }
    var line = lines[i];
    if (line.startsWith('#') || line === '') {
      continue; // skip comments and blanks
    }
    var match = lineRegex.exec(line);
    if (match !== null) {
      entry = {
          simplified: match[1],
          traditional: match[2],
          pronunciation: match[3],
          definitions: match[4].split('/'),
          id: i
      }
      dictionaryIndex.add(entry);
      dictionaryStore[i] = entry;
    } else { console.log("Invalid line format for line" + line); }
  }

  return { index: dictionaryIndex, store: dictionaryStore };
}

load.js (first you need to gzip the dumped index file)

var fs = require('fs');
var lunr = require('../firebase/public/lunr.js');
var path = require('path');
var zlib = require('zlib');

const gunzip = zlib.createGunzip();

index = {};


fs.readFile('index.json.gz', function(err,data){
    if (!err){
      zlib.gunzip(data, (err, buffer) => {
        if (!err) {
          console.log("ungzipped");
          var indexDump = JSON.parse(buffer);
          index = lunr.Index.load(indexDump);
          console.log(index.search("hello"));
          console.log("index complete");
        } else {
          console.log("error ungzipping");
        }
      });

    }else{
      console.log(err);
    }

});
Read more comments on GitHub >

github_iconTop Results From Across the Web

Pre-building Indexes - Lunr.js
This technique is useful with large indexes, or with documents that are largely static, such as with a static website. Serialization. Lunr indexes...
Read more >
Quick and easy client-side JavaScript search with Lunr.js
Quick and easy client-side JavaScript search with Lunr.js - Display search results from any JSON data object in the browser using Lunr.js.
Read more >
How to Add Lunr Search to Your Gatsby Website - CSS-Tricks
In this article, we will build a search index and add search functionality to a Gatsby website with Lunr, a lightweight JavaScript library ......
Read more >
Implement client-side search on your website with this ...
But you can actually use Lunr.js to search any array of JavaScript objects. In this how-to, I build a search index for the...
Read more >
Gatsby site search with Lunr.js - For the practical developer
Now that we have a search index to work with, we need to build some components that can consume it. Building our components....
Read more >

github_iconTop Related Medium Post

No results found

github_iconTop Related StackOverflow Question

No results found

github_iconTroubleshoot Live Code

Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start Free

github_iconTop Related Reddit Thread

No results found

github_iconTop Related Hackernoon Post

No results found

github_iconTop Related Tweet

No results found

github_iconTop Related Dev.to Post

No results found

github_iconTop Related Hashnode Post

No results found