Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Lunr with a large index (800,000 items)

See original GitHub issue

I’m trying to use lunr to index CCEDICT in the browser. This takes over 15 seconds. I’ve attached a CPU Profile, it looks like most of the time is spent in lunr.SortedSet.add with a bunch in lunr.TokenStore.add and a big chunk of garbage collection. Is there anything I can do to speed up my indexing? Thanks!

My indexing code:

  var lineRegex = /(\S+)\s+(\S+)\s+\[([^\]]*)\]\s+\/(.*)\/\s*$/;
  function parseCCEDICT(data) {
    dictionaryStore = {};
    dictionaryIndex = lunr(function () {
      this.field('simplified')
      this.field('traditional')
      this.field('pronunciation')
      this.field('definitions')
      this.ref('id')
    })

    var lines = data.split('\n');
    for (var i=0; i<lines.length; i++l) {
      var line = lines[i];
      if (line.startsWith('#') || line === '') {
        continue; // skip comments and blanks
      }
      var match = lineRegex.exec(line);
      if (match !== null) {
        entry = {
            simplified: match[1],
            traditional: match[2],
            pronunciation: match[3],
            definitions: match[4].split('/'),
            id: i
        }
        dictionaryIndex.add(entry);
        dictionaryStore[i] = entry;
      } else { console.log("Invalid line format for line" + line); }
    }
  }

CCEDICT: http://www.mdbg.net/chindict/chindict.php?page=cedict

CPU-20160727T211933.cpuprofile.txt

Issue Analytics

State:
Created 7 years ago
Reactions:1
Comments:7 (3 by maintainers)

Top GitHub Comments

5reactions

olivernncommented, Jul 27, 2016

Thanks for the detailed issue!

I’m sure 15 seconds waiting to index in the browser seems like a lifetime, especially if this is done on the UI thread. That said, if the profile you included in the issue is for the full 800K documents, that works out at 22K ‘inserts’ per second (the CPU profile seemed to indicate indexing took ~35s). I don’t think that is too bad, even if it is all in memory.

The simplest thing you can do is just not do the indexing in the browser, you can pre-build the index, assuming it is fairly static data, and then just load the serialised index. This doesn’t make the indexing any quicker, it just caches the result. If you can’t pre-build the index, then you can (and probably should) perform the indexing in a worker thread if possible.

Okay, with the two ‘easy’ options out of the way that leaves us with doing some real work. The CPU profile indicates that lunr.TokenStore#add is dominating the time. If you look at the source it is relatively simple. The token store is just a trie, for each token the characters are iterated and inserted into the tree structure.

I can think of a couple of things to possibly make this method quicker:

Remove the recursion, there is a non-trivial overhead in calling functions, this would be the first thing I would try
Stop keeping track of the number of tokens stored in the tree
Cache the lookup of root[key], its accessed twice on every function call and the object might end up containing a lot of keys

I would imagine that removing the recursion would be the biggest gain here, that said, I’m doubtful (though hopeful) that there is much scope for improvement, it looks like each individual call to the function is taking 0.1ms.

Let me know how those suggestions above work out, and if you do see some improvements then please submit a PR.

As an aside, thanks for introducing me to an interesting data set that I can use for stress testing lunr!

2reactions

astrommecommented, Jul 28, 2016

Thanks for the suggestions! Yep, the profile was for building the entire index, plus a few seconds overhead for loading the original dict file into memory.

For others following along in the future, I used node.js to create and dump my index, see code at the end of this comment. This resulted in a 61 MB index.json and a 18 MB store.json, compressed to 9.5 MB and 4.5 MB, respectively, with gzip. The original CCEDICT is 8.4 MB uncompressed, 3.3 MB compressed.

Reading back in the index.json and setting up a Lunr object on my Macbook 12 (1.2Ghz, 2016) takes 5.6 seconds. Vastly better than the 35 seconds, though I would love to find a way to further reduce the time.

dump.js

var fs = require('fs');
var lunr = require('./lunr.js');
var path = require('path');

var filePath = path.join(__dirname, 'dicts/cedict_1_0_ts_utf-8_mdbg.txt');

fs.readFile(filePath, {encoding: 'utf-8'}, function(err,data){
    if (!err){
      var dict = parseCCEDICT(data);
      console.log("dumping dictionary");
      fs.writeFile("index.json", JSON.stringify(dict.index), function(err) {
          if(err) {
              return console.log(err);
          }
          console.log("wrote index.json");
      });
      fs.writeFile("store.json", JSON.stringify(dict.store), function(err) {
          if(err) {
              return console.log(err);
          }
          console.log("wrote store.json");
      });
    }else{
      console.log(err);
    }

});

var lineRegex = /(\S+)\s+(\S+)\s+\[([^\]]*)\]\s+\/(.*)\/\s*$/;

function parseCCEDICT(data) {
  dictionaryStore = {};
  dictionaryIndex = lunr(function () {
    this.field('simplified')
    this.field('traditional')
    this.field('pronunciation')
    this.field('definitions')
    this.ref('id')
  })

  var lines = data.split('\n');
  for (var i=0; i<lines.length; i++) {
    if (i % 10000 == 0) {
      console.log("finished with " + i);
    }
    var line = lines[i];
    if (line.startsWith('#') || line === '') {
      continue; // skip comments and blanks
    }
    var match = lineRegex.exec(line);
    if (match !== null) {
      entry = {
          simplified: match[1],
          traditional: match[2],
          pronunciation: match[3],
          definitions: match[4].split('/'),
          id: i
      }
      dictionaryIndex.add(entry);
      dictionaryStore[i] = entry;
    } else { console.log("Invalid line format for line" + line); }
  }

  return { index: dictionaryIndex, store: dictionaryStore };
}

load.js (first you need to gzip the dumped index file)

var fs = require('fs');
var lunr = require('../firebase/public/lunr.js');
var path = require('path');
var zlib = require('zlib');

const gunzip = zlib.createGunzip();

index = {};


fs.readFile('index.json.gz', function(err,data){
    if (!err){
      zlib.gunzip(data, (err, buffer) => {
        if (!err) {
          console.log("ungzipped");
          var indexDump = JSON.parse(buffer);
          index = lunr.Index.load(indexDump);
          console.log(index.search("hello"));
          console.log("index complete");
        } else {
          console.log("error ungzipping");
        }
      });

    }else{
      console.log(err);
    }

});