Using invertedIndex for autocomplete

Not so much an issue with lunr, which is great! More a quick try to get ideas going…

In a shell with the jq utility I pull my terms from the lunr index in advance:

   jq '[.index.invertedIndex[][0]|scan("^\\w{3,}")]|unique' index.json > iindex.json

I can feed that to the http://api.jqueryui.com/autocomplete/ widget like below:

   function normalize(str) {
      var map = { "ä": "a", "ö": "o", "ü": "u", "ß": "ss" };
      return str.replace(/[^A-Za-z0-9]/g,
         function(a) { return map[a]||a; }
      );
   }
   $.getJSON('iindex.json', function (tags) {
      $('#query').autocomplete({
         minLength: 3,
         source: function(inp, out) {
            var t = normalize(inp.term);
            var r = $.ui.autocomplete.filter(tags, t);
            out(r);
         }
      });
   });

Not perfect, but it works acceptably so far. What would be truly nice is to create the autocomplete index on the client and have the term to match processed by the indexer instead of the crude normalizer above.
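One way the jq step could move to the client, as a rough sketch: read the term list straight from the serialised index object. The function name extractTerms and its minLength parameter are made up for illustration, and the sketch assumes the lunr 2.x serialised shape in which invertedIndex is an array of [term, postings] pairs, as the jq path above implies.

```javascript
// Collect autocomplete terms from a serialised lunr index object.
// Assumption: `serialized.invertedIndex` is an array of [term, postings]
// pairs, as in the JSON the jq command above reads.
function extractTerms(serialized, minLength) {
  // Same idea as scan("^\\w{3,}"): keep the leading run of word characters
  // when it is at least minLength long.
  const prefix = new RegExp("^\\w{" + (minLength || 3) + ",}");
  const seen = new Set();
  serialized.invertedIndex.forEach(function (entry) {
    const match = prefix.exec(entry[0]);
    if (match) {
      seen.add(match[0]);
    }
  });
  // jq's `unique` also sorts, so sort here as well.
  return Array.from(seen).sort();
}
```

With the file layout from the jq command, where the index sits under an index key, the call would be extractTerms(data.index). The resulting array can be passed to $.ui.autocomplete.filter in place of tags.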

chrisbartley commented, Jan 16, 2019

First, a HUGE thanks to both @hungerburg and @olivernn for this. I combined both suggestions and it’s working great. For anyone wanting to do the same, this is what worked for me…

I’m indexing like this:

// Store unstemmed term in the metadata.  See:
// https://github.com/olivernn/lunr.js/issues/287#issuecomment-322573117
// https://lunrjs.com/guides/customising.html#token-meta-data
const storeUnstemmed = function(builder) {

   // Define a pipeline function that keeps the unstemmed word
   const pipelineFunction = function(token) {
      token.metadata['unstemmed'] = token.toString();
      return token;
   };

   // Register the pipeline function so the index can be serialised
   lunr.Pipeline.registerFunction(pipelineFunction, 'storeUnstemmed');

   // Add the pipeline function to the indexing pipeline, ahead of the stemmer
   builder.pipeline.before(lunr.stemmer, pipelineFunction);

   // Whitelist the unstemmed metadata key
   builder.metadataWhitelist.push('unstemmed');
};

const index = lunr(function() {
   this.use(storeUnstemmed);
   ...
});

And modified the autocomplete function suggested by @hungerburg to use the unstemmed words like this:

autoComplete(searchTerm) {
   const results = this._index.query(function(q) {
      // exact matches should have the highest boost
      q.term(searchTerm, { boost : 100 })
      // wildcard matches should be boosted slightly
      q.term(searchTerm, {
         boost : 10,
         usePipeline : true,
         wildcard : lunr.Query.wildcard.LEADING | lunr.Query.wildcard.TRAILING
      })
      // finally, try a fuzzy search, without any boost
      q.term(searchTerm, { boost : 1, usePipeline : false, editDistance : 1 })
   });
   if (!results.length) {
      return []; // no matches: return an empty list, the same type as below
   }
   return results.map(function(v, i, a) { // extract unstemmed terms
      const unstemmedTerms = {};
      Object.keys(v.matchData.metadata).forEach(function(term) {
         Object.keys(v.matchData.metadata[term]).forEach(function(field) {
            v.matchData.metadata[term][field].unstemmed.forEach(function(word) {
               unstemmedTerms[word] = true;
            });
         });
      });
      return Object.keys(unstemmedTerms);
   }).reduce(function(a, b) { // flatten
      return a.concat(b);
   }).filter(function(v, i, a) { // uniq
      return a.indexOf(v) === i;
   });
}

Thanks!

olivernn commented, Aug 7, 2017

Sorry for the late reply.

You could definitely wrap that normalise function up into a lunr plugin. There is a similar project, lunr-unicode-normalizer, but I don’t think it has been updated for lunr 2.
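A minimal sketch of what such a plugin might look like, reusing the normalize function from the original post. The names makeNormalizePlugin and "normalizeChars" are hypothetical. It assumes lunr 2.x, where tokens have an update method and the builder exposes both pipeline and searchPipeline with lunr.stemmer present in each (the default setup).

```javascript
// Same character map as the normalize function in the original post.
function normalize(str) {
  var map = { "ä": "a", "ö": "o", "ü": "u", "ß": "ss" };
  return str.replace(/[^A-Za-z0-9]/g, function (a) { return map[a] || a; });
}

// Hypothetical helper: builds a plugin for the lunr object passed in.
// Assumes lunr.stemmer is in both pipelines, otherwise before() will throw.
function makeNormalizePlugin(lunr) {
  return function (builder) {
    var normalizeToken = function (token) {
      // token.update replaces the token's string with the callback's result
      return token.update(function (str) { return normalize(str); });
    };
    // Register under a label so a serialised index can be loaded again
    lunr.Pipeline.registerFunction(normalizeToken, "normalizeChars");
    // Normalise at index time and at search time, before stemming
    builder.pipeline.before(lunr.stemmer, normalizeToken);
    builder.searchPipeline.before(lunr.stemmer, normalizeToken);
  };
}
```

Inside the lunr configuration function, this.use(makeNormalizePlugin(lunr)) would install it. Query clauses that set usePipeline to false bypass the search pipeline, so for those the search term would need normalize applied by hand.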

As for autocomplete, I need to get round to actually putting a demo of this together, but this is what I’ve been suggesting to people.

idx.query(function (q) {
  // exact matches should have the highest boost
  q.term(searchTerm, { boost: 100 })

  // prefix matches should be boosted slightly
  q.term(searchTerm, { boost: 10, usePipeline: false, wildcard: lunr.Query.wildcard.TRAILING })

  // finally, try a fuzzy search, without any boost
  q.term(searchTerm, { boost: 1, usePipeline: false, editDistance: 1 })
})

I disable the pipeline to prevent stemming from getting in the way. You would have to experiment to see whether this makes sense for your use case, especially if you wanted to add the unicode normalising plugin.

Additionally, when using the query method lunr won’t do any tokenisation for you. You can either handle this yourself, borrow lunr.tokenizer directly, or use its regex to split the input into individual tokens.
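As a rough sketch of the "handle it yourself" option, the input can be split on a separator and the three clauses above added once per token. The helper names splitQuery and addAutocompleteClauses are hypothetical. The default separator is an assumption meant to mirror lunr.tokenizer.separator in lunr 2.x; passing that property explicitly avoids relying on the copy.

```javascript
// Assumption: mirrors lunr.tokenizer.separator (whitespace and hyphens).
var DEFAULT_SEPARATOR = /[\s\-]+/;

// Split raw user input into lowercase tokens, dropping empty strings.
function splitQuery(input, separator) {
  return input
    .toLowerCase()
    .split(separator || DEFAULT_SEPARATOR)
    .filter(Boolean);
}

// Add the exact, prefix and fuzzy clauses for every token.
// `q` is the query object handed to idx.query's callback;
// `trailing` is expected to be lunr.Query.wildcard.TRAILING.
function addAutocompleteClauses(q, input, trailing) {
  splitQuery(input).forEach(function (token) {
    q.term(token, { boost: 100 });
    q.term(token, { boost: 10, usePipeline: false, wildcard: trailing });
    q.term(token, { boost: 1, usePipeline: false, editDistance: 1 });
  });
}
```

It would be called from the query callback, for example idx.query(function (q) { addAutocompleteClauses(q, searchTerm, lunr.Query.wildcard.TRAILING) }). Whether every token should get all three clauses is a design choice; for autocomplete it may be enough to apply the wildcard clause to the last token only.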
