Index size bigger than the original data

Is it normal to get an index file which is larger (in size) than the original data file?

I have a JSON file of about 1 MB (about 58,278 words). I build an index for it using the following code:

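// Assumes FlexSearch 0.6.x installed from npm, where the module exports the constructor.
const FlexSearch = require("flexsearch");

const index_ar = new FlexSearch({
  tokenize: "strict",
  rtl: true,
  split: /\s+/,
  doc: {
    id: "id",
    field: [
      'title',
      'incident_date_time',
      'location:name'
    ]
  }
});
index_ar.add(data);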

The index file is ~2.1 MB! I measured the file size as follows:

const fs = require("fs");

const exportedIndex = index_ar.export();
fs.writeFileSync('exported.json', JSON.stringify(exportedIndex));

Is there anything wrong in the code, or is it normal to get an index bigger than the original data?

Issue Analytics

Created 4 years ago
Comments: 7 (4 by maintainers)

Top GitHub Comments

desjob commented, Jul 16, 2019

I would say it is expected that the index is always larger than the original data, because the index duplicates data in order to search fast.

Think of it as adding a traditional index to an existing book: for each word that occurs in the book, you add a list of the pages on which it occurs. Now your book is several pages thicker!
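
The book analogy can be sketched in a few lines. This is a toy inverted index, not FlexSearch's actual internal structure; the sample documents and the buildInvertedIndex helper are made up for illustration. Every distinct word in every document adds one more id to the index, and that is storage the original data never needed.

```javascript
// Toy inverted index: maps each word to the ids of the documents that contain it.
// Illustration only; FlexSearch's real index layout is different.
function buildInvertedIndex(docs) {
  // Null-prototype object so words like "constructor" are safe as keys.
  const index = Object.create(null);
  for (const doc of docs) {
    // Split on whitespace (as `split: /\s+/` does in the question) and dedupe per document.
    const words = new Set(doc.title.split(/\s+/).filter(Boolean));
    for (const word of words) {
      if (!index[word]) index[word] = [];
      index[word].push(doc.id);
    }
  }
  return index;
}

// Hypothetical sample data.
const docs = [
  { id: 1001, title: "fire at north gate" },
  { id: 1002, title: "flood at south gate" },
  { id: 1003, title: "fire at south market" },
];

const index = buildInvertedIndex(docs);

// Total number of (word, document) entries stored in the index.
const postings = Object.values(index).reduce((n, ids) => n + ids.length, 0);

console.log({
  postings,
  dataChars: JSON.stringify(docs).length,
  indexChars: JSON.stringify(index).length,
});
```

An export that contains both the stored documents and a structure like this one holds the original text plus one id per word occurrence, so it can easily outgrow the source file.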

tareefdev commented, Aug 2, 2019

This is a really nice option, thanks for the great work.

I want to emphasize that particular use case again: using only FlexSearch’s most basic functionality (“strict” tokenizer, non-contextual, split by words). In this case, I don’t see any difference (in performance or data size) between: