Index size bigger than the original data
Is it normal to get an index file that is larger (in size) than the original data file?
I have a JSON file of about 1 MB (roughly 58,278 words). I am trying to build an index for it using the following code:
const index_ar = new FlexSearch({
    tokenize: "strict",
    rtl: true,
    split: /\s+/,
    doc: {
        id: "id",
        field: [
            'title',
            'incident_date_time',
            'location:name'
        ]
    }
});
index_ar.add(data);
The resulting index file is about 2.1 MB! I checked the file size by exporting the index like this:
const exportedIndex = index_ar.export();
fs.writeFileSync('exported.json', JSON.stringify(exportedIndex));
Is there anything wrong with the code, or is it normal to get an index that is bigger than the original data?
Issue Analytics
- State:
- Created 4 years ago
- Comments:7 (4 by maintainers)
I would say it is expected that the index is larger than the original data, because the index duplicates data in order to search fast.
Think of it as adding a traditional index to an existing book: for each word that occurs in the book, you add a list of the pages on which it occurs. Now your book is several pages thicker!
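To make the book analogy concrete, here is a minimal sketch of a word-to-ids lookup in plain JavaScript. It is only an illustration, not how FlexSearch stores its index internally; the `buildInvertedIndex` helper and the sample documents are made up for this example.

```javascript
// Toy inverted index: maps each word to the ids of the documents containing it.
// Hypothetical helper for illustration; not part of the FlexSearch API.
function buildInvertedIndex(docs) {
  const index = new Map();
  for (const doc of docs) {
    // Split on whitespace and de-duplicate the words within one document.
    const words = new Set(doc.title.split(/\s+/).filter(Boolean));
    for (const word of words) {
      if (!index.has(word)) index.set(word, []);
      index.get(word).push(doc.id);
    }
  }
  return index;
}

// Made-up sample data.
const docs = [
  { id: 1, title: "fire in the old market" },
  { id: 2, title: "flood near the old bridge" },
  { id: 3, title: "fire near the bridge" },
];

const index = buildInvertedIndex(docs);

// Compare the serialized sizes of the data and of the lookup table.
const dataBytes = Buffer.byteLength(JSON.stringify(docs));
const indexBytes = Buffer.byteLength(JSON.stringify(Object.fromEntries(index)));
console.log(Object.fromEntries(index));
console.log({ dataBytes, indexBytes });
```

Every distinct word is stored once more as a key, and every document id is repeated once for each word that document contains. With several indexed fields, that extra bookkeeping adds up, so whether the serialized index ends up larger than the data depends on the content and the options used.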
This is a really nice option, thanks for the great work.
I want to emphasize that particular use case again: using only very basic FlexSearch functionality (“strict” tokenizer, non-contextual, split by words). In this case, I don’t see any difference (in performance and data size) between:
What are the benefits of using FlexSearch here?