
More memory efficient hash tables


Currently, hash tables are implemented as a Cython wrapper over the klib library. As far as I can tell, the array that klib uses to store the values is sparse. If large values are being stored in the hash table, this results in memory inefficiency, e.g.:

complex128 values = [----------, laaaaarge1, ----------, ----------, laaaaarge2, ----------]

  • size: 128 * 6 = 768 bits

This inefficiency could be solved by storing the large values in a dense array and instead storing indexes to the values in the hash table (a code sketch follows the example below).

uint16 indexes = [-, 0, -, -, 1, -]
complex128 values = [laaaaarge1, laaaaarge2]
  • size: 16 * 6 + 128 * 2 = 352 bits <-- ~54% smaller
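
In Python terms, the proposed layout might look like the minimal sketch below (a hypothetical illustration only: a plain dict stands in for klib's open-addressed table, and IndirectTable, insert, and lookup are invented names). The point is the layout, not the memory win itself, since a pure-Python dict stores pointers rather than inline values.

class IndirectTable:
    """Sparse table of small indexes into a dense array of large values."""

    def __init__(self):
        self._index = {}   # sparse side: key -> small integer position
        self._values = []  # dense side: only occupied slots pay for a value

    def insert(self, key, value):
        pos = self._index.get(key)
        if pos is None:
            # New key: append the large value densely, store only its index.
            self._index[key] = len(self._values)
            self._values.append(value)
        else:
            self._values[pos] = value

    def lookup(self, key):
        # Double lookup: one hash probe, then a cheap array indexing op.
        return self._values[self._index[key]]

table = IndirectTable()
table.insert("a", 1 + 2j)
table.insert("b", 3 - 4j)
assert table.lookup("b") == 3 - 4j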

More generally, the space savings would be (val_size - index_size) * n_buckets - n_vals * val_size bits.
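
As a quick sanity check of that formula against the example above (plain arithmetic, sizes in bits):

val_size, index_size = 128, 16  # complex128 value vs. uint16 index
n_buckets, n_vals = 6, 2        # table slots vs. occupied entries

sparse = val_size * n_buckets                       # 768 bits
dense = index_size * n_buckets + val_size * n_vals  # 352 bits
savings = (val_size - index_size) * n_buckets - n_vals * val_size
assert savings == sparse - dense == 416             # ~54% smaller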

Because this would save memory, it would allow for larger hash tables, allowing for fewer collisions and better speed. This would likely outweigh any performance cost from the additional array access (which is fast, because the arrays are Cython arrays).

However, I am not sure what the values of val_size, n_vals, and n_buckets generally are for Pandas/klib and would appreciate any insight on whether this proposal would actually result in a performance improvement.

Issue Analytics

  • State: open
  • Created: 6 years ago
  • Comments: 9 (9 by maintainers)

Top GitHub Comments

1 reaction
jreback commented, May 23, 2017

@rohanp you could try this. I think this might involve a fairly large change (in Cython), and you would have to measure the memory savings AND make sure perf doesn’t degrade too much (as now you are doing a double lookup, though the cost should not be much, as the 2nd access is an array indexing op, which is pretty fast).
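
A rough way to take that perf measurement might look like the micro-benchmark below (a hypothetical Python sketch, not pandas code; absolute timings will vary by machine, but it contrasts a direct value lookup with the proposed index-then-array double lookup):

import timeit

import numpy as np

n = 1_000_000
values = np.arange(n) + 1j * np.arange(n)  # dense complex128 array
direct = {k: values[k] for k in range(n)}  # value stored in the table
indirect = {k: k for k in range(n)}        # only the index stored there

print(timeit.timeit(lambda: direct[123_456], number=1_000_000))
print(timeit.timeit(lambda: values[indirect[123_456]], number=1_000_000))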

0 reactions
jreback commented, May 23, 2017

@rohanp having separate keys/values is much easier implementation-wise. We just use keys of the appropriate dtype; the values are always Py_ssize_t, which makes things much simpler. So we are only paying the power-of-two bucket cost once.
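
In sketch form, the existing pattern jreback describes looks roughly like this (simplified Python standing in for pandas' Cython/klib code, loosely analogous to what Index.get_loc does; the dict, the get_loc helper, and the sample data here are illustrative assumptions):

import numpy as np

data = np.array([3.5 + 1j, 7.25 - 2j, 0.5 + 0j], dtype=np.complex128)

# The table maps a key of the appropriate dtype to a position; the "values"
# stored in the table are always ssize_t-like integers.
table = {complex(v): i for i, v in enumerate(data)}

def get_loc(key):
    # One hash probe returns the integer location ...
    return table[key]

# ... and the caller indexes the dense array with it.
pos = get_loc(3.5 + 1j)
assert data[pos] == 3.5 + 1j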
