More memory efficient hash tables
Currently, hash tables are implemented with a Cython wrapper over the klib library. As far as I can tell, the array that klib uses to store the values is sparse. If large values are stored in the hash table, this results in memory inefficiency, e.g.:
complex128 values = [----------, laaaaarge1, ----------, ----------, laaaaarge2, ----------]
- size = 128 * 6 = 768 bits
this inefficiency could be solved by storing the large values in a dense array, and instead storing the indexes to the values in the hash table.
uint16 indexes = [-, 0, -, -, 1, -]
complex128 values = [laaaaarge1, laaaaarge2]
- size = 16 * 6 + 128 * 2 = 352 bits <-- ~54% smaller
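For illustration, here is a minimal Python sketch of the indirection idea (the name IndexedTable is made up; this is not the klib/Cython implementation, just the layout concept):

```python
class IndexedTable:
    """Sketch only: the hash table's buckets hold small integer positions,
    while the (large) values live contiguously in a dense side array."""

    def __init__(self):
        self._positions = {}   # stands in for the klib hash table: key -> small index
        self._values = []      # dense storage of the large values, in insertion order

    def insert(self, key, value):
        # Append the value densely; the table only ever stores its position.
        self._positions[key] = len(self._values)
        self._values.append(value)

    def lookup(self, key):
        # Double lookup: one hash probe, then a cheap array indexing op.
        return self._values[self._positions[key]]
```

In this layout only the small position entries pay the cost of the sparse bucket array; the complex128 payloads are packed densely in the values array.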
More generally, the space savings would be
(val_size - index_size) * n_buckets - n_vals * val_size
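As a sanity check, plugging the numbers from the example above into this formula (plain Python arithmetic, nothing pandas-specific):

```python
def savings_bits(val_size, index_size, n_buckets, n_vals):
    """Bits saved by keeping indexes in the buckets and the values in a dense side array."""
    sparse  = val_size * n_buckets                        # values stored directly in the buckets
    indexed = index_size * n_buckets + val_size * n_vals  # small indexes in buckets + dense values
    return sparse - indexed  # == (val_size - index_size) * n_buckets - n_vals * val_size

# complex128 values (128 bits), uint16 indexes (16 bits), 6 buckets, 2 stored values
print(savings_bits(val_size=128, index_size=16, n_buckets=6, n_vals=2))  # 416 bits (768 -> 352)
```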
Because this would save memory, it would allow for larger hash tables, which would mean fewer collisions and better speed. This would likely outweigh any performance cost from the additional array access (which is fast, since the arrays are Cython arrays).
However, I am not sure what the values of val_size, n_vals, and n_buckets generally are for Pandas/klib and would appreciate any insight on whether this proposal would actually result in a performance improvement.
Top GitHub Comments
@rohanp you could try this. I think this might involve a fairly large change (in Cython), and you would have to measure the memory savings AND make sure perf doesn't degrade too much (as now you are doing a double lookup, though the hit should not be by much, as the 2nd access is an array indexing op, which is pretty fast).
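One rough way to gauge the cost of that double lookup at the Python level (a toy timing sketch only; real numbers would have to come from Cython benchmarks against the pandas hash tables):

```python
import timeit
import numpy as np

n = 100_000
keys = list(range(n))
vals = np.arange(n, dtype=np.complex128)

direct = dict(zip(keys, vals))                  # bucket conceptually holds the value itself
positions = {k: i for i, k in enumerate(keys)}  # bucket holds only a small index
dense = vals                                    # the values themselves live in a dense array

t_direct  = timeit.timeit(lambda: [direct[k] for k in keys], number=20)
t_indexed = timeit.timeit(lambda: [dense[positions[k]] for k in keys], number=20)
print(f"direct:  {t_direct:.3f}s")
print(f"indexed: {t_indexed:.3f}s")  # expect only a modest overhead from the extra array access
```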
@rohanp having separate keys/values is much easier implementation-wise. We just use keys of the appropriate dtype; the values are always Py_ssize_t, which makes things much simpler, so we are only paying the power-of-two bucket cost once.
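The effect of that design is visible from Python: factorizing an array returns integer codes (the positions produced via the hash table) plus a dense array of the distinct values. A small illustration with pd.factorize, which as far as I know is backed by these hash tables:

```python
import pandas as pd

codes, uniques = pd.factorize(["a", "b", "a", "c"])
print(codes)    # [0 1 0 2]   -- integer positions stored as the hash table's values
print(uniques)  # ['a' 'b' 'c'] -- the distinct values, kept densely outside the buckets
```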