More memory efficient hash tables
Currently, hash tables are implemented with a Cython wrapper over the klib library. As far as I can tell, the array that klib uses to store the values is sparse. If large values are stored in the hash table, this results in memory inefficiency, e.g.:
complex128 values = [----------, laaaaarge1, ----------, ----------, laaaaarge2, ----------]
- size = 128 * 6 = 768 bits
this inefficiency could be solved by storing the large values in a dense array, and instead storing the indexes to the values in the hash table.
uint16 indexes = [-, 0, -, -, 1, -]
complex128 values = [laaaaarge1, laaaaarge2]
- size = 16 * 6 + 128 * 2 = 352 bits <-- ~54% smaller
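For illustration, here is a minimal Python sketch of the indirection idea (the name IndexedTable is made up; this is not the klib/Cython implementation, just the layout concept):

```python
class IndexedTable:
    """Sketch only: the hash table's buckets hold small integer positions,
    while the (large) values live contiguously in a dense side array."""

    def __init__(self):
        self._positions = {}   # stands in for the klib hash table: key -> small index
        self._values = []      # dense storage of the large values, in insertion order

    def insert(self, key, value):
        # Append the value densely; the table only ever stores its position.
        self._positions[key] = len(self._values)
        self._values.append(value)

    def lookup(self, key):
        # Double lookup: one hash probe, then a cheap array indexing op.
        return self._values[self._positions[key]]
```

In this layout only the small position entries pay the cost of the sparse bucket array; the complex128 payloads are packed densely in the values array.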
More generally, the space savings would be
(val_size - index_size) * n_buckets - n_vals * val_size
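As a sanity check, plugging the numbers from the example above into this formula (plain Python arithmetic, nothing pandas-specific):

```python
def savings_bits(val_size, index_size, n_buckets, n_vals):
    """Bits saved by keeping indexes in the buckets and the values in a dense side array."""
    sparse  = val_size * n_buckets                        # values stored directly in the buckets
    indexed = index_size * n_buckets + val_size * n_vals  # small indexes in buckets + dense values
    return sparse - indexed  # == (val_size - index_size) * n_buckets - n_vals * val_size

# complex128 values (128 bits), uint16 indexes (16 bits), 6 buckets, 2 stored values
print(savings_bits(val_size=128, index_size=16, n_buckets=6, n_vals=2))  # 416 bits (768 -> 352)
```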
Because this would save memory, it would allow for larger hash tables, which would mean fewer collisions and better speed. This would likely outweigh any performance cost from the additional array access (which is fast, since the arrays are Cython arrays).
However, I am not sure what the values of val_size, n_vals, and n_buckets generally are for Pandas/klib and would appreciate any insight on whether this proposal would actually result in a performance improvement.
Top GitHub Comments
@rohanp you could try this. I think this might involve a fairly large change (in Cython), and you would have to measure the memory savings AND make sure perf doesn't degrade too much (as now you are doing a double lookup, though the hit should not be by much, as the 2nd access is an array indexing op, which is pretty fast).
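One rough way to gauge the cost of that double lookup at the Python level (a toy timing sketch only; real numbers would have to come from Cython benchmarks against the pandas hash tables):

```python
import timeit
import numpy as np

n = 100_000
keys = list(range(n))
vals = np.arange(n, dtype=np.complex128)

direct = dict(zip(keys, vals))                  # bucket conceptually holds the value itself
positions = {k: i for i, k in enumerate(keys)}  # bucket holds only a small index
dense = vals                                    # the values themselves live in a dense array

t_direct  = timeit.timeit(lambda: [direct[k] for k in keys], number=20)
t_indexed = timeit.timeit(lambda: [dense[positions[k]] for k in keys], number=20)
print(f"direct:  {t_direct:.3f}s")
print(f"indexed: {t_indexed:.3f}s")  # expect only a modest overhead from the extra array access
```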
@rohanp having separate keys/values is much easier implementation-wise. We just use keys of the appropriate dtype; the values are always Py_ssize_t, which makes things much simpler, so we are only paying the power-of-two bucket cost once.
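The effect of that design is visible from Python: factorizing an array returns integer codes (the positions produced via the hash table) plus a dense array of the distinct values. A small illustration with pd.factorize, which as far as I know is backed by these hash tables:

```python
import pandas as pd

codes, uniques = pd.factorize(["a", "b", "a", "c"])
print(codes)    # [0 1 0 2]   -- integer positions stored as the hash table's values
print(uniques)  # ['a' 'b' 'c'] -- the distinct values, kept densely outside the buckets
```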