Use memory-conserving Arrow options
Arrow introduces two options which supposedly help with memory conservation.
self_destruct
frees each column as soon as it has been converted, which renders the pa.Table object useless after the conversion but should reduce the memory footprint. This should be a good option for kartothek. The documentation, however, explicitly flags it as experimental.
split_blocks
will not consolidate all same-typed columns into a single block, yielding a lower peak memory footprint during the conversion. Since we do not know how the data frames are used after Kartothek passes them out, probably not a good option.
https://arrow.apache.org/docs/python/pandas.html#reducing-memory-use-in-table-to-pandas
https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.to_pandas
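For reference, a minimal sketch of how the two flags would be passed in pyarrow (the toy table is just for illustration, not kartothek's actual read path):

```python
import pyarrow as pa

table = pa.table({"color": ["blue", "red"], "value": [1, 2]})

# self_destruct frees each Arrow column's buffers as soon as it has been
# converted; the table must not be used afterwards. split_blocks keeps one
# pandas block per column instead of consolidating same-typed columns.
df = table.to_pandas(self_destruct=True, split_blocks=True)
del table  # unusable after self_destruct anyway
```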
Assuming these options are stable, my gut feeling is that we should use self_destruct by default. split_blocks, however, would be suitable only via a config option, but I wouldn’t want to add a kwarg with that kind of abstraction leak.
An older option which we should consider is
deduplicate_objects
which reduces the memory footprint for string columns at the cost of conversion speed.
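As a sketch, all three flags would sit on the same to_pandas call if we decided to hard-code them:

```python
import pyarrow as pa

table = pa.table({"country": ["Spain", "Spain", "France"]})

# deduplicate_objects reuses one Python string object per distinct value,
# trading some conversion speed for a smaller pandas footprint.
df = table.to_pandas(
    self_destruct=True,
    split_blocks=True,
    deduplicate_objects=True,
)
```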
This issue is to discuss these options and whether or not to include some of them (or all) in kartothek.
I’m thinking the same thing, but I wouldn’t want to introduce yet another kwarg. In situations like these I often think about a configuration system for kartothek (similar/identical to dask’s), but I’m not sure if it’s worth it. Hard-coding it might have undesired side effects (judging by experience, there will be one or another segfault hidden in there 😛).
split_blocks
Depends. For adding columns to DFs (which we also do when reconstructing primary indices) this is the pandas code. So for reasonably small DFs (column-wise), this could indeed be an improvement. I know from some of our internal pipelines that adding columns to DFs with many columns can be quite expensive due to this hidden operation. So it might be worth a shot.
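To make the difference visible, a toy comparison (inspecting the block manager is a non-public pandas detail and the column names are made up, shown only for illustration):

```python
import numpy as np
import pyarrow as pa

table = pa.table({f"col{i}": np.arange(1_000) for i in range(100)})

consolidated = table.to_pandas()             # same-typed columns share one block
split = table.to_pandas(split_blocks=True)   # one block per column

# _mgr is internal pandas API; only used here to show the block counts.
print(consolidated._mgr.nblocks, split._mgr.nblocks)

# Adding a column (as we do when reconstructing primary indices) can trigger
# a copy/consolidation of the existing blocks, which is what makes this
# expensive for wide DataFrames.
consolidated["primary_index"] = "partition-1"
```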
deduplicate_objects
The only case I can imagine where strings are highly diverse is UUIDs generated per row (like an event stream), but often they are just categories (e.g. "blue", "Spain", …) or ID/UUID references (e.g. a DB entity with an ID/UUID for a color or a country) and therefore contain loads of duplicates. Sounds like a good default value to me.
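A quick way to see the deduplication at work on a category-like column (the counts are rough and only illustrate that duplicate values share one Python object):

```python
import pyarrow as pa

table = pa.table({"color": ["blue", "green", "red"] * 100_000})

dedup = table.to_pandas(deduplicate_objects=True)
plain = table.to_pandas(deduplicate_objects=False)

# With deduplication, the 300k cells reference only ~3 distinct Python
# string objects; without it, roughly one object per cell is created.
print(len({id(x) for x in dedup["color"]}))
print(len({id(x) for x in plain["color"]}))
```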