Use memory-conserving Arrow options
Arrow introduces two options which supposedly help with memory conservation.
self_destruct
frees each column as soon as it has been converted, which renders the pa.Table object useless after the conversion but should reduce the memory footprint. This should be a good option for kartothek. The documentation, however, explicitly flags it as experimental.
split_blocks
will not consolidate all same-typed columns into a single block, yielding a lower peak memory footprint during the conversion. Since we do not know how the data frames are used after Kartothek passes them out, probably not a good option.
https://arrow.apache.org/docs/python/pandas.html#reducing-memory-use-in-table-to-pandas
https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.to_pandas
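For reference, a minimal sketch of how the two flags would be passed in pyarrow (the toy table is just for illustration, not kartothek's actual read path):

```python
import pyarrow as pa

table = pa.table({"color": ["blue", "red"], "value": [1, 2]})

# self_destruct frees each Arrow column's buffers as soon as it has been
# converted; the table must not be used afterwards. split_blocks keeps one
# pandas block per column instead of consolidating same-typed columns.
df = table.to_pandas(self_destruct=True, split_blocks=True)
del table  # unusable after self_destruct anyway
```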
Assuming these options are stable, my gut feeling is that we should use self_destruct by default. split_blocks, however, would be suitable only via a config option, but I wouldn’t want to add a kwarg with that kind of abstraction leak.
An older option which we should consider is
deduplicate_objects
which reduces the memory footprint for string columns at the cost of conversion speed.
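As a sketch, all three flags would sit on the same to_pandas call if we decided to hard-code them:

```python
import pyarrow as pa

table = pa.table({"country": ["Spain", "Spain", "France"]})

# deduplicate_objects reuses one Python string object per distinct value,
# trading some conversion speed for a smaller pandas footprint.
df = table.to_pandas(
    self_destruct=True,
    split_blocks=True,
    deduplicate_objects=True,
)
```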
This issue is to discuss these options and whether or not to include some of them (or all) in kartothek.
I’m thinking the same thing, but I wouldn’t want to introduce yet another kwarg. In situations like these I often think about a configuration system for kartothek (similar/identical to dask’s), but I’m not sure if it’s worth it. Hard-coding it might have undesired side effects (judging by experience, there will be one or another segfault hidden in there 😛).
split_blocks
Depends. For adding columns to DFs (which we also do when reconstructing primary indices) this is the pandas code. So for reasonably small DFs (column-wise), this could indeed be an improvement. I know from some of our internal pipelines that adding columns to DFs with many columns can be quite expensive due to this hidden operation. So it might be worth a shot.
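To make the difference visible, a toy comparison (inspecting the block manager is a non-public pandas detail and the column names are made up, shown only for illustration):

```python
import numpy as np
import pyarrow as pa

table = pa.table({f"col{i}": np.arange(1_000) for i in range(100)})

consolidated = table.to_pandas()             # same-typed columns share one block
split = table.to_pandas(split_blocks=True)   # one block per column

# _mgr is internal pandas API; only used here to show the block counts.
print(consolidated._mgr.nblocks, split._mgr.nblocks)

# Adding a column (as we do when reconstructing primary indices) can trigger
# a copy/consolidation of the existing blocks, which is what makes this
# expensive for wide DataFrames.
consolidated["primary_index"] = "partition-1"
```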
deduplicate_objects
The only case I can imagine where strings are highly diverse is UUIDs generated per row (like an event stream), but often they are just categories (e.g. "blue", "Spain", …) or ID/UUID references (e.g. a DB entity with an ID/UUID for a color or a country) and therefore contain loads of duplicates. Sounds like a good default value to me.
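A quick way to see the deduplication at work on a category-like column (the counts are rough and only illustrate that duplicate values share one Python object):

```python
import pyarrow as pa

table = pa.table({"color": ["blue", "green", "red"] * 100_000})

dedup = table.to_pandas(deduplicate_objects=True)
plain = table.to_pandas(deduplicate_objects=False)

# With deduplication, the 300k cells reference only ~3 distinct Python
# string objects; without it, roughly one object per cell is created.
print(len({id(x) for x in dedup["color"]}))
print(len({id(x) for x in plain["color"]}))
```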