Use memory conserving arrow options

Arrow introduces two options that are supposed to help conserve memory.

self_destruct frees each column as soon as it has been converted. This renders the pa.Table object useless after the conversion but should reduce the memory footprint. This should be a good option for kartothek. The documentation, however, explicitly flags it as experimental.

split_blocks does not consolidate all same-typed columns into a single block, yielding a lower memory footprint. Since we do not know how the data frames are used after kartothek passes them out, this is probably not a good option.

https://arrow.apache.org/docs/python/pandas.html#reducing-memory-use-in-table-to-pandas
https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.to_pandas

Assuming these options are stable, my gut feeling is that we should use self_destruct by default. split_blocks, however, would only be suitable via a config option, and I wouldn’t want to add a kwarg with that kind of abstraction leak.

An older option which we should consider is deduplicate_objects, which reduces the memory footprint for string columns at the cost of conversion speed.

This issue is meant to discuss these options and whether or not to include one of them (or several) in kartothek.

cc @xhochy @marco-neumann-jdas

Issue Analytics

  • State: open
  • Created 4 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
fjetter commented, Apr 8, 2020

> Sounds like a good default value to me.

I’m thinking the same thing, but I wouldn’t want to introduce yet another kwarg. In situations like these I often think about a configuration system for kartothek (similar to, or the same as, the one in dask), but I’m not sure if it’s worth it. Hard-coding it might have undesired side effects (judging by experience, there will be one or another segfault hidden 😛)

0 reactions
marco-neumann-by commented, Apr 8, 2020

split_blocks

Depends. For adding columns to DFs (which we also do when reconstructing primary indices) this is the pandas code. So for reasonably small DFs (column-wise), this could indeed be an improvement. I know from some of our internal pipelines that adding columns to DFs with many columns can be quite expensive due to this hidden operation. So it might be worth a shot.

deduplicate_objects

The only case I can imagine where strings are highly diverse is for UUIDs based on rows (like an event stream), but often they are just categories (e.g. "blue", "Spain", …) or ID/UUID references (e.g. a DB entity with an ID/UUID for a color or a country) and therefore have loads of duplicates. Sounds like a good default value to me.
