Variable deletion consumes a lot of memory
Hi team,
I have been having issues with pandas memory management. Specifically, there is an (at least for me) unavoidable memory peak when attempting to remove variables from a data set. It should be (almost) free! I am getting rid of part of the data, yet pandas still allocates a large amount of memory, producing MemoryErrors.
Just to give you a bit of context, I am working with a DataFrame which contains 33M rows and 500 columns (just a big one!), almost all of them numeric, on a machine with 360GB of RAM. The whole data set fits in memory and I can successfully apply some transformations to the variables. The problem comes when I need to drop 10% of the columns in the table: it produces a big memory peak leading to a MemoryError. Before performing this operation, there are more than 80GB of memory available!
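For scale, some back-of-the-envelope arithmetic (assuming all columns are 8-byte float64, which matches "almost all of them numeric") shows why a single internal copy is enough to exhaust the remaining 80GB:

```python
# Rough size estimate, assuming every column is float64 (8 bytes per value).
rows, cols = 33_000_000, 500
full_gb = rows * cols * 8 / 1e9          # size of the whole frame
after_drop_gb = rows * 450 * 8 / 1e9     # size after dropping 10% of the columns
print(full_gb, after_drop_gb)            # → 132.0 118.8
```

So if dropping the columns internally materialises a ~119GB copy of the surviving data, that copy alone exceeds the ~80GB still free.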
I tried the following methods for removing the columns, and all of them failed:
- drop(), with or without the inplace parameter
- pop()
- reindex()
- reindex_axis()
- del df[column] in a loop over the columns to be removed
- __delitem__(column) in a loop over the columns to be removed
- pop() and drop() in a loop over the columns to be removed

I also tried to reassign the columns, overwriting the data frame using indexing with loc and iloc, but it does not help.
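For reference, here is a minimal sketch of the removal methods listed above on a toy frame (the column names and sizes are illustrative, not the real data set; reindex_axis was later removed from pandas, so the sketch uses reindex):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the 33M-row frame (names/sizes are made up for illustration).
df = pd.DataFrame(np.random.rand(1_000, 50),
                  columns=[f"col{i}" for i in range(50)])
to_remove = [f"col{i}" for i in range(45, 50)]   # ~10% of the columns

# drop(), with and without the inplace parameter
dropped = df.drop(columns=to_remove)             # returns a new frame
tmp = df.copy()
tmp.drop(columns=to_remove, inplace=True)        # mutates tmp

# del df[column] in a loop over the columns to be removed
looped = df.copy()
for col in to_remove:
    del looped[col]

# reindex to the surviving columns (reindex_axis was removed in pandas 1.0)
keep = [c for c in df.columns if c not in to_remove]
reindexed = df.reindex(columns=keep)
```

All of these produce the same 45-column result; the question in this issue is what each one costs in peak memory along the way.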
I found that the drop method with inplace is the most efficient one, but it still generates a huge peak.
I would like to discuss whether there is any way of implementing (or whether it is already implemented, by any chance) a method for removing variables more efficiently, without generating extra memory consumption…
Thank you,
Iván
Issue Analytics
- Created: 6 years ago
- Reactions: 6
- Comments: 8 (6 by maintainers)
Top GitHub Comments
Is there any update on this issue? So far two contradicting solutions have been proposed. What is the best way to delete a column without running out of memory?
@giangdaotr I’ve made a demo to show the cost of using del df[col] vs df.drop(...); the del solution in my example is indeed very expensive. I wonder if the block manager is duplicating RAM under certain conditions (which @jreback notes above). Demo here: https://github.com/ianozsvald/ipython_memory_usage/blob/master/src/ipython_memory_usage/examples/example_usage_np_pd.ipynb (see In[16] onwards).

Personally I’m keen to know more, because reasoning about memory usage in Pandas (and about when/if you get a view or a copy) is pretty tricky. I’m using my ipython_memory_usage tool to try to build up some demos. I’m happy to collect use cases here: https://github.com/ianozsvald/ipython_memory_usage/issues/30
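A rough way to reproduce this kind of comparison outside a notebook is Python's built-in tracemalloc (which also traces NumPy buffers in recent NumPy releases). This is only a sketch: the frame size here is arbitrary and the absolute numbers will vary by pandas version and machine.

```python
import tracemalloc
import numpy as np
import pandas as pd

def peak_mb(fn):
    """Run fn and return the peak memory it allocated, in MB."""
    tracemalloc.start()
    fn()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak / 1e6

df = pd.DataFrame(np.random.rand(200_000, 50))
cols = list(range(45, 50))   # drop the last 10% of the columns

def del_loop():
    d = df.copy()            # copy first, so df survives for repeated runs
    for c in cols:
        del d[c]

print("drop():   %.1f MB" % peak_mb(lambda: df.drop(columns=cols)))
print("del loop: %.1f MB" % peak_mb(del_loop))
```

Comparing the two printed peaks on your own data is a quick way to see whether the block manager is making extra copies in your case.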