memory leak in MultiIndex
See original GitHub issue.

Code Sample, a copy-pastable example if possible
```python
import gc

import numpy as np
import pandas as pd
import psutil


def totmem(p):
    # total memory used by the process, in MB
    info = p.memory_info()
    return 1e-6 * (info.vms + info.rss)


dat_size = 6000000

# make a dataset with no data
# uncomment for a regular index
# dat = pd.DataFrame(index=np.arange(dat_size))
# uncomment for a MultiIndex
dat = pd.DataFrame(index=pd.MultiIndex.from_arrays(
    (np.arange(dat_size), np.arange(dat_size))))

# make a bool vector for subsetting
sub = np.ones(dat_size, dtype=bool)

# init psutil
p = psutil.Process()
gc.collect()
ram = totmem(p)

for i in range(10):
    dat.iloc[sub, :]  # leak happens here
    gc.collect()
    print(int(totmem(p) - ram))
    ram = totmem(p)
```
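The repro exercises boolean `.iloc` indexing specifically. A hypothetical variant for narrowing things down (not from the issue itself): convert the mask to integer positions with `np.flatnonzero` and take those instead, to check whether the leak is tied to the boolean-mask code path.

```python
# Variant of the repro: positional take instead of a boolean mask.
# (Illustrative only; the size is reduced so it runs quickly.)
import numpy as np
import pandas as pd

dat_size = 1000
dat = pd.DataFrame(index=pd.MultiIndex.from_arrays(
    (np.arange(dat_size), np.arange(dat_size))))

sub = np.ones(dat_size, dtype=bool)
pos = np.flatnonzero(sub)      # boolean mask -> integer positions
subset = dat.iloc[pos, :]      # positional indexing instead of dat.iloc[sub, :]
```

Swapping this into the measurement loop above would show whether memory still grows, which separates the mask-handling code from the row gathering itself.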
Problem description
This program leaks at a rate of roughly 191 MB per iteration and eventually runs out of memory if the loop continues indefinitely. The program uses a bool vector to subset a DataFrame. If the DataFrame's index is a MultiIndex, we observe a memory leak (as reported by the print statement); with a regular single-level index, no such leak is observed.
Expected Output
- the program should not run out of memory.
- every output line should be zero.
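A stdlib-only way to cross-check the psutil numbers is `tracemalloc`, which diffs Python-level allocation snapshots. Note that it cannot see allocations made by C extensions outside Python's allocator, so a clean report here would not rule the leak out. A minimal sketch (size reduced, not from the issue):

```python
import tracemalloc

import numpy as np
import pandas as pd

dat_size = 1000
dat = pd.DataFrame(index=pd.MultiIndex.from_arrays(
    (np.arange(dat_size), np.arange(dat_size))))
sub = np.ones(dat_size, dtype=bool)

tracemalloc.start()
before = tracemalloc.take_snapshot()
for _ in range(5):
    dat.iloc[sub, :]
after = tracemalloc.take_snapshot()
tracemalloc.stop()

# Top allocation deltas between the two snapshots, grouped by source line.
stats = after.compare_to(before, "lineno")
for stat in stats[:5]:
    print(stat)
```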
Output of pd.show_versions()
INSTALLED VERSIONS
python: 3.6.6.final.0
python-bits: 64
OS: Linux
pandas: 0.23.4
numpy: 1.15.1
Issue Analytics
- Created: 5 years ago
- Reactions: 1
- Comments: 11 (7 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
NumPy's minimum version will be 1.17.3 in the next release (1.3), which shouldn't have this bug. Going to close, but happy to reopen if this resurfaces.
Can anyone confirm this issue is fixed for numpy 1.15.3 and above?

Update: I did an install of 1.15.3. OP's script still reports the same memory leak. Will try some other versions of numpy and see if it got fixed anywhere.

Update 2: Turns out my environment was a mess. A proper upgrade to 1.15.3 does indeed solve the issue! My pandas version is 0.23.4 and Python version 3.6.6, like the others up above. Please feel free to verify yourselves and close this ticket.
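Before re-running the repro today, it may help to confirm the installed NumPy meets the 1.17.3 floor mentioned in the maintainer comment above. A small illustrative check (the version-parsing approach here is an assumption, not from the thread):

```python
import re

import numpy as np

# Parse "major.minor.patch" out of the version string (ignores rc/dev suffixes).
installed = tuple(int(p) for p in re.findall(r"\d+", np.__version__)[:3])
print("numpy", np.__version__, "meets 1.17.3 floor:", installed >= (1, 17, 3))
```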