Memory leak in pd.read_csv or DataFrame
Code Sample, a copy-pastable example if possible
import sys

# Build a CSV with n columns and m rows of 1s; sizes come from the command line.
m = int(sys.argv[1])
n = int(sys.argv[2])
with open('df.csv', 'wt') as f:
    for i in range(n - 1):
        f.write('c' + str(i) + ',')
    f.write('c' + str(n - 1) + '\n')
    for j in range(m):
        for i in range(n - 1):
            f.write('1,')
        f.write('1\n')

import psutil
print(psutil.Process().memory_info().rss / 1024**2)  # RSS (MB) before reading

import pandas as pd
df = pd.read_csv('df.csv')
print(df.shape)
print(psutil.Process().memory_info().rss / 1024**2)  # RSS (MB) while the DataFrame is alive

import gc
del df
gc.collect()
print(psutil.Process().memory_info().rss / 1024**2)  # RSS (MB) after deleting it
Problem description
$ ~/miniconda3/bin/python3 g.py 1 1
11.60546875
(1, 1)
64.02734375
64.02734375
$ ~/miniconda3/bin/python3 g.py 5000000 15
11.58203125
(5000000, 15)
640.45703125
68.25
$ ~/miniconda3/bin/python3 g.py 5000000 20
11.84375
(5000000, 20)
1586.65625
823.71875 - !!!
$ ~/miniconda3/bin/python3 g.py 10000000 10
11.83984375
(10000000, 10)
830.92578125
67.984375
$ ~/miniconda3/bin/python3 g.py 10000000 15
11.89453125
(10000000, 15)
2344.3046875
1199.89453125 - !!!
Two issues:
- There is a "standard" leak of ~53 MB after reading any CSV, or even just after creating a DataFrame with pd.DataFrame(); a minimal sketch follows this list.
- We see a much larger leak in some other cases (the runs marked "!!!" above).
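To illustrate the first bullet, here is a minimal sketch of the "baseline" bump: it shows up without touching any CSV at all, just from importing pandas and constructing an empty DataFrame. The ~53 MB figure comes from the runs reported above; whether a given environment reproduces it will vary.

import gc
import psutil

def rss_mb():
    return psutil.Process().memory_info().rss / 1024**2

print(rss_mb())          # baseline RSS in MB, before pandas is imported

import pandas as pd
df = pd.DataFrame()      # no CSV involved at all
del df
gc.collect()
print(rss_mb())          # stays well above the baseline (reportedly ~53 MB)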
cc @gfyoung
Output of pd.show_versions()
(same for 0.21, 0.22, 0.23)
pandas: 0.23.0
pytest: None
pip: 9.0.3
setuptools: 39.0.1
Cython: None
numpy: 1.14.3
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: None
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.9999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
Theory: When reading large files with Python via pd.read_csv, csv.reader, plain Python I/O, or mmap, it seems that the reading thread will hold on to memory. If the same thread does a new read, the already allocated memory is reused; if a new thread reads, it acquires additional memory. With pandas on Google, reading 3 files of approx. 100 MB each has required approx. 3 GB that is not released; with csv.reader approx. 300 MB, and with plain read and mmap approx. 200 MB. So a multithreaded read of the 3 files can result in extensive memory use (25 GB+). This is not my home field, but it has been a frustrating week looking for leaks. If I'm wrong, sorry for the disturbance. (Python 3.7 and 3.8)
Sorry, no difference. If I make sure the files are read twice in the same thread, it does not consume or hold more memory. Read them in two different threads, and both hold 2 GB+ as long as the threads live (at least that is how it looks to me).
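For reference, here is a rough sketch of the kind of test described in the two comments above: read the same file twice in the main thread, then once more in a worker thread that is kept alive, checking RSS after each step. It only illustrates the methodology; the file name 'df.csv' and the expectation of elevated RSS while the worker lives are taken from the discussion, not independently verified.

import gc
import threading
import psutil
import pandas as pd

def rss_mb():
    return psutil.Process().memory_info().rss / 1024**2

PATH = 'df.csv'          # any large CSV, e.g. one generated by the script above
done = threading.Event()
stop = threading.Event()

def read_and_wait():
    df = pd.read_csv(PATH)
    del df
    gc.collect()
    done.set()           # reading finished and the DataFrame has been released
    stop.wait()          # keep the thread alive so any per-thread memory stays around

print('baseline          ', rss_mb())

# Two reads in the same (main) thread: per the comment above, RSS should not keep growing.
for _ in range(2):
    df = pd.read_csv(PATH)
    del df
    gc.collect()
print('after 2 main reads', rss_mb())

# One read in a worker thread that stays alive after it finishes reading.
t = threading.Thread(target=read_and_wait)
t.start()
done.wait()
print('worker still alive', rss_mb())  # reportedly stays elevated while the thread lives

stop.set()
t.join()
print('worker joined     ', rss_mb())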