
Memory leak in pd.read_csv or DataFrame

See original GitHub issue

Code Sample, a copy-pastable example if possible

import sys

# m = number of data rows, n = number of columns
m = int(sys.argv[1])
n = int(sys.argv[2])

# Write a CSV with header c0..c(n-1) and m rows of 1s
with open('df.csv', 'wt') as f:
    for i in range(n-1):
        f.write('c' + str(i) + ',')
    f.write('c' + str(n-1) + '\n')
    for j in range(m):
        for i in range(n-1):
            f.write('1,')
        f.write('1\n')


import psutil

# RSS (MB) before reading
print(psutil.Process().memory_info().rss / 1024**2)

import pandas as pd
df = pd.read_csv('df.csv')

print(df.shape)
# RSS (MB) after read_csv
print(psutil.Process().memory_info().rss / 1024**2)

import gc
del df
gc.collect()

# RSS (MB) after deleting the DataFrame and forcing a GC pass
print(psutil.Process().memory_info().rss / 1024**2)
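To interpret the RSS figures below, it can also help to compare them with the memory the DataFrame itself accounts for. A minimal extra check (not part of the original report) that reuses the df.csv written above:

import pandas as pd
import psutil

df = pd.read_csv('df.csv')
# Memory attributed to the DataFrame's own columns, in MB
print(df.memory_usage(deep=True).sum() / 1024**2)
# Memory the OS has assigned to the whole process, in MB; the gap covers
# interpreter overhead plus whatever the C allocator retains after parsing
print(psutil.Process().memory_info().rss / 1024**2)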

Problem description

Each run below prints the process RSS in MB before reading, the DataFrame shape, the RSS after pd.read_csv, and the RSS again after del df and gc.collect():

$ ~/miniconda3/bin/python3 g.py 1 1
11.60546875
(1, 1)
64.02734375
64.02734375

$ ~/miniconda3/bin/python3 g.py 5000000 15
11.58203125
(5000000, 15)
640.45703125
68.25

$ ~/miniconda3/bin/python3 g.py 5000000 20
11.84375
(5000000, 20)
1586.65625
823.71875 - !!!

$ ~/miniconda3/bin/python3 g.py 10000000 10
11.83984375
(10000000, 10)
830.92578125
67.984375

$ ~/miniconda3/bin/python3 g.py 10000000 15
11.89453125
(10000000, 15)
2344.3046875
1199.89453125 - !!!

Two issues:

  1. There is a “standard” leak of roughly 53 MB after reading any CSV, or even after just creating a DataFrame with pd.DataFrame().
  2. In some of the runs above (marked !!!) a much larger amount of memory is not released (a workaround sketch appears below).

cc @gfyoung
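
A plausible explanation for RSS that stays high after del df and gc.collect() is that the freed pages are retained by the C allocator rather than returned to the OS. On Linux with glibc this can be probed with malloc_trim; the following is a sketch under that assumption (Linux/glibc only, not part of the original report):

import ctypes
import gc

import pandas as pd
import psutil

def rss_mb():
    return psutil.Process().memory_info().rss / 1024**2

df = pd.read_csv('df.csv')
del df
gc.collect()
print('after del + gc.collect():', rss_mb())

# Ask glibc malloc to return unused pages to the kernel;
# libc.so.6 / malloc_trim exist only on glibc-based systems.
ctypes.CDLL('libc.so.6').malloc_trim(0)
print('after malloc_trim(0):', rss_mb())

If RSS drops sharply after the malloc_trim(0) call, the memory was being held by the allocator rather than leaked by pandas.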

Output of pd.show_versions()

(same for 0.21, 0.22, 0.23)

pandas: 0.23.0
pytest: None
pip: 9.0.3
setuptools: 39.0.1
Cython: None
numpy: 1.14.3
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: None
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.9999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 14 (5 by maintainers)

Top GitHub Comments

3 reactions
gberth commented, Aug 20, 2020

Theory: when reading large files in Python, whether with pd.read_csv, csv.reader, plain Python I/O, or mmap, the thread that does the reading appears to hold on to memory. If the same thread does a new read, the already-allocated memory is reused; if a new thread reads, it acquires additional memory. With pandas, on Google, reading 3 files of approximately 100 MB each has required approximately 3 GB that is not released; with csv.reader approximately 300 MB, and with plain read and mmap approximately 200 MB. So multithreaded reading of the 3 files can result in extensive memory use (25 GB+). This is not my home field, but it has been a frustrating week looking for leaks. If I’m wrong, sorry for the disturbance. (Python 3.7 and 3.8)
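
A small sketch (not from the original comment) of one way to test this theory, reusing the df.csv produced by the reproduction script above; the file name and the number of threads are illustrative:

import gc
import threading

import pandas as pd
import psutil

def rss_mb():
    return psutil.Process().memory_info().rss / 1024**2

def read_once():
    df = pd.read_csv('df.csv')
    del df
    gc.collect()

# Two reads in the same (main) thread: the second read should be able to
# reuse memory the allocator already holds, so RSS should grow little.
read_once()
print('after 1st read, main thread:', rss_mb())
read_once()
print('after 2nd read, main thread:', rss_mb())

# Two reads in two separate threads: per the theory above, each thread may
# end up with its own allocator arena holding freed memory, so RSS can grow.
for i in range(2):
    t = threading.Thread(target=read_once)
    t.start()
    t.join()
    print('after read in thread', i, ':', rss_mb())

If per-thread allocator arenas are the cause, capping them (for example by exporting the glibc tunable MALLOC_ARENA_MAX=1 before starting Python) would be a plausible mitigation, though whether it helps in this particular case is an assumption, not something verified in the issue.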

0 reactions
gberth commented, Aug 25, 2020

Sorry, no difference. If I make sure the files are read twice in the same thread, it does not consume or hold more memory. Read them in two different threads, and both threads hold 2 GB+ for as long as they live (at least that is how it looks to me).


Top Results From Across the Web

  • pandas.read_csv leaks memory while opening massive files ...
    I am using anaconda and my pandas version is 0.23.1. When dealing with single large file, setting chunksize or iterator=True works fine and ... (a chunked-read sketch follows this list)
  • Analyzing Python Pandas' memory leak and the fix
    Even after doing low_memory=False while reading a CSV using pandas.read_csv, it crashes with MemoryError exception, even though the CSV is not ...
  • Python pandas memory leak with read csv - Stack Overflow
    I was processing a huge csv in chunks and noticed that it is gradually increasing memory. After a lot of print quit and ...
  • where does the memory go in pd.read_csv? - Kaggle
    After loading CSV, memory use goes to 3.7GB. After deleting the dataframe, memory use drops to 2GB. What is this 2GB? It seems ...
  • How to avoid Memory errors with Pandas
    One strategy for solving this kind of problem is to decrease the amount of data by either reducing the number of rows or ...
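
Several of the results above point at the same mitigation: read the file in fixed-size chunks so that only one chunk is resident at a time. A hedged sketch of that approach (not from the issue; the file name, chunk size, and column name are the ones used in the reproduction script above):

import pandas as pd

total = 0
for chunk in pd.read_csv('df.csv', chunksize=1_000_000):
    # Reduce each chunk to what you need instead of keeping the whole frame
    total += chunk['c0'].sum()
print(total)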
