
Memory leak in pd.read_csv or DataFrame

See original GitHub issue

Code Sample, a copy-pastable example if possible

import sys

# m = number of data rows, n = number of columns
m = int(sys.argv[1])
n = int(sys.argv[2])

# Write a CSV with header c0..c(n-1) and m rows of 1s
with open('df.csv', 'wt') as f:
    for i in range(n-1):
        f.write('c' + str(i) + ',')
    f.write('c' + str(n-1) + '\n')
    for j in range(m):
        for i in range(n-1):
            f.write('1,')
        f.write('1\n')


import psutil

# RSS (MB) before reading
print(psutil.Process().memory_info().rss / 1024**2)

import pandas as pd
df = pd.read_csv('df.csv')

print(df.shape)
# RSS (MB) after read_csv
print(psutil.Process().memory_info().rss / 1024**2)

import gc
del df
gc.collect()

# RSS (MB) after deleting the DataFrame and forcing a GC pass
print(psutil.Process().memory_info().rss / 1024**2)
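To interpret the RSS figures below, it can also help to compare them with the memory the DataFrame itself accounts for. A minimal extra check (not part of the original report) that reuses the df.csv written above:

import pandas as pd
import psutil

df = pd.read_csv('df.csv')
# Memory attributed to the DataFrame's own columns, in MB
print(df.memory_usage(deep=True).sum() / 1024**2)
# Memory the OS has assigned to the whole process, in MB; the gap covers
# interpreter overhead plus whatever the C allocator retains after parsing
print(psutil.Process().memory_info().rss / 1024**2)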

Problem description

Each run below prints the process RSS in MB before reading, the DataFrame shape, the RSS after pd.read_csv, and the RSS again after del df and gc.collect():

$ ~/miniconda3/bin/python3 g.py 1 1
11.60546875
(1, 1)
64.02734375
64.02734375

$ ~/miniconda3/bin/python3 g.py 5000000 15
11.58203125
(5000000, 15)
640.45703125
68.25

$ ~/miniconda3/bin/python3 g.py 5000000 20
11.84375
(5000000, 20)
1586.65625
823.71875 - !!!

$ ~/miniconda3/bin/python3 g.py 10000000 10
11.83984375
(10000000, 10)
830.92578125
67.984375

$ ~/miniconda3/bin/python3 g.py 10000000 15
11.89453125
(10000000, 15)
2344.3046875
1199.89453125 - !!!

Two issues:

  1. There is a “standard” leak of roughly 53 MB after reading any CSV, or even after just creating a DataFrame with pd.DataFrame().
  2. In some of the runs above (marked !!!) a much larger amount of memory is not released (a workaround sketch appears below).

cc @gfyoung
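
A plausible explanation for RSS that stays high after del df and gc.collect() is that the freed pages are retained by the C allocator rather than returned to the OS. On Linux with glibc this can be probed with malloc_trim; the following is a sketch under that assumption (Linux/glibc only, not part of the original report):

import ctypes
import gc

import pandas as pd
import psutil

def rss_mb():
    return psutil.Process().memory_info().rss / 1024**2

df = pd.read_csv('df.csv')
del df
gc.collect()
print('after del + gc.collect():', rss_mb())

# Ask glibc malloc to return unused pages to the kernel;
# libc.so.6 / malloc_trim exist only on glibc-based systems.
ctypes.CDLL('libc.so.6').malloc_trim(0)
print('after malloc_trim(0):', rss_mb())

If RSS drops sharply after the malloc_trim(0) call, the memory was being held by the allocator rather than leaked by pandas.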

Output of pd.show_versions()

(same for 0.21, 0.22, 0.23)

pandas: 0.23.0
pytest: None
pip: 9.0.3
setuptools: 39.0.1
Cython: None
numpy: 1.14.3
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: None
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.9999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 14 (5 by maintainers)

Top GitHub Comments

3 reactions
gberth commented, Aug 20, 2020

Theory: when reading large files in Python, whether with pd.read_csv, csv.reader, plain Python I/O, or mmap, the thread that does the reading appears to hold on to memory. If the same thread does a new read, the already-allocated memory is reused; if a new thread reads, it acquires additional memory. With pandas, on Google, reading 3 files of approximately 100 MB each has required approximately 3 GB that is not released; with csv.reader approximately 300 MB, and with plain read and mmap approximately 200 MB. So multithreaded reading of the 3 files can result in extensive memory use (25 GB+). This is not my home field, but it has been a frustrating week looking for leaks. If I’m wrong, sorry for the disturbance. (Python 3.7 and 3.8)
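
A small sketch (not from the original comment) of one way to test this theory, reusing the df.csv produced by the reproduction script above; the file name and the number of threads are illustrative:

import gc
import threading

import pandas as pd
import psutil

def rss_mb():
    return psutil.Process().memory_info().rss / 1024**2

def read_once():
    df = pd.read_csv('df.csv')
    del df
    gc.collect()

# Two reads in the same (main) thread: the second read should be able to
# reuse memory the allocator already holds, so RSS should grow little.
read_once()
print('after 1st read, main thread:', rss_mb())
read_once()
print('after 2nd read, main thread:', rss_mb())

# Two reads in two separate threads: per the theory above, each thread may
# end up with its own allocator arena holding freed memory, so RSS can grow.
for i in range(2):
    t = threading.Thread(target=read_once)
    t.start()
    t.join()
    print('after read in thread', i, ':', rss_mb())

If per-thread allocator arenas are the cause, capping them (for example by exporting the glibc tunable MALLOC_ARENA_MAX=1 before starting Python) would be a plausible mitigation, though whether it helps in this particular case is an assumption, not something verified in the issue.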

0 reactions
gberth commented, Aug 25, 2020

Sorry, no difference. If I make sure the files are read twice in the same thread, it does not consume or hold more memory. Read them in two different threads, and both threads hold 2 GB+ for as long as they live (at least that is how it looks to me).


Top Results From Across the Web

  • pandas.read_csv leaks memory while opening massive files ...
    I am using anaconda and my pandas version is 0.23.1. When dealing with single large file, setting chunksize or iterator=True works fine and ... (a chunked-read sketch follows this list)
  • Analyzing Python Pandas' memory leak and the fix
    Even after doing low_memory=False while reading a CSV using pandas.read_csv, it crashes with MemoryError exception, even though the CSV is not ...
  • Python pandas memory leak with read csv - Stack Overflow
    I was processing a huge csv in chunks and noticed that it is gradually increasing memory. After a lot of print quit and ...
  • where does the memory go in pd.read_csv? - Kaggle
    After loading CSV, memory use goes to 3.7GB. After deleting the dataframe, memory use drops to 2GB. What is this 2GB? It seems ...
  • How to avoid Memory errors with Pandas
    One strategy for solving this kind of problem is to decrease the amount of data by either reducing the number of rows or ...
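
Several of the results above point at the same mitigation: read the file in fixed-size chunks so that only one chunk is resident at a time. A hedged sketch of that approach (not from the issue; the file name, chunk size, and column name are the ones used in the reproduction script above):

import pandas as pd

total = 0
for chunk in pd.read_csv('df.csv', chunksize=1_000_000):
    # Reduce each chunk to what you need instead of keeping the whole frame
    total += chunk['c0'].sum()
print(total)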
