
Pandas read_csv out of memory even after adding chunksize

See original GitHub issue

Code Sample, a copy-pastable example if possible

import pandas


def foo(table, exceptions):
    """
    Modifies the columns of the dataframe in place to be categories, largely to save space.

    :type table: pandas.DataFrame
    :type exceptions: set of column names not to modify
    :rtype: pandas.DataFrame
    """
    for c in table:
        if c in exceptions:
            continue

        x = table[c]
        if str(x.dtype) != 'category':
            x.fillna('null', inplace=True)
            table[c] = x.astype('category', copy=False)
    return table


# inputFolder, dataFile, fieldsToKeep, categoryExceptions and TimeMe are defined elsewhere.
dataframe = pandas.read_csv(inputFolder + dataFile, chunksize=1000000, na_values='null',
                            usecols=fieldsToKeep, low_memory=False, header=0, sep='\t')
# map is lazy in Python 3, so foo only runs on each chunk as `tables` is consumed.
tables = map(lambda table: TimeMe(foo)(table, categoryExceptions), dataframe)

Problem description

I have a 34 GB tsv file and I’ve been reading it using pandas’ read_csv function with chunksize specified as 1000000. The command above works fine with an 8 GB file, but pandas crashes for my 34 GB file, subsequently crashing my IPython notebook.
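
A minimal sketch of one way to keep the peak footprint bounded: consume the chunk iterator one chunk at a time and write each converted chunk back to disk, rather than keeping every converted table in memory. The file path, column list and exception set below are placeholders standing in for the names used in the code sample above; they are not part of the original report.

import pandas as pd

# Placeholder inputs standing in for inputFolder + dataFile, fieldsToKeep and
# categoryExceptions from the code sample above.
input_path = 'data.tsv'
fields_to_keep = ['col_a', 'col_b', 'col_c']
category_exceptions = {'col_c'}

reader = pd.read_csv(input_path, sep='\t', usecols=fields_to_keep,
                     na_values='null', chunksize=1000000)

for i, chunk in enumerate(reader):
    for c in chunk.columns:
        if c not in category_exceptions:
            chunk[c] = chunk[c].fillna('null').astype('category')
    # Append each converted chunk to the output so only one chunk is held in memory at a time.
    chunk.to_csv('converted.tsv', sep='\t', mode='a', header=(i == 0), index=False)

Writing back to CSV drops the category dtype on disk, so this only illustrates the streaming pattern; a dtype-preserving format such as Parquet could be substituted if pyarrow is available.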

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Comments: 19 (6 by maintainers)

Top GitHub Comments

11 reactions
silva-luana commented, Apr 23, 2018

I’ve solved the memory error problem using chunks AND low_memory=False

import pandas as pd

chunksize = 100000
chunks = []
for chunk in pd.read_csv('OFMESSAGEARCHIVE.csv', chunksize=chunksize, low_memory=False):
    chunks.append(chunk)
df = pd.concat(chunks, axis=0)
1 reaction
stock-ds commented, Mar 5, 2019

I’ve solved the memory error problem using smaller chunks (size 1). It was like 3x slower, but it didn’t error out. low_memory=False didn’t work

import pandas as pd

chunksize = 1
chunks = []
for chunk in pd.read_csv('OFMESSAGEARCHIVE.csv', chunksize=chunksize):
    chunks.append(chunk)
df = pd.concat(chunks, axis=0)
Read more comments on GitHub >
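
Between the two approaches above (100,000-row chunks versus single-row chunks), it can help to size chunks to the RAM that is actually available. The helper below is hypothetical, not something from the thread; it reads one trial chunk and reports its footprint using pandas’ own memory_usage accounting.

import pandas as pd

def estimate_chunk_memory(path, chunksize, **read_kwargs):
    # Read a single trial chunk and return its in-memory footprint in bytes.
    reader = pd.read_csv(path, chunksize=chunksize, **read_kwargs)
    first_chunk = next(reader)
    return first_chunk.memory_usage(deep=True).sum()

# Example: roughly how much RAM does a 100,000-row chunk of the file need?
bytes_per_chunk = estimate_chunk_memory('OFMESSAGEARCHIVE.csv', chunksize=100000)
print(f'about {bytes_per_chunk / 1e6:.1f} MB per 100,000-row chunk')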

Top Results From Across the Web

Pandas read_csv() 1.2GB file out of memory on VM with ...
This sounds like a job for chunksize. It splits the input process into multiple chunks, reducing the required reading memory. df =...

Reducing Pandas memory usage #3: Reading in chunks
Reduce Pandas memory usage by loading and then processing a file in chunks rather than all at once, using Pandas' chunksize option.

Scaling to large datasets — pandas 1.1.5 documentation
Pandas provides data structures for in-memory analytics, which makes using pandas to analyze datasets that are larger than memory datasets somewhat tricky.

pandas 0.18: out of memory error when reading CSV file with ...
The question is a duplicate. When you do read_csv but don't specify dtypes, if you read in floats, ints, dates and categoricals as...

Loading large datasets in Pandas - Towards Data Science
To enable chunking, we will declare the size of the chunk in the beginning. Then using read_csv() with the chunksize parameter, returns an ...
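
The pandas 0.18 result above points at a related cause: when read_csv has to infer dtypes, string columns arrive as object and can take far more RAM than the file does on disk. Declaring compact dtypes (including category) at read time avoids building those object columns at all. A minimal sketch with hypothetical column names:

import pandas as pd

# Hypothetical column names; declare compact dtypes up front so read_csv never
# materialises full object-dtype columns during inference.
dtypes = {'user_id': 'category', 'event': 'category', 'value': 'float32'}

reader = pd.read_csv('data.tsv', sep='\t', dtype=dtypes, chunksize=1000000)
for chunk in reader:
    ...  # process each chunk while it is the only one held in memory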
