
Pandas read_csv out of memory even after adding chunksize

See original GitHub issue

Code Sample, a copy-pastable example if possible

import pandas


def foo(table, exceptions):
    """
    Modifies the columns of the dataframe in place to be categories, largely to save space.

    :type table: pandas.DataFrame
    :type exceptions: set of column names not to modify
    :rtype: pandas.DataFrame
    """
    for c in table:
        if c in exceptions:
            continue

        x = table[c]
        if str(x.dtype) != 'category':
            x.fillna('null', inplace=True)
            table[c] = x.astype('category', copy=False)
    return table


# inputFolder, dataFile, fieldsToKeep, categoryExceptions and TimeMe are defined elsewhere.
dataframe = pandas.read_csv(inputFolder + dataFile, chunksize=1000000, na_values='null',
                            usecols=fieldsToKeep, low_memory=False, header=0, sep='\t')
# map is lazy in Python 3, so foo only runs on each chunk as `tables` is consumed.
tables = map(lambda table: TimeMe(foo)(table, categoryExceptions), dataframe)

Problem description

I have a 34 GB tsv file and I’ve been reading it using pandas’ read_csv function with chunksize specified as 1000000. The command above works fine with an 8 GB file, but pandas crashes for my 34 GB file, subsequently crashing my IPython notebook.
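
A minimal sketch of one way to keep the peak footprint bounded: consume the chunk iterator one chunk at a time and write each converted chunk back to disk, rather than keeping every converted table in memory. The file path, column list and exception set below are placeholders standing in for the names used in the code sample above; they are not part of the original report.

import pandas as pd

# Placeholder inputs standing in for inputFolder + dataFile, fieldsToKeep and
# categoryExceptions from the code sample above.
input_path = 'data.tsv'
fields_to_keep = ['col_a', 'col_b', 'col_c']
category_exceptions = {'col_c'}

reader = pd.read_csv(input_path, sep='\t', usecols=fields_to_keep,
                     na_values='null', chunksize=1000000)

for i, chunk in enumerate(reader):
    for c in chunk.columns:
        if c not in category_exceptions:
            chunk[c] = chunk[c].fillna('null').astype('category')
    # Append each converted chunk to the output so only one chunk is held in memory at a time.
    chunk.to_csv('converted.tsv', sep='\t', mode='a', header=(i == 0), index=False)

Writing back to CSV drops the category dtype on disk, so this only illustrates the streaming pattern; a dtype-preserving format such as Parquet could be substituted if pyarrow is available.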

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Comments: 19 (6 by maintainers)

Top GitHub Comments

11 reactions
silva-luana commented, Apr 23, 2018

I’ve solved the memory error problem using chunks AND low_memory=False

import pandas as pd

chunksize = 100000
chunks = []
for chunk in pd.read_csv('OFMESSAGEARCHIVE.csv', chunksize=chunksize, low_memory=False):
    chunks.append(chunk)
df = pd.concat(chunks, axis=0)
1 reaction
stock-ds commented, Mar 5, 2019

I’ve solved the memory error problem using smaller chunks (size 1). It was like 3x slower, but it didn’t error out. low_memory=False didn’t work

import pandas as pd

chunksize = 1
chunks = []
for chunk in pd.read_csv('OFMESSAGEARCHIVE.csv', chunksize=chunksize):
    chunks.append(chunk)
df = pd.concat(chunks, axis=0)
Read more comments on GitHub >
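
Between the two approaches above (100,000-row chunks versus single-row chunks), it can help to size chunks to the RAM that is actually available. The helper below is hypothetical, not something from the thread; it reads one trial chunk and reports its footprint using pandas’ own memory_usage accounting.

import pandas as pd

def estimate_chunk_memory(path, chunksize, **read_kwargs):
    # Read a single trial chunk and return its in-memory footprint in bytes.
    reader = pd.read_csv(path, chunksize=chunksize, **read_kwargs)
    first_chunk = next(reader)
    return first_chunk.memory_usage(deep=True).sum()

# Example: roughly how much RAM does a 100,000-row chunk of the file need?
bytes_per_chunk = estimate_chunk_memory('OFMESSAGEARCHIVE.csv', chunksize=100000)
print(f'about {bytes_per_chunk / 1e6:.1f} MB per 100,000-row chunk')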

Top Results From Across the Web

Pandas read_csv() 1.2GB file out of memory on VM with ...
This sounds like a job for chunksize. It splits the input process into multiple chunks, reducing the required reading memory. df =...

Reducing Pandas memory usage #3: Reading in chunks
Reduce Pandas memory usage by loading and then processing a file in chunks rather than all at once, using Pandas' chunksize option.

Scaling to large datasets — pandas 1.1.5 documentation
Pandas provides data structures for in-memory analytics, which makes using pandas to analyze datasets that are larger than memory datasets somewhat tricky.

pandas 0.18: out of memory error when reading CSV file with ...
The question is a duplicate. When you do read_csv but don't specify dtypes, if you read in floats, ints, dates and categoricals as...

Loading large datasets in Pandas - Towards Data Science
To enable chunking, we will declare the size of the chunk in the beginning. Then using read_csv() with the chunksize parameter, returns an ...
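
The pandas 0.18 result above points at a related cause: when read_csv has to infer dtypes, string columns arrive as object and can take far more RAM than the file does on disk. Declaring compact dtypes (including category) at read time avoids building those object columns at all. A minimal sketch with hypothetical column names:

import pandas as pd

# Hypothetical column names; declare compact dtypes up front so read_csv never
# materialises full object-dtype columns during inference.
dtypes = {'user_id': 'category', 'event': 'category', 'value': 'float32'}

reader = pd.read_csv('data.tsv', sep='\t', dtype=dtypes, chunksize=1000000)
for chunk in reader:
    ...  # process each chunk while it is the only one held in memory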
