Pandas read_csv out of memory even after adding chunksize
Code Sample, a copy-pastable example if possible
import pandas

# inputFolder, dataFile, fieldsToKeep, categoryExceptions and TimeMe are defined elsewhere in the notebook.

def foo(table, exceptions):
    """
    Modifies the columns of the dataframe in place to be categories, largely to save space.
    :type table: pandas.DataFrame
    :type exceptions: set of column names not to modify.
    :rtype: pandas.DataFrame
    """
    for c in table:
        if c in exceptions:
            continue
        x = table[c]
        if str(x.dtype) != 'category':
            x.fillna('null', inplace=True)
            table[c] = x.astype('category', copy=False)
    return table

# With chunksize set, read_csv returns an iterator of DataFrame chunks, not one DataFrame.
dataframe = pandas.read_csv(inputFolder + dataFile, chunksize=1000000, na_values='null',
                            usecols=fieldsToKeep, low_memory=False, header=0, sep='\t')
tables = map(lambda table: TimeMe(foo)(table, categoryExceptions), dataframe)
Problem description
I have a 34 GB TSV file and I’ve been reading it with pandas read_csv, with chunksize set to 1000000. The command above works fine with an 8 GB file, but pandas crashes for my 34 GB file, subsequently crashing my IPython notebook.
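One thing worth noting is that with chunksize, read_csv returns an iterator of DataFrames rather than a single DataFrame, so memory only stays bounded if each chunk is reduced or discarded after it is processed; if every converted chunk is kept (which is what consuming tables into a list would do), the process still ends up holding roughly the whole file. Below is a minimal sketch of the bounded pattern; the path and column names are hypothetical placeholders, not from the original report.

import pandas as pd

# Path and column names are hypothetical; substitute your own.
reader = pd.read_csv('data/huge_file.tsv', sep='\t',
                     usecols=['country', 'event_type'],
                     na_values='null', chunksize=1000000)

partial_counts = []
for chunk in reader:                       # each chunk is an ordinary DataFrame
    for col in chunk.columns:
        chunk[col] = chunk[col].fillna('null').astype('category')
    # keep only a reduced result per chunk instead of the chunk itself
    partial_counts.append(chunk['country'].value_counts())

totals = pd.concat(partial_counts).groupby(level=0).sum()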
Issue Analytics
- State:
- Created 6 years ago
- Comments: 19 (6 by maintainers)
Top Results From Across the Web

Pandas read_csv() 1.2GB file out of memory on VM with ...
This sounds like a job for chunksize. It splits the input process into multiple chunks, reducing the required reading memory. df =...

Reducing Pandas memory usage #3: Reading in chunks
Reduce Pandas memory usage by loading and then processing a file in chunks rather than all at once, using Pandas' chunksize option.

Scaling to large datasets — pandas 1.1.5 documentation
Pandas provides data structures for in-memory analytics, which makes using pandas to analyze datasets that are larger than memory datasets somewhat tricky.

pandas 0.18: out of memory error when reading CSV file with ...
The question is a duplicate. When you do read_csv but don't specify dtypes, if you read in floats, ints, dates and categoricals as...

Loading large datasets in Pandas - Towards Data Science
To enable chunking, we will declare the size of the chunk in the beginning. Then using read_csv() with the chunksize parameter, returns an ...
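Several of the results above point to the same root cause: when dtypes are left unspecified, read_csv has to infer them while parsing, which temporarily holds higher-memory object intermediates. Declaring dtypes up front (including 'category') is another way to cut peak memory. The sketch below assumes a made-up schema; the column names and dtypes are placeholders.

import pandas as pd

# Hypothetical schema; substitute the real column names and dtypes.
dtypes = {
    'country': 'category',
    'event_type': 'category',
    'value': 'float64',      # float tolerates missing values without upcasting
}

reader = pd.read_csv('data/huge_file.tsv', sep='\t',
                     usecols=list(dtypes), dtype=dtypes,
                     na_values='null', chunksize=1000000)

for chunk in reader:
    pass  # process and discard each chunk here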
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
I’ve solved the memory error problem using chunks AND low_memory=False.

I’ve solved the memory error problem using smaller chunks (size 1). It was like 3x slower, but it didn’t error out. low_memory=False didn’t work.
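Before committing to the full 34 GB run, it can help to confirm on a single chunk that the category conversion actually shrinks the frame; DataFrame.memory_usage(deep=True) reports the true footprint including string contents. A small check, again with a hypothetical path and column name:

import pandas as pd

# Grab one chunk only; path and column name are hypothetical.
chunk = next(pd.read_csv('data/huge_file.tsv', sep='\t', chunksize=1000000))

before = chunk.memory_usage(deep=True).sum()
chunk['country'] = chunk['country'].astype('category')
after = chunk.memory_usage(deep=True).sum()

print('before: %.1f MB, after: %.1f MB' % (before / 1e6, after / 1e6))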