read_json with lines=True not using buff/cache memory

I have a 3.2 GB JSON file that I am trying to read into pandas using pd.read_json(lines=True). When I run that, I get a MemoryError, even though my system has more than 12 GB of available memory. This is pandas version 0.20.2.

I’m on Ubuntu, and the free command shows >12GB of “Available” memory, most of which is “buff/cache”.

I’m able to read the file into a dataframe by iterating over the file like so:

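import itertools
from io import StringIO

import pandas as pd

# fp is the path to the line-delimited JSON file
dfs = []
with open(fp, 'r') as f:
    while True:
        # Take the next batch of up to 1000 lines from the file
        lines = list(itertools.islice(f, 1000))

        if lines:
            # Parse the batch on its own and keep the resulting frame
            lines_str = ''.join(lines)
            dfs.append(pd.read_json(StringIO(lines_str), lines=True))
        else:
            # An empty batch means the file is exhausted
            break

# Combine the per-batch frames into a single dataframe
df = pd.concat(dfs)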

You’ll notice that at the end of this I have the original data in memory twice (once in the list of per-batch dataframes and once in the final df), yet it runs without problems.

It seems that pd.read_json with lines=True doesn’t use the available memory, which looks to me like a bug.

Issue Analytics

Created 6 years ago
Reactions: 2
Comments: 27 (20 by maintainers)

Top GitHub Comments

4 reactions

vamlumber commented, Feb 8, 2019

The problem still exists. I am loading a 5 GB JSON file on a machine with 16 GB of RAM, but I still get a memory error. The lines=True option still does not work as expected.

1 reaction

louispotok commented, Aug 3, 2017

Here goes: https://github.com/pandas-dev/pandas/pull/17168.
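
For readers on a pandas version where read_json accepts a chunksize argument together with lines=True, the manual batching loop from the question can be sketched more compactly. This is only an illustration under that assumption, not a statement of what the linked pull request contains; the helper name read_json_lines_chunked and the sample data are hypothetical.

```python
import io

import pandas as pd


def read_json_lines_chunked(path_or_buf, chunksize=1000):
    """Read line-delimited JSON in batches and combine the results.

    Hypothetical helper: assumes a pandas version in which
    pd.read_json supports chunksize when lines=True.
    """
    # With chunksize set, read_json returns an iterator of dataframes
    # instead of parsing the whole input in one pass.
    reader = pd.read_json(path_or_buf, lines=True, chunksize=chunksize)
    chunks = list(reader)
    # ignore_index=True gives the combined frame a fresh 0..n-1 index.
    return pd.concat(chunks, ignore_index=True)


# Small in-memory sample standing in for the large file from the question.
sample = io.StringIO(
    '{"a": 1, "b": "x"}\n'
    '{"a": 2, "b": "y"}\n'
    '{"a": 3, "b": "z"}\n'
)
df = read_json_lines_chunked(sample, chunksize=2)
print(df)
```

As in the original workaround, all batches are still held in memory before the final concat, so peak usage is roughly twice the size of the parsed data. What this avoids is parsing the entire file as a single string.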