read_json with lines=True not using buff/cache memory

I have a 3.2 GB JSON file that I am trying to read into pandas using pd.read_json(lines=True). When I run that, I get a MemoryError, even though my system has >12 GB of available memory. This is pandas version 0.20.2.

I’m on Ubuntu, and the free command shows >12GB of “Available” memory, most of which is “buff/cache”.
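
For readers who want to check the same numbers programmatically, here is a minimal sketch that parses /proc/meminfo-style text, which is where the "available" figure shown by free comes from on Linux. The function name parse_meminfo and the SAMPLE values are hypothetical and only illustrate the format; they are not measurements from the system described in this issue.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer kB values."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Key: value" pairs
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


# Hypothetical sample, shaped like the real file; not data from the issue.
SAMPLE = """\
MemTotal:       16384000 kB
MemFree:          512000 kB
MemAvailable:   12800000 kB
Buffers:          256000 kB
Cached:         12000000 kB
"""

info = parse_meminfo(SAMPLE)
# MemAvailable is reported in kB; convert to GB for readability.
available_gb = info["MemAvailable"] / 1024 ** 2
print(f"available: {available_gb:.1f} GB")
```

On a real Linux machine the same function could be fed the contents of /proc/meminfo, for example parse_meminfo(open("/proc/meminfo").read()).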

I’m able to read the file into a dataframe by iterating over the file like so:

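import itertools
from io import StringIO

import pandas as pd

dfs = []
with open(fp, 'r') as f:  # fp: path to the JSON Lines file
    while True:
        # Read the next batch of up to 1000 lines
        lines = list(itertools.islice(f, 1000))
        if not lines:
            break
        dfs.append(pd.read_json(StringIO(''.join(lines)), lines=True))

df = pd.concat(dfs)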

You’ll notice that at the end of this I have the original data in memory twice (in the list and in the final df), but no problems.
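
The batching workaround above can also be sketched with the chunksize argument of pd.read_json. This assumes a pandas release newer than the 0.20.2 reported here, one in which read_json accepts chunksize together with lines=True and returns an iterator of DataFrames. The helper name read_json_lines_in_chunks and the demo data are hypothetical, and this is only a sketch, not necessarily how the issue was resolved.

```python
import json
import os
import tempfile

import pandas as pd


def read_json_lines_in_chunks(path, chunksize=1000):
    """Read a JSON Lines file in pieces and concatenate the pieces."""
    # Assumption: this pandas version supports chunksize with lines=True,
    # in which case read_json returns an iterator of DataFrames.
    reader = pd.read_json(path, lines=True, chunksize=chunksize)
    chunks = [chunk for chunk in reader]
    return pd.concat(chunks, ignore_index=True)


# Small demo on a throwaway file with made-up records.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    for i in range(5):
        tmp.write(json.dumps({"a": i, "b": i * 2}) + "\n")
    demo_path = tmp.name

demo_df = read_json_lines_in_chunks(demo_path, chunksize=2)
os.remove(demo_path)
print(demo_df.shape)
```

As with the manual loop, the final concat still needs the whole dataset to fit in memory; only the parsing step is done in smaller pieces.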

It seems that pd.read_json with lines=True doesn’t use the available memory, which looks to me like a bug.

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Reactions: 2
  • Comments: 27 (20 by maintainers)

Top GitHub Comments

4 reactions
vamlumber commented, Feb 8, 2019

The problem still exists. I am loading a 5 GB JSON file with 16 GB of RAM, but I still get a MemoryError. The lines=True option still does not work as expected.

1 reaction
louispotok commented, Aug 3, 2017

Top Results From Across the Web

Java OutOfMemoryError due to Linux RAM disk cache not freed
According to our sysadmin, the problem is that the RAM that the kernel uses as a disk cache (apparently all but 8MB) is...

Why cannot use buff/cache? - Server Fault
The column "buff/cache" reports the sum of memory in use for buffers and cache. Buffers cannot be reclaimed because they are needed by...

How to read .json to output an specific number? - Ask Ubuntu
A better answer can probably be made using a tool like jq but a brute force method would be: ... Not pretty but...

"buff/cache" is very high, how I can free it? [duplicate]
Both you and Linux agree that memory taken by applications is "used", while memory that isn't used for anything is "free". But how...

Linux memory is used as buff/cache - Operating System
This results in journald restarting. When that happens, journald will not release file handles, resulting in memory used by logs never being ...
