read_json with lines=True not using buff/cache memory
I have a 3.2 GB JSON file that I am trying to read into pandas using pd.read_json(lines=True). When I run that, I get a MemoryError, even though my system has >12GB of available memory. This is pandas version 0.20.2.
I’m on Ubuntu, and the free command shows >12GB of “Available” memory, most of which is “buff/cache”.
I’m able to read the file into a dataframe by iterating over the file like so:
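import itertools
from io import StringIO

import pandas as pd

# fp is the path to the line-delimited JSON file
dfs = []
with open(fp, 'r') as f:
    while True:
        # read at most 1000 lines per iteration
        lines = list(itertools.islice(f, 1000))
        if lines:
            lines_str = ''.join(lines)
            dfs.append(pd.read_json(StringIO(lines_str), lines=True))
        else:
            break
df = pd.concat(dfs)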
You’ll notice that at the end of this I have the original data in memory twice (in the list and in the final df), but no problems.
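For anyone who wants to reuse this workaround, here is a minimal sketch of the same loop wrapped in a function. The names read_json_lines_in_chunks and lines_per_chunk are made up for illustration and are not part of pandas; the only difference from the snippet above is that it passes ignore_index=True to pd.concat, which is my own choice and not something the original code does.

```python
import itertools
from io import StringIO

import pandas as pd


def read_json_lines_in_chunks(path, lines_per_chunk=1000):
    """Read a line-delimited JSON file into one DataFrame, chunk by chunk.

    Hypothetical helper for illustration; not a pandas API.
    """
    frames = []
    with open(path, "r") as f:
        while True:
            # islice pulls at most `lines_per_chunk` lines from the open file
            lines = list(itertools.islice(f, lines_per_chunk))
            if not lines:
                break
            frames.append(pd.read_json(StringIO("".join(lines)), lines=True))
    if not frames:
        # pd.concat raises on an empty list, so return an empty frame instead
        return pd.DataFrame()
    # ignore_index replaces the repeated per-chunk indices with one 0..n-1 index
    return pd.concat(frames, ignore_index=True)
```

Calling read_json_lines_in_chunks(fp) should give the same rows as the loop above. Without ignore_index=True, each chunk keeps its own 0-based index, so the concatenated frame has repeated index labels.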
It seems that pd.read_json with lines=True doesn’t use the available memory, which looks to me like a bug.
Issue Analytics
- State:
- Created 6 years ago
- Reactions:2
- Comments:27 (20 by maintainers)
Top GitHub Comments
The problem still exists. I am loading a 5 GB JSON file with 16 GB of RAM, but I still get a memory error. The lines=True option still does not work as expected.
Here goes: https://github.com/pandas-dev/pandas/pull/17168.
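For readers on a newer pandas release: read_json accepts a chunksize argument when lines=True and then returns an iterator of DataFrames instead of a single frame. The sketch below assumes your installed version supports that argument (0.20.2 from the original report does not); the function name read_json_lines_chunked is made up for illustration.

```python
import pandas as pd


def read_json_lines_chunked(path, chunksize=1000):
    """Hypothetical helper: read line-delimited JSON in pieces via chunksize."""
    # With lines=True and chunksize set, read_json yields one DataFrame per chunk
    reader = pd.read_json(path, lines=True, chunksize=chunksize)
    frames = [chunk for chunk in reader]
    # Assumes the file has at least one record; pd.concat raises on an empty list
    return pd.concat(frames, ignore_index=True)
```

If the goal is to keep peak memory low, it may be better to process each chunk inside the loop and keep only the reduced result, since concatenating every chunk still ends with the whole dataset in memory.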