Reading a large JSON Lines dataset fails because rows have different columns
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 20.04.4 LTS
- Modin version (modin.__version__): 0.15.3
- Python version: 3.10.6
- Code we can use to reproduce:
import modin.pandas as pd
pd.read_json("Automotive_5.json", lines=True)
where Automotive_5.json is the "small" (1,711,519 lines) automotive dataset from https://nijianmo.github.io/amazon/index.html
I tried to reproduce it by generating a synthetic JSON Lines file in which different JSON objects have different fields, to no avail. I have not yet tried shrinking the dataset to see at what size it stops failing.
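For concreteness, here is a minimal sketch of the kind of generator described above (the file name and field names are made up for illustration); optional fields appear only on some lines, so the first line does not necessarily contain every column:

import json
import random

# Sketch: write a JSON Lines file whose objects have differing key sets.
with open("synthetic.jsonl", "w") as f:
    for i in range(1_000_000):
        row = {"id": i, "rating": random.randint(1, 5)}
        # Optional fields are present only on a fraction of the lines.
        if random.random() < 0.5:
            row["vote"] = str(random.randint(1, 100))
        if random.random() < 0.1:
            row["image"] = ["http://example.com/img.jpg"]
        f.write(json.dumps(row) + "\n")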
Describe the problem
I am trying to load a large JSON Lines file (actually the whole dataset, but the problem also reproduces with the smaller one above) and it errors out with ValueError('Columns must be the same across all rows.'). I can work around this by loading the dataset with pandas and then converting the result to a Modin dataframe, but that is unnecessarily slow.
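For anyone hitting the same error, a minimal sketch of that workaround (assuming the file fits in memory on a single node):

import pandas
import modin.pandas as pd

# Workaround sketch: parse with plain pandas, which unions columns
# across lines, then hand the result to Modin.
pandas_df = pandas.read_json("Automotive_5.json", lines=True)
modin_df = pd.DataFrame(pandas_df)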
Source code / logs
Issue Analytics
- State:
- Created: a year ago
- Comments: 6 (4 by maintainers)
@WaDelma thank you for opening this issue! I am able to reproduce this on the latest master.
@mvashishtha I took a look, and it seems like we're making an incorrect assumption about the columns of the resulting dataframe based on the first line of the JSON file. For example, in json_dispatcher we try to get a gist of all the possible columns by looking at the first line of the JSON file. However, it is possible that other lines have key-value pairs that do not exist in the first line. Does this fix fall under the things you've been working on already?

@mertbozkir, your error seems unrelated to the original issue. To avoid it, you should put your code under if __name__ == '__main__':. See this section for more about your issue.
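For completeness, a minimal sketch of that guard (the file name is illustrative); Modin's execution engines can spawn worker processes, so on some platforms the script's entry point needs to be protected:

import modin.pandas as pd

def main():
    df = pd.read_json("Automotive_5.json", lines=True)
    print(df.head())

if __name__ == "__main__":
    # Guard the entry point so that worker processes spawned by the
    # execution engine do not re-run this code when importing the script.
    main()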