Reading large json lines dataset fails because of rows having different columns

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 20.04.4 LTS
  • Modin version (modin.__version__): 0.15.3
  • Python version: 3.10.6
  • Code we can use to reproduce:
import modin.pandas as pd
pd.read_json("Automotive_5.json", lines=True)

where Automotive_5.json is the “small” (1711519 lines) automotive dataset from https://nijianmo.github.io/amazon/index.html. I tried to reproduce the failure by generating a synthetic JSON lines file in which different JSON objects have different fields, to no avail. I have not yet tried shrinking the dataset to see when it stops failing.
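For readers who want to try the same experiment, here is a minimal sketch of how such a synthetic JSON lines file could be generated. It is not the reporter's script; the function name `write_synthetic_jsonl` and the field names `id`, `overall`, `vote`, and `style` are made up for illustration.

```python
import json
import os
import tempfile


def write_synthetic_jsonl(path, n_rows=100):
    """Write n_rows JSON objects, one per line, where some rows carry extra keys."""
    with open(path, "w", encoding="utf-8") as f:
        for i in range(n_rows):
            row = {"id": i, "overall": 1 + i % 5}  # keys present in every row
            if i % 3 == 1:
                row["vote"] = str(i)  # hypothetical optional field
            if i % 4 == 2:
                row["style"] = "black"  # hypothetical optional field
            f.write(json.dumps(row) + "\n")


# Write the file to a temporary directory and inspect which key sets occur.
path = os.path.join(tempfile.mkdtemp(), "synthetic.json")
write_synthetic_jsonl(path)

with open(path, encoding="utf-8") as f:
    key_sets = {tuple(sorted(json.loads(line))) for line in f}
print(len(key_sets))  # number of distinct key combinations in the file
```

The resulting file can be passed to `pd.read_json(path, lines=True)` as in the snippet above. According to the report, a file of this kind did not trigger the error, so the real dataset presumably differs in some other respect, such as size.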

Describe the problem

I am trying to load a large JSON lines file (actually the whole dataset, but the problem reproduces with the smaller one) and it errors out with ValueError('Columns must be the same across all rows.'). I can work around this by loading the dataset with pandas and then converting it to a Modin dataframe, but that is unnecessarily slow.

Source code / logs

output.txt

Issue Analytics

  • State: open
  • Created a year ago
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

1 reaction
pyrito commented, Sep 12, 2022

@WaDelma thank you for opening this issue! I am able to reproduce this on the latest master.

@mvashishtha I took a look and it seems like we’re making an incorrect assumption about the columns of the resulting dataframe from the first line of the JSON file. For example, in json_dispatcher, we try to get a gist of all the possible columns by taking a look at the first line of the JSON file. However, it is possible that other lines will have key-value pairs that do not exist in the first line of the file. Does this fix fall under the things you’ve been working on already?
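To illustrate the assumption described above, here is a small standard-library sketch. It is not Modin's code: the helper names `columns_from_first_line` and `columns_from_all_lines` and the sample rows are invented for this example. It only shows how a column list inferred from the first line can miss keys that appear later in the file.

```python
import io
import json

# Made-up sample rows: later lines contain keys the first line lacks.
lines = io.StringIO(
    '{"reviewerID": "A1", "overall": 5.0}\n'
    '{"reviewerID": "A2", "overall": 4.0, "vote": "3"}\n'
    '{"reviewerID": "A3", "overall": 1.0, "style": "black"}\n'
)


def columns_from_first_line(f):
    """Infer columns from the first JSON object only."""
    f.seek(0)
    return list(json.loads(f.readline()))


def columns_from_all_lines(f):
    """Collect the union of keys over every line, in first-seen order."""
    f.seek(0)
    columns = {}
    for line in f:
        if line.strip():
            columns.update(dict.fromkeys(json.loads(line)))
    return list(columns)


first = columns_from_first_line(lines)
union = columns_from_all_lines(lines)
missing = [c for c in union if c not in first]
print(missing)  # keys that a first-line-only inference would not know about
```

Scanning every line costs a full pass over the file, so a real fix would likely need to reconcile columns per partition rather than up front; that trade-off is an assumption here, not something stated in the issue.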

0 reactions
YarShev commented, Dec 12, 2022

@mertbozkir, your error seems unrelated to the original issue. To avoid your problem, put your code under if __name__ == '__main__':. See more in this Section about your issue.
