`dask.bag` JSONDecodeError when reading multiline json arrays
When using `dask.bag` to read JSON files, I get a `JSONDecodeError` when the JSON in a file spans multiple lines.
```python
>>> import json
>>> import dask.bag as db
>>> db.read_text('single-line.json').map(json.loads).compute()
[[{'a': 'b'}, {'c': 'd'}]]
>>> db.read_text('multi-line.json').map(json.loads).compute()
JSONDecodeError: Expecting value: line 2 column 1 (char 2)
```
Here is what the example files look like:
- `single-line.json`

```json
[{"a": "b"}, {"c": "d"}]
```

- `multi-line.json`

```json
[
{"a": "b"},
{"c": "d"}
]
```
Is this a bug or is there something I’m missing?
Also worth noting that I can read the multi-line file using just the standard lib:

```python
>>> with open('multi-line.json') as f:
...     data = f.read()
>>> print(json.loads(data))
[{'a': 'b'}, {'c': 'd'}]
```
This is using Python 3.6.0 and dask 0.15.0
Issue Analytics
- State:
- Created 6 years ago
- Comments: 9 (5 by maintainers)
Top GitHub Comments
http://docs.dask.org/en/latest/bag-creation.html
So each partition of your bag is a stringified JSON document. I suspect that `.map_partitions(json.loads)` should work. If not, I'd recommend a Stack Overflow question that has a minimal example of what you have and what you're trying to achieve. Those questions are generally easier to find later than the bottom of a GitHub issue 😃

For dask.delayed: http://docs.dask.org/en/latest/delayed.html
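One wrinkle with the `map_partitions` suggestion: `json.loads` expects a single string, while `map_partitions` hands your function the whole partition (for `read_text`, a sequence of lines), so a small wrapper that joins the lines first may be needed. A minimal sketch, simulating one partition with the standard library only (the `lines` list mirrors what `db.read_text('multi-line.json')` would produce for the example file):

```python
import json

# The lines of one partition, as db.read_text would deliver them for
# the multi-line.json example above (simulated here without dask).
lines = ['[\n', '{"a": "b"},\n', '{"c": "d"}\n', ']\n']

def parse_whole_partition(lines):
    # json.loads needs a single string, so join the partition's lines;
    # map_partitions expects an iterable back, hence the wrapping list.
    return [json.loads("".join(lines))]

print(parse_whole_partition(lines))
# With dask, the equivalent call would be:
#   db.read_text('multi-line.json').map_partitions(parse_whole_partition).compute()
```

This only makes sense when each partition is exactly one JSON document (e.g. one file per partition); line-delimited JSON would stick with plain `.map(json.loads)`.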
wow, thanks for the speedy responses! It is not line-delimited, unfortunately; it is a JSON blob per file. I'll have to look into `map_partitions` and `dask.delayed`, which I'm not familiar with (transitioning from Spark right now). If you could link to an example usage, that would be helpful. A small example file would be: