`dask.bag` JSONDecodeError when reading multiline json arrays


When using dask.bag to read json files I get a JSONDecodeError when the json in the file is multiline.

>>> import json
>>> import dask.bag as db

>>> db.read_text('single-line.json').map(json.loads).compute()
[[{'a': 'b'}, {'c': 'd'}]]

>>> db.read_text('multi-line.json').map(json.loads).compute()
JSONDecodeError: Expecting value: line 2 column 1 (char 2)

Here is what the example files look like:

  • single-line.json
[{"a": "b"}, {"c": "d"}]
  • multi-line.json
[
    {"a": "b"},
    {"c": "d"}
]

Is this a bug or is there something I’m missing?

It is also worth noting that I can read the multi-line file using just the standard library:

>>> with open('multi-line.json') as f:
...     data = f.read()
...     print(json.loads(data))
...
[{'a': 'b'}, {'c': 'd'}]

This is with Python 3.6.0 and dask 0.15.0.
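
A minimal sketch of why the two files behave differently, assuming (as the dask documentation describes) that db.read_text yields one bag element per line of the file, so .map(json.loads) receives each line on its own. The helper names parse_each_line and parse_whole are hypothetical, and the file content is inlined as a string so the example runs with the standard library alone.

```python
import json

# Contents of multi-line.json from the question, inlined as one string.
MULTI_LINE = '[\n    {"a": "b"},\n    {"c": "d"}\n]\n'


def parse_each_line(text):
    """Parse every line separately, mimicking .map(json.loads) on a text bag."""
    return [json.loads(line) for line in text.splitlines(keepends=True)]


def parse_whole(text):
    """Parse the complete document in a single call."""
    return json.loads(text)


# The first line is just "[" plus a newline, which is not valid JSON by itself.
try:
    parse_each_line(MULTI_LINE)
    line_error = None
except json.JSONDecodeError as exc:
    line_error = str(exc)

# Parsing the whole string at once works, as with the standard-library snippet.
whole = parse_whole(MULTI_LINE)

print("per-line parse failed with:", line_error)
print("whole-document parse gave:", whole)
```

The single-line file works only because its one line happens to be a complete JSON document.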

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Comments: 9 (5 by maintainers)

Top GitHub Comments

1 reaction
TomAugspurger commented, Nov 5, 2018

http://docs.dask.org/en/latest/bag-creation.html

So each partition of your bag is a stringified JSON document. I suspect that .map_partitions(json.loads) should work. If not, I’d recommend a Stack Overflow question with a minimal example of what you have and what you’re trying to achieve. Those questions are generally easier to find later than the bottom of a GitHub issue 😃
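
A hedged sketch of the map_partitions idea, not a confirmed fix from the thread. It assumes a partition arrives as an iterable of text lines rather than one string, so the lines are joined before parsing. The function name parse_partition is hypothetical, and a plain list stands in for a real partition so the example runs without dask.

```python
import json


def parse_partition(lines):
    """Join the lines of one partition and parse them as one JSON document.

    A one-element list is returned so the result is still an iterable
    of records, which is what a bag partition is expected to be.
    """
    return [json.loads("".join(lines))]


# Stand-in for one partition: the lines of multi-line.json, newlines included.
partition = ['[\n', '    {"a": "b"},\n', '    {"c": "d"}\n', ']\n']

parsed = parse_partition(partition)
print(parsed)
```

With dask installed, this would presumably be used as db.read_text('multi-line.json').map_partitions(parse_partition). That call is untested here and only makes sense if each file lands in exactly one partition, which I understand to be the behavior when blocksize is left unset.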

For dask.delayed: http://docs.dask.org/en/latest/delayed.html

0 reactions
AlJohri commented, Nov 5, 2018

Wow, thanks for the speedy responses! It is not line-delimited, unfortunately; it is one JSON blob per file. I’ll have to look into map_partitions and dask.delayed. I’m not familiar with them, as I’m transitioning from Spark right now. If you could link to an example usage, that would be helpful.

A small example file would be:

{
  "url": "http://example.com",
  "headline": "",
  "byline": "",
  "timestamp": "",
  "html": "",
  "description": "",
  "keywords": "",
  "published_time": "",
  "modified_time": "",
  "topics": []
}
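
For the one-JSON-blob-per-file case, a minimal sketch is to parse each file as a whole with a small loader function. The name load_json_file, the file names, and the trimmed-down records below are all hypothetical, and the example writes its own temporary files so it runs with the standard library alone.

```python
import json
import os
import tempfile


def load_json_file(path):
    """Read one file and parse its entire contents as a single JSON document."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Write two small multi-line JSON files, loosely shaped like the example above.
tmpdir = tempfile.mkdtemp()
paths = []
for i, url in enumerate(["http://example.com", "http://example.org"]):
    path = os.path.join(tmpdir, f"doc-{i}.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"url": url, "headline": "", "topics": []}, f, indent=2)
    paths.append(path)

# One record per file, regardless of how many lines each file spans.
records = [load_json_file(p) for p in paths]
print(records)
```

With dask, the same loader could plausibly be applied in parallel via db.from_sequence(paths).map(load_json_file), or wrapped per file with dask.delayed(load_json_file)(path). Both are untested suggestions here, not something the thread confirms.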