Best way to open twarc JSON (flattened or not) in R? There might be problems with the search JSON format
As you know, I am working on the analysis of links in tweets gathered with twarc in JSON
files (https://github.com/DocNow/twarc/wiki/Analyzing-links-in-tweets). I am having some problems reading directly from JSON.
I’ve used a quick trick to make it possible: I use head(n) to only use the first n tweets:
library(magrittr)  # provides the %>% pipe used below
fn <- "file_flatten.json"
wrongjson <- paste(readLines(fn) %>% head(100000), collapse = "")
If I don’t filter by the first lines I get this error:
Error in gsub("\\}\\s*\\{", "},{", wrongjson) :
'Calloc' could not allocate memory (18446744071562067968 of 1 bytes)
Then I rebuild the (unflattened) JSON by wrapping the objects in an array:
if (grepl("\\}\\s*\\{", wrongjson)) {
  wrongjson <- paste0("[", gsub("\\}\\s*\\{", "},{", wrongjson), "]")
}
json <- jsonlite::fromJSON(wrongjson, simplifyDataFrame = FALSE)
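A possible alternative that avoids building one giant string for gsub: if every line of the flattened file is a complete JSON object (which the `}\s*{` pattern above suggests, but which is an assumption here), each line can be parsed on its own. A minimal sketch; `parse_jsonl` is a hypothetical helper, and the temporary file with made-up tweets stands in for file_flatten.json:

```r
library(jsonlite)

# Hypothetical helper: parse a file that holds one JSON object per line
# and return a list with one parsed element per line.
parse_jsonl <- function(path) {
  lines <- readLines(path, warn = FALSE)
  lines <- lines[nzchar(trimws(lines))]  # skip blank lines
  lapply(lines, fromJSON, simplifyDataFrame = FALSE)
}

# Tiny stand-in for file_flatten.json (made-up tweets)
tmp <- tempfile(fileext = ".json")
writeLines(c(
  '{"id": "1", "text": "first tweet", "lang": "es"}',
  '{"id": "2", "text": "second tweet", "lang": "en"}'
), tmp)

tweets <- parse_jsonl(tmp)
length(tweets)    # number of parsed lines
tweets[[2]]$text  # text of the second tweet
```

For files too large to hold in memory at once, jsonlite::stream_in with a handler function may be worth trying, since it reads line-delimited JSON from a connection in pages.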
If I try to use the original, unflattened JSON file, I get this error. This only happens for big files:
json <- jsonlite::fromJSON( "file.json", simplifyDataFrame = FALSE)
Error in parse_con(txt, bigint_as_char) : parse error: trailing garbage
2021-04-25T15:34:51+00:00"}} {"possibly_sensitive": false, "
(right here) ------^
When I look at what is going on in the JSON files, this is what I see. A small search file is a well-formatted JSON file:
{
  "data": [
    {
      "text": "text of tweet",
      "referenced_tweets": [...],
      "lang": "es",
      "public_metrics": {...},
      ...
    },
    { ... another tweet ... },
    { ... another tweet ... }
  ],
  "includes": {
    "users": [...],
    "tweets": [
      {
        "text": "text of tweet",
        "lang": "es",
        ...
      },
      { ... another tweet ... },
      { ... another tweet ... }
    ],
    "media": [...]
  },
  "meta": {...},
  "__twarc": {...}
}
What I see in a big search JSON file are these 3 lines, which seems weird and might be causing the error. It looks more like JSONL:
{
  "data": [...],
  "includes": {...},
  "errors": [...],
  "meta": {...},
  "__twarc": {...}
}
{
  "data": [...],
  "includes": {...},
  "meta": {...},
  "__twarc": {...}
}
{
  "data": [...],
  "includes": {...},
  "meta": {...},
  "__twarc": {...}
}
The first line, when the JSON is properly formatted, goes from line 1 to 12,397, the second from 12,898 to 23,921, and the third from 23,922 to 26,520.
What is the expected format of the original JSON file after twarc2 search word > file.json?
Top GitHub Comments
To make things more concrete, sorry if I wasn’t clear earlier:
The result of (almost) all twarc CLI operations is a JSONL file with one line per API call response. Each response (line in the file) is a complete JSON document containing the data for a number of tweets. The tweets themselves are the array under the ‘data’ key in each response document. So to iterate through the file, you retrieve the data key for each line, then iterate through that array.
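In R, the iteration described above could look roughly like the following. This is a sketch under assumptions, not twarc's own code: `read_twarc_responses` is a hypothetical helper, the sample file is made up, and only the "data" key is used ("includes" and "meta" are ignored). It reads one line at a time so that a large file never has to be held as a single string.

```r
library(jsonlite)

# Hypothetical helper: read a twarc JSONL file (one API response per line)
# and collect the tweets under each response's "data" key into one list.
read_twarc_responses <- function(path) {
  con <- file(path, open = "r")
  on.exit(close(con))
  tweets <- list()
  while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    if (!nzchar(trimws(line))) next  # skip blank lines
    response <- fromJSON(line, simplifyVector = FALSE)
    # If a response has no "data" key, response$data is NULL and adds nothing
    tweets <- c(tweets, response$data)
  }
  tweets
}

# Made-up sample: two responses holding three tweets in total
tmp <- tempfile(fileext = ".jsonl")
writeLines(c(
  '{"data": [{"id": "1", "text": "a"}, {"id": "2", "text": "b"}], "meta": {"result_count": 2}}',
  '{"data": [{"id": "3", "text": "c"}], "meta": {"result_count": 1}}'
), tmp)

tweets <- read_twarc_responses(tmp)
length(tweets)                         # total tweets across all responses
sapply(tweets, function(tw) tw$text)   # text of every tweet
```

Growing a list with c() inside a loop is simple but slow for very many responses; collecting the per-line results in a preallocated list and combining them once at the end would scale better.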
You can check out the code that does the flattening in twarc, and also in the twarc-csv plugin, for more examples (I think the ensure_flattened function in twarc/expansions.py is what you want, but I’m writing from my phone).
As a side note, a colleague has also started some work on processing the tweets into an SQLite database, so you can just write a query to bring out what you need.
On Mon, 27 Sep 2021, 21:59 numeroteca wrote:
I am using Linux (Ubuntu). I create and analyze everything on the same OS.
I am not sure I understand what you say regarding newlines. I still don’t get why there are multiple lines in the JSON file.