Best way to open twarc JSON (flattened or not) in R? There might be problems with the search JSON format

As you know, I am working on the analysis of links in tweets gathered with twarc in JSON files (https://github.com/DocNow/twarc/wiki/Analyzing-links-in-tweets). I am having some problems reading directly from the JSON.

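I’ve used a quick trick to make it possible: I read only the first n lines, so that only the first n tweets are used:

fn <- "file_flatten.json"
# Read only the first 100,000 lines; this avoids loading the whole file and needs no pipe operator
wrongjson <- paste(readLines(fn, n = 100000), collapse = "")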

If I don’t limit the input to the first lines, I get this error:

Error in gsub("\\}\\s*\\{", "},{", wrongjson) : 
  'Calloc' could not allocate memory (18446744071562067968 of 1 bytes)

Then I rebuild the (unflattened) JSON as a single array:

if (grepl("\\}\\s*\\{", wrongjson)) {
  wrongjson <- paste0("[", gsub("\\}\\s*\\{", "},{", wrongjson), "]")
}
json <- jsonlite::fromJSON(wrongjson, simplifyDataFrame = FALSE)

If I try to use the original, unflattened JSON file, I get this error. This only happens for big files:

json <- jsonlite::fromJSON( "file.json", simplifyDataFrame = FALSE)
Error in parse_con(txt, bigint_as_char) : parse error: trailing garbage
          2021-04-25T15:34:51+00:00"}} {"possibly_sensitive": false, "
                     (right here) ------^

Here is what I see when I look inside the JSON files.

A small search file is a well-formed JSON file:

{
 "data": [
  {
  "text": "text of tweet",
  "referenced_tweets": [...],
  "lang": "es",
  "public_metrics": {...},
  ...
  },
  { ... another tweet ... },
  { ... another tweet ... }
 ],
 "includes": {
  "users": [...],
  "tweets": [
   {
     "text": "text of tweet",
     "lang": "es",
     ...
   },
   { ... another tweet ... },
   { ... another tweet ... }
  ],
  "media": [ ...]
 },
 "meta": {...},
 "__twarc": {...}
}

A big search JSON file, by contrast, contains these 3 lines. This seems weird and might be causing the error; it looks more like a JSONL file:

{
 "data": [...],
 "includes": {...},
 "errors": [...],
 "meta": {...},
 "__twarc": {...}
}
{
 "data": [...],
 "includes": {...},
 "meta": {...},
 "__twarc": {...}
}
{
 "data": [...],
 "includes": {...},
 "meta": {...},
 "__twarc": {...}
}

When the JSON is properly formatted, the first line spans lines 1 to 12,397, the second lines 12,898 to 23,921, and the third lines 23,922 to 26,520.

What is the expected format of the original JSON file produced by twarc2 search word > file.json?

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 8 (6 by maintainers)

Top GitHub Comments

2 reactions
SamHames commented, Sep 27, 2021

To make things more concrete, sorry if I wasn’t clear earlier:

The result of (almost) all twarc CLI operations is a JSONL file with one line per API call response. Each response (line in the file) is a complete JSON document containing the data for a number of tweets. The tweets themselves are the array under the ‘data’ key in each response document. So to iterate through the file, you retrieve the data key for that line, then iterate through that array.
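
The iteration described above can be sketched in Python, the language twarc itself is written in. This is a minimal illustration, not twarc's own code: the helper name `iter_tweets` and the two sample lines are hypothetical, and it assumes each non-blank line is one complete response whose tweets sit in an array under `data`.

```python
import json
from typing import Iterable, Iterator


def iter_tweets(lines: Iterable[str]) -> Iterator[dict]:
    """Yield every tweet from JSONL lines (assumed: one API response per line)."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        response = json.loads(line)  # each line is a complete JSON document
        # The tweets are the array under the "data" key; tolerate a missing key.
        for tweet in response.get("data", []):
            yield tweet


# Hypothetical in-memory stand-in for the lines of a twarc2 search output file.
sample_lines = [
    '{"data": [{"id": "1", "text": "first"}, {"id": "2", "text": "second"}], "meta": {}}',
    '{"data": [{"id": "3", "text": "third"}], "meta": {}}',
]

tweets = list(iter_tweets(sample_lines))
print(len(tweets))
```

With a real file, passing the open file handle instead of `sample_lines` should process one response at a time rather than loading the whole file into a single string. The same line-by-line approach should carry over to R by parsing each line separately.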

You can check out the code that does the flattening in twarc, and also in the twarc-csv plugin, for more examples (I think the ensure_flattened function in twarc/expansions.py is what you want, but I’m writing from my phone).
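
As a rough idea of what flattening means here, the sketch below inlines each tweet's author from `includes.users` by matching `author_id`. It is a simplified illustration only and not twarc's `ensure_flattened`, which covers many more cases. The helper name `flatten_response`, the `author` output key, and the sample data are assumptions made for this example.

```python
import json


def flatten_response(response: dict) -> list:
    """Return one dict per tweet with its author inlined.

    Simplified illustration only; NOT twarc's ensure_flattened.
    """
    users = response.get("includes", {}).get("users", [])
    users_by_id = {user["id"]: user for user in users}
    flat = []
    for tweet in response.get("data", []):
        tweet = dict(tweet)  # shallow copy so the input response is not modified
        # "author" is a key name chosen for this sketch; None if no user matches.
        tweet["author"] = users_by_id.get(tweet.get("author_id"))
        flat.append(tweet)
    return flat


# Hypothetical response document, shaped like one line of a search output file.
response = json.loads("""
{
  "data": [{"id": "1", "author_id": "10", "text": "hello"}],
  "includes": {"users": [{"id": "10", "username": "example_user"}]}
}
""")

flat = flatten_response(response)
print(flat[0]["author"]["username"])
```

Each flattened tweet then carries the information it needs on its own, which is what makes a one-tweet-per-line file convenient to analyze.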

As a side note, a colleague has also started some work on processing the tweets into an SQLite database, so you can just write a query to bring out what you need.

On Mon, 27 Sep 2021, 21:59 numeroteca wrote:

So, the result of a search file is a jsonl with only 1 line (that has many tweets in it) and inserts new lines if more tweets are gathered?

If I can understand how the json is built I can iterate and parse the json. Maybe it is documented somewhere and I haven’t seen it.

I am working with 200,000-300,000 tweets files, they could be bigger.

1 reaction
numeroteca commented, Sep 27, 2021

I am using Linux (Ubuntu). I create and analyze everything on the same OS.

Not sure I understand well what you say regarding new lines. I still don’t get why there are multiple lines in the json file.
