Best way to open twarc JSON (flattened or not) in R? There might be problems with the search JSON format

As you know, I am working on the analysis of links in tweets gathered with twarc in JSON files (https://github.com/DocNow/twarc/wiki/Analyzing-links-in-tweets). I am having some problems reading directly from the JSON files.

I've found a quick workaround: I use head(n) to keep only the first n tweets:

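library(magrittr)  # provides the %>% pipe
fn <- "file_flatten.json"
# Keep only the first 100,000 lines and join them into a single string
wrongjson <- paste(readLines(fn) %>% head(100000), collapse = "")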

If I don't limit the input to the first lines, I get this error:

Error in gsub("\\}\\s*\\{", "},{", wrongjson) : 
  'Calloc' could not allocate memory (18446744071562067968 of 1 bytes)

Then I rebuild the (unflattened) JSON as a single array:

if (grepl("\\}\\s*\\{", wrongjson)) {
  wrongjson <- paste0("[", gsub("\\}\\s*\\{", "},{", wrongjson), "]")
}
json <- jsonlite::fromJSON(wrongjson, simplifyDataFrame = FALSE)
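
One possible way to avoid building a single giant string is to parse the file line by line. This is only a sketch, not the twarc authors' method: it assumes the flattened file holds one JSON object per line and that jsonlite is installed. The helper name read_jsonl_lines and its chunk_size argument are hypothetical, and the sample data is made up so the example can run on its own.

```r
library(jsonlite)

# Hypothetical helper: parse a JSONL file (one JSON document per line)
# chunk by chunk, so the whole file is never pasted into one string.
read_jsonl_lines <- function(path, chunk_size = 1000) {
  con <- file(path, open = "r")
  on.exit(close(con))
  out <- list()
  repeat {
    lines <- readLines(con, n = chunk_size, warn = FALSE)
    if (length(lines) == 0) break
    lines <- lines[nzchar(trimws(lines))]  # skip blank lines
    parsed <- lapply(lines, fromJSON, simplifyDataFrame = FALSE)
    out <- c(out, parsed)
  }
  out
}

# Self-contained demonstration with a temporary file (made-up data)
tmp <- tempfile(fileext = ".json")
writeLines(c('{"id": "1", "text": "first"}',
             '{"id": "2", "text": "second"}'), tmp)
tweets <- read_jsonl_lines(tmp)
length(tweets)
```

Because each line is parsed separately, no regular expression has to run over the full file contents, which is where the allocation error above was raised.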

If I try to use the original, unflattened JSON file, I get the error below. This only happens with big files:

json <- jsonlite::fromJSON( "file.json", simplifyDataFrame = FALSE)
Error in parse_con(txt, bigint_as_char) : parse error: trailing garbage
          2021-04-25T15:34:51+00:00"}} {"possibly_sensitive": false, "
                     (right here) ------^
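
The "trailing garbage" message is consistent with the file containing several complete JSON documents one after another, which fromJSON does not accept as a single document. The following sketch reproduces that behaviour with made-up data, assuming jsonlite is installed:

```r
library(jsonlite)

# Two complete JSON documents, one per line (made-up sample data)
lines <- c('{"data": [1, 2]}', '{"data": [3]}')

# Parsing both as one document fails, because a second document
# follows the end of the first one
whole <- tryCatch(
  fromJSON(paste(lines, collapse = "\n")),
  error = function(e) "parse error"
)

# Parsing each line on its own works
per_line <- lapply(lines, fromJSON)
```

Here whole ends up holding the fallback string from the error handler, while per_line holds one parsed document per input line.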

Looking inside the JSON files, this is what I see.

A small search file is a well-formatted JSON file:

{
 "data": [
  {
  "text": "text of tweet",
  "referenced_tweets": [...],
  "lang": "es",
  "public_metrics": {...},
  ...
  },
  { ... another tweet ... },
  { ... another tweet ... }
 ],
 "includes": {
  "users": [...],
  "tweets": [
   {
     "text": "text of tweet",
     "lang": "es",
     ...
   },
   { ... another tweet ... },
   { ... another tweet ... }
  ],
  "media": [ ...]
 },
 "meta": {...},
 "__twarc": {...}
}

A big search JSON file, by contrast, contains these 3 lines. That seems odd and might be causing the error; it looks more like a JSONL file:

{
 "data": [...],
 "includes": {...},
 "errors": [...],
 "meta": {...},
 "__twarc": {...}
}
{
 "data": [...],
 "includes": {...},
 "meta": {...},
 "__twarc": {...}
}
{
 "data": [...],
 "includes": {...},
 "meta": {...},
 "__twarc": {...}
}

When the JSON is properly formatted (pretty-printed), the first line spans lines 1 to 12,397, the second lines 12,898 to 23,921, and the third lines 23,922 to 26,520.

What is the expected format of the original JSON file produced by twarc2 search word > file.json?

Issue Analytics

Created 2 years ago
Comments: 8 (6 by maintainers)

Top GitHub Comments

2 reactions

SamHames commented, Sep 27, 2021

To make things more concrete, sorry if I wasn’t clear earlier:

The result of (almost) all twarc CLI operations is a JSONL file with one line per API call response. Each response (each line in the file) is a complete JSON document containing the data for a number of tweets. The tweets themselves are the array under the 'data' key in each response document. So to iterate through the file, you read one line, retrieve its 'data' key, and then iterate through that array.
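
In R, the iteration described above could look roughly like the following. This is a minimal sketch under stated assumptions: jsonlite is installed, the file has exactly one response document per line, and the function name collect_tweets is hypothetical. The sample responses are made up so the example runs on its own.

```r
library(jsonlite)

# Hypothetical helper: read a JSONL file with one API response per line
# and collect the tweets found under each response's "data" key.
collect_tweets <- function(path) {
  con <- file(path, open = "r")
  on.exit(close(con))
  tweets <- list()
  # Read one line (one response document) at a time
  while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    if (!nzchar(trimws(line))) next  # skip blank lines
    response <- fromJSON(line, simplifyVector = FALSE)
    # response$data is a list of tweets; NULL if the key is missing
    tweets <- c(tweets, response$data)
  }
  tweets
}

# Self-contained demonstration with two made-up responses
tmp <- tempfile(fileext = ".jsonl")
writeLines(c(
  '{"data": [{"id": "1", "text": "a"}, {"id": "2", "text": "b"}], "meta": {"result_count": 2}}',
  '{"data": [{"id": "3", "text": "c"}], "meta": {"result_count": 1}}'
), tmp)
tweets <- collect_tweets(tmp)
texts <- vapply(tweets, function(t) t$text, character(1))
```

Reading one line at a time keeps only a single response in memory while it is parsed, which may matter for files with hundreds of thousands of tweets. Growing the list with c() is simple but not the fastest option for very large files.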

You can check out the code that does the flattening in twarc, and also in the twarc-csv plugin, for more examples (I think the ensure_flattened function in twarc/expansions.py is what you want, but I'm writing from my phone).

As a side note, a colleague has also started some work on processing the tweets into an SQLite database, so you can just write a query to pull out what you need.

On Mon, 27 Sep 2021, 21:59 numeroteca wrote:

So the result of a search is a JSONL file with only one line (containing many tweets), and new lines are added as more tweets are gathered?

If I can understand how the JSON is built, I can iterate over it and parse it. Maybe it is documented somewhere and I haven't seen it.

I am working with files of 200,000-300,000 tweets, and they could get bigger.

1 reaction

numeroteca commented, Sep 27, 2021

I am using Linux (Ubuntu). I create and analyze everything on the same OS.

I'm not sure I fully understand what you said about new lines. I still don't get why there are multiple lines in the JSON file.
