Inject metadata about when a tweet was retrieved
See original GitHub issueOne thing that isn’t present in Twitter API responses is any information about when the tweet was retrieved/the call was made. This is important metadata, and critical to some use cases (for example, if you’re looking at engagement metrics, there’s a big difference between a tweet collected one minute after compare to one day after). This information is currently stored in the collection log, but that isn’t super friendly because any programmatic use requires parsing the log
I propose we inject an ISO8601 string representing the time the call was made into either the existing meta
dictionary of some API responses, or (maybe less likely to collide), a top level key like collection_meta
, or _twarc
. This avoids needing to manage a sidechannel of information such as through the log file, and also stores the data exactly where it’s needed for processing.
Issue Analytics
- State:
- Created 3 years ago
- Reactions:1
- Comments:12 (11 by maintainers)
Top GitHub Comments
Catching up on this, documenting the request also brought to mind the WARC approach we’re using in SFM. We’re still using that approach, but it does indeed introduce overhead.
The WARC request records provide documentation of when the HTTP request was made, which API was called, the query parameters, auth info and more. But I think the info about which API was called and when the WARC was created are the pieces of info we use the most and are most relevant to twarc. So, agreed it is helpful/necessary to know when the request was made, in case you want to only work with data captured in a certain timeframe or want the context of the metrics, as mentioned above.
I was going to say, it’s a slippery slope, but I kinda like it. It almost starts to remind me of work that GWU was doing to write Twitter Data using the WARC format, which wraps the HTTP response in WARC Record which has metadata and is linked to an HTTP Request. That’s definitely one end of the spectrum. I think they switched away from doing that because it made downstream processing more difficult.
I like the idea of a top level property
__twarc
that is an object which contains whatever we deem relevant. Maybe with the version of twarc that was used?