merge rexport, pushshift and gdpr reddit data

I was able to get pushshift (as mentioned in the README) working to export old comments, and thought I’d mention it here.

It’s possible to use pushshift to get data from further back, but I’m not sure if it should be part of this project, since some of the older comments don’t have the same JSON structure. I can only assume pushshift is getting data from multiple sources the further back it goes. It requires some normalization, like here and here.

The only data I was missing due to the 1000 limit on queries from using rexport were comments. It exported the last 1000 but I have about 5000 on reddit in total.

Regarding HPI:

I wrote a simple package to request/save that data, with a DAL (whose PComment NamedTuple has similar @property attributes to rexport’s DAL) and a merge function, and now:

In [15]: from my.reddit import comments, _dal, pushshift_comments

In [16]: len(list(_dal().comments())) # from dal.py in rexport
Out[16]: 999

In [17]: len(list(pushshift_comments())) # from pushshift
Out[17]: 4891

In [18]: len(list(comments())) # merged data, using utc_time to remove duplicates
Out[18]: 4893

In [19]: comments
Out[19]: <function my.reddit.comments() -> Iterator[Union[rexport.dal.Comment, pushshift_comment_export.dal.PComment]]>
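
For anyone curious what a merge like the one above might look like, here is a minimal sketch, not the actual my.reddit code. It assumes both sources expose an integer `created_utc` field; the `Comment` class, its field names, and `merge_comments` are hypothetical stand-ins for rexport's `Comment` and pushshift's `PComment`.

```python
from itertools import chain
from typing import Iterable, Iterator, NamedTuple, Set


class Comment(NamedTuple):
    """Hypothetical stand-in for rexport's Comment / pushshift's PComment."""
    created_utc: int  # epoch seconds; the field name is an assumption
    text: str
    source: str


def merge_comments(*sources: Iterable[Comment]) -> Iterator[Comment]:
    """Yield comments from every source, skipping timestamps already seen."""
    seen: Set[int] = set()
    for comment in chain(*sources):
        if comment.created_utc in seen:
            continue  # duplicate of a comment from an earlier source
        seen.add(comment.created_utc)
        yield comment


# Sources listed first win when the same comment appears in both.
rexport_comments = [Comment(300, "newest", "rexport"), Comment(200, "middle", "rexport")]
pushshift = [Comment(200, "middle", "pushshift"), Comment(100, "oldest", "pushshift")]

merged = list(merge_comments(rexport_comments, pushshift))
print(len(merged))
```

One caveat with keying on the timestamp alone: two distinct comments posted in the same second would be collapsed into one, so a comment id would probably be a safer key where both sources provide it.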

It’s possible that one could write enough @property wrappers to handle the differences in the JSON representations of old pushshift data, but I’m unsure if that’s something you want to pursue here.
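
To make the @property idea concrete, here is a rough sketch under assumptions. The specific differences it handles (a timestamp that may arrive as a string, and a `permalink` key that may be absent) are made up for illustration and are not confirmed quirks of pushshift data; `PCommentSketch` is a hypothetical name.

```python
from datetime import datetime, timezone
from typing import Any, Dict, NamedTuple, Optional


class PCommentSketch(NamedTuple):
    """Hypothetical wrapper around one raw pushshift JSON object."""
    raw: Dict[str, Any]

    @property
    def created(self) -> datetime:
        # Assumed difference: the timestamp may be an int or a numeric string.
        return datetime.fromtimestamp(int(float(self.raw["created_utc"])), tz=timezone.utc)

    @property
    def text(self) -> str:
        return self.raw["body"]

    @property
    def permalink(self) -> Optional[str]:
        # Assumed difference: older records may not carry this key at all.
        return self.raw.get("permalink")


older = PCommentSketch({"created_utc": "1300000000", "body": "hi"})
newer = PCommentSketch({"created_utc": 1300000000, "body": "hi", "permalink": "/r/example/1"})
print(older.created == newer.created, older.permalink)
```

Downstream code then only touches `created`, `text` and `permalink`, and never the raw keys, so each JSON variation is handled in one place.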

Issue Analytics

  • State: open
  • Created 3 years ago
  • Reactions: 1
  • Comments: 25 (25 by maintainers)

Top GitHub Comments

1 reaction
seanbreckenridge commented, Oct 31, 2021

Will leave this open for now, since my.reddit.gdpr is yet to be written, but a good chunk of this has been added in #179

1 reaction
karlicoss commented, Oct 28, 2021

I’ve totally missed your message from April about official reddit GDPR, thanks that’s exciting! Requested mine as well.

Yeah, I see the problem with missing attributes… I guess it makes sense to discuss it here, since it’s kind of a generic discussion about combining data sources.

I also see a few options:

  • throw on trying to access missing attributes (e.g. a Save from the gdpr source would throw on trying to access title). This is type safe but might be unexpected and annoying to handle downstream. E.g. hpi query/hpi stat might start failing while trying to order by timestamp
  • return default value for missing attributes
    • we could still return None (even despite mypy annotation), but I guess this could also be annoying to handle downstream
    • for many fields could return some type-safe default value… but it might end up pretty confusing

Considering neither of these options is particularly great, it could be configurable (fits nicely into allowing different ‘defensiveness policies’ for different modules). But I guess it makes it pretty complicated too, and it’s unclear if there are any real benefits… So what you suggested with combining different sources for different data makes sense for now, and we can change it later if we come up with something better.
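
As a rough illustration of what a configurable policy could look like, here is a sketch of the "throw" and "type-safe default" options side by side. Nothing here is an existing HPI API: `MissingPolicy`, `MissingField` and this `Save` class are hypothetical names invented for the example.

```python
from enum import Enum
from typing import Any, Dict, NamedTuple


class MissingPolicy(Enum):
    RAISE = "raise"      # throw when a source lacks the attribute
    DEFAULT = "default"  # return a type-safe placeholder instead


class MissingField(Exception):
    """Raised when an attribute is not available from the underlying source."""


class Save(NamedTuple):
    """Hypothetical saved item; a gdpr-style record here has no title key."""
    raw: Dict[str, Any]
    policy: MissingPolicy = MissingPolicy.RAISE

    @property
    def title(self) -> str:
        if "title" in self.raw:
            return self.raw["title"]
        if self.policy is MissingPolicy.RAISE:
            raise MissingField("title is not available from this source")
        return ""  # type-safe, but downstream cannot tell it from a real empty title


strict = Save({"id": "abc"})
lenient = Save({"id": "abc"}, MissingPolicy.DEFAULT)

try:
    strict.title
except MissingField as e:
    print("strict:", e)
print("lenient:", repr(lenient.title))
```

The sketch also shows the downside raised above: the strict policy forces every consumer to handle the exception, while the lenient one silently produces values that look real.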


