merge rexport, pushshift and gdpr reddit data

I was able to get pushshift, as mentioned in the README, working to export old comments, so I thought I’d mention it here.

It’s possible to use pushshift to get data from further back, but I’m not sure it should be part of this project. Some of the older comments don’t have the same JSON structure; I can only assume pushshift is pulling data from multiple sources the further back it goes. It requires some normalization, like here and here.

The only data I was missing because of the 1000-item limit on rexport queries were comments: it exported the last 1000, but I have about 5000 on reddit in total.

Regarding HPI:

Wrote a simple package to request/save that data, with a DAL (whose PComment NamedTuple has similar @property attributes to rexport’s DAL) and a merge function, and now:

In [15]: from my.reddit import comments, _dal, pushshift_comments

In [16]: len(list(_dal().comments())) # from dal.py in rexport
Out[16]: 999

In [17]: len(list(pushshift_comments())) # from pushshift
Out[17]: 4891

In [18]: len(list(comments())) # merged data, using utc_time to remove duplicates
Out[18]: 4893

In [19]: comments
Out[19]: <function my.reddit.comments() -> Iterator[Union[rexport.dal.Comment, pushshift_comment_export.dal.PComment]]>
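The merge step in the session above can be sketched roughly as follows. This is a minimal illustration, not the actual my.reddit code: FakeComment and merge_comments are hypothetical names, and the only assumption carried over from the session is that every comment object exposes a utc_time value that can be used as a deduplication key.

```python
from itertools import chain
from typing import Iterable, Iterator, NamedTuple, Set


class FakeComment(NamedTuple):
    # Hypothetical stand-in for rexport's Comment / the pushshift PComment.
    utc_time: int
    text: str


def merge_comments(*sources: Iterable[FakeComment]) -> Iterator[FakeComment]:
    """Yield comments from every source, skipping any utc_time already seen."""
    seen: Set[int] = set()
    for comment in chain(*sources):
        if comment.utc_time in seen:
            continue  # duplicate of a comment yielded by an earlier source
        seen.add(comment.utc_time)
        yield comment


rexport_like = [FakeComment(1, "a"), FakeComment(2, "b")]
pushshift_like = [FakeComment(2, "b (pushshift copy)"), FakeComment(3, "c")]
merged = list(merge_comments(rexport_like, pushshift_like))
```

With this approach the first source wins on a collision, so the order of the arguments matters. Two distinct comments posted in the same second would also collapse into one, so keying on the comment id (if both sources have it) may be safer.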

It’s possible that one could write enough @property wrappers to handle the differences in the JSON representations of old pushshift data; I’m unsure if that’s something you want to pursue here.
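To make the @property idea concrete, here is a small sketch of a wrapper that hides shape differences behind properties. The class name NormalizedComment and both differences shown (a timestamp stored as a number or as a numeric string, and a score that sometimes only exists as "ups") are invented for illustration; the real differences in old pushshift records may look nothing like this.

```python
from typing import Any, Dict, NamedTuple


class NormalizedComment(NamedTuple):
    # Hypothetical wrapper around one raw JSON record.
    raw: Dict[str, Any]

    @property
    def utc_time(self) -> int:
        # Assumed difference: int, float, or numeric string timestamps.
        return int(float(self.raw["created_utc"]))

    @property
    def score(self) -> int:
        # Assumed difference: older records carry "ups" instead of "score".
        if "score" in self.raw:
            return int(self.raw["score"])
        return int(self.raw["ups"])


newer = NormalizedComment({"created_utc": 1600000000, "score": 5})
older = NormalizedComment({"created_utc": "1300000000", "ups": 3})
```

Callers only ever touch utc_time and score, so each newly discovered quirk in old records becomes a change inside one property rather than in downstream code.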

Issue Analytics

State:
Created 3 years ago
Reactions: 1
Comments: 25 (25 by maintainers)

Top GitHub Comments

1 reaction

seanbreckenridge commented, Oct 31, 2021

Will leave this open for now, since my.reddit.gdpr is yet to be written, but a good chunk of this has been added in #179

1 reaction

karlicoss commented, Oct 28, 2021

I’ve totally missed your message from April about official reddit GDPR, thanks that’s exciting! Requested mine as well.

Yeah, I see the problem with missing attributes… I guess it makes sense to discuss it here, since it’s kind of a generic discussion about combining data sources.

I also see a few options:

- throw on trying to access missing attributes (e.g. a Save from the gdpr source would throw on trying to access title). This is type safe, but might be unexpected and annoying to handle downstream. E.g. hpi query/hpi stat might start failing while trying to order by timestamp
- return a default value for missing attributes
  - we could still return None (even despite the mypy annotation), but I guess this could also be annoying to handle downstream
  - for many fields we could return some type-safe default value… but it might end up pretty confusing

Considering neither of these options is particularly great, it could be configurable (which fits nicely into allowing different ‘defensiveness policies’ for different modules). But I guess that makes it pretty complicated too, and it’s unclear if there are any real benefits… So what you suggested, combining different sources for different data, makes sense for now, and we can change it later if we come up with something better.
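For reference, the two options and the configurable variant could look roughly like this. It is only a sketch under stated assumptions: MissingPolicy and GdprSave are hypothetical names, nothing like this exists in HPI as far as this thread shows, and the empty-string default for title is an arbitrary choice.

```python
from enum import Enum
from typing import Any, Dict


class MissingPolicy(Enum):
    # Hypothetical 'defensiveness policy' switch.
    RAISE = "raise"
    DEFAULT = "default"


class GdprSave:
    # Hypothetical Save built from a source that may lack some fields.
    def __init__(self, raw: Dict[str, Any],
                 policy: MissingPolicy = MissingPolicy.RAISE) -> None:
        self.raw = raw
        self.policy = policy

    def _get(self, key: str, default: Any) -> Any:
        if key in self.raw:
            return self.raw[key]
        if self.policy is MissingPolicy.RAISE:
            # Option 1: fail loudly when the source lacks the field.
            raise AttributeError(f"{key!r} is not available from this source")
        # Option 2: fall back to a type-safe default.
        return default

    @property
    def title(self) -> str:
        return self._get("title", default="")


strict = GdprSave({"id": "abc"})
lenient = GdprSave({"id": "abc"}, policy=MissingPolicy.DEFAULT)
```

Accessing strict.title raises, while lenient.title returns "". That also shows the downside mentioned above: downstream code cannot tell a genuinely empty title from a missing one.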