merge rexport, pushshift and gdpr reddit data
Was able to get pushshift, as mentioned in the README, working to export old comments. Thought I’d mention it here.
It’s possible to use pushshift to get data further back, but I’m not sure if it should be part of this project, since some of the older comments don’t have the same JSON structure. I can only assume pushshift is getting data from multiple sources the further it goes back. It requires some normalization, like here and here.
The only data I was missing, due to the 1000-item limit on queries when using rexport, were comments. It exported the last 1000, but I have about 5000 on reddit in total.
Regarding HPI:
Wrote a simple package to request/save that data, with a dal (whose PComment NamedTuple has similar @property attributes to rexport’s DAL), and a merge function, and now:
In [15]: from my.reddit import comments, _dal, pushshift_comments
In [16]: len(list(_dal().comments())) # from dal.py in rexport
Out[16]: 999
In [17]: len(list(pushshift_comments())) # from pushshift
Out[17]: 4891
In [18]: len(list(comments())) # merged data, using utc_time to remove duplicates
Out[18]: 4893
In [19]: comments
Out[19]: <function my.reddit.comments() -> Iterator[Union[rexport.dal.Comment, pushshift_comment_export.dal.PComment]]>
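For illustration, the merge step in the session above could look roughly like the sketch below. This is not the actual my.reddit code: the Comment tuple, its created_utc field and merge_comments are hypothetical stand-ins, and it assumes both sources expose the creation time as an integer UTC timestamp. Earlier sources win, so passing rexport first keeps its richer objects when a comment appears in both.

```python
from typing import Iterable, Iterator, NamedTuple, Set


class Comment(NamedTuple):
    # Hypothetical stand-in for rexport's Comment / pushshift's PComment
    source: str
    created_utc: int
    text: str


def merge_comments(*sources: Iterable[Comment]) -> Iterator[Comment]:
    """Yield comments from each source in order, skipping timestamps already seen."""
    seen: Set[int] = set()
    for source in sources:
        for comment in source:
            if comment.created_utc in seen:
                continue  # same creation time: treated as a duplicate
            seen.add(comment.created_utc)
            yield comment


# Made-up sample data: one comment overlaps between the two sources
rexport_comments = [
    Comment("rexport", 1600000300, "c"),
    Comment("rexport", 1600000200, "b"),
]
pushshift = [
    Comment("pushshift", 1600000200, "b"),
    Comment("pushshift", 1600000100, "a"),
]
merged = list(merge_comments(rexport_comments, pushshift))
```

One caveat of keying on the timestamp alone: two distinct comments created in the same second would collapse into one. Keying on the comment id, if both sources provide it, would avoid that.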
It’s possible that one could write enough @property wrappers to handle the differences in the JSON representations of old pushshift data; unsure if that’s something you want to pursue here.
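A minimal sketch of what such @property wrappers might look like. The specific differences handled here are assumptions chosen for illustration (a missing permalink key and a string-typed created_utc in older records), not a description of what pushshift actually returns, and this PComment is a simplified stand-in for the real one.

```python
from datetime import datetime, timezone
from typing import Any, Dict, NamedTuple


class PComment(NamedTuple):
    raw: Dict[str, Any]  # one comment as decoded JSON

    @property
    def created(self) -> datetime:
        # Assumption: the timestamp may arrive as an int, float or numeric string
        return datetime.fromtimestamp(int(float(self.raw["created_utc"])), tz=timezone.utc)

    @property
    def text(self) -> str:
        return self.raw["body"]

    @property
    def url(self) -> str:
        # Assumption: older records may lack "permalink", so rebuild it
        permalink = self.raw.get("permalink")
        if permalink is None:
            link_id = self.raw["link_id"]
            if link_id.startswith("t3_"):
                link_id = link_id[3:]  # drop the submission type prefix
            permalink = f"/r/{self.raw['subreddit']}/comments/{link_id}/_/{self.raw['id']}/"
        return "https://reddit.com" + permalink


# Made-up records in a "new" and an "old" shape
new = PComment({"id": "abc", "created_utc": 1600000000, "body": "hi",
                "permalink": "/r/test/comments/xyz/_/abc/"})
old = PComment({"id": "def", "created_utc": "1300000000", "body": "old",
                "subreddit": "test", "link_id": "t3_xyz"})
```

Callers then only touch created, text and url, so the differences between record shapes stay inside the wrapper.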
Top GitHub Comments
Will leave this open for now, since my.reddit.gdpr is yet to be written, but a good chunk of this has been added in #179

I’ve totally missed your message from April about official reddit GDPR, thanks, that’s exciting! Requested mine as well.
Yeah, I see the problem with missing attributes… I guess it makes sense to discuss it here, since it’s kind of a generic discussion about combining data sources.
I see also options
Considering neither of these options is particularly great, it could be configurable (fits nicely into allowing different ‘defensiveness policies’ for different modules). But I guess it makes it pretty complicated too, and it’s unclear if there are any real benefits… So what you suggested, combining different sources for different data, makes sense for now, and we can change it later if we come up with something better.