Future of RipMe: separation of program + ripper logic
@4pr0n seems to not be maintaining this project on a regular basis anymore. It’s totally understandable. I myself haven’t had any time lately to dedicate to any side hobby projects, let alone this one. For the most part, RipMe still continues to work for me. (I still use it multiple times a week.)
As discussed in #247, it seems many people feel the project is dead, or that it no longer works as well for their scenarios and those scenarios are not being fixed.
@Wiiplay123 @i-cant-git - since you two have also been contributors to the project and commented on #247, I wonder if we can discuss a potential future for this project? Also, could you share any currently maintained projects you know of that can be used as alternatives?
I think this style of catch-all project is the sort of thing that is unmaintainable in the long term except by a seriously dedicated effort. The problem is that there’s essentially no limit to the number of rippers that could be included in this project’s source code. Things have gotten really bloated here, and everyone is depending on official updates from a single source to add new rippers. It’s hard to know how to prioritize maintenance.
Questions like which rippers people use most can only be answered by how loudly people complain about the broken ones. There are a lot of more obscure sites that are supported in-box with RipMe (I contributed some of them), and maybe some of the more common ones fall by the wayside when trying to support so many.
I’ve been thinking lately that this project is really in two distinct parts:
1. There’s the core of the project, which provides the structure, interface, and framework used to define the rippers.
2. There’s the rippers themselves, which all more or less follow a consistent algorithm of starting at a URL, navigating through some HTML, extracting image links from the HTML, and queuing those images to be ripped.
I’ve been thinking it might be a worthwhile effort to separate the two concerns. Keep all of part #1 in the same repo, and expose rippers through a plug-in model. Move the rippers into another repo, maybe keep just a core set of rippers maintained by the main project (in a separate repo), and make it easy for users to define their own locally on their machine. Add a way to add new ripper sources (GitHub repos, local sources, links to individual definition files) in the RipMe UI.
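As a rough sketch of what such a plug-in model could look like: the core only knows about a registry, and each ripper is a small class that declares which URLs it handles and how to pull image links out of a page. Every name here (`Ripper`, `RipperRegistry`, `ExampleRipper`, the `gallery.example.com` host) is hypothetical and chosen for illustration; none of it is RipMe's actual API.

```python
import re
from abc import ABC, abstractmethod
from typing import Iterable, List, Optional, Type


class Ripper(ABC):
    """Hypothetical plug-in interface; not RipMe's actual API."""

    # Regular expression describing the URLs this ripper can handle.
    url_pattern: str = ""

    @classmethod
    def can_rip(cls, url: str) -> bool:
        return re.search(cls.url_pattern, url) is not None

    @abstractmethod
    def image_urls(self, html: str) -> Iterable[str]:
        """Return the image URLs found in an already-downloaded page."""


class RipperRegistry:
    """The core program talks to this registry and knows nothing about sites."""

    def __init__(self) -> None:
        self._rippers: List[Type[Ripper]] = []

    def register(self, ripper_cls: Type[Ripper]) -> Type[Ripper]:
        self._rippers.append(ripper_cls)
        return ripper_cls  # returning the class lets this work as a decorator

    def find(self, url: str) -> Optional[Type[Ripper]]:
        for ripper_cls in self._rippers:
            if ripper_cls.can_rip(url):
                return ripper_cls
        return None


registry = RipperRegistry()


@registry.register
class ExampleRipper(Ripper):
    """Made-up ripper for a made-up site, defined outside the core."""

    url_pattern = r"^https?://gallery\.example\.com/"

    def image_urls(self, html: str) -> Iterable[str]:
        return re.findall(r'<img[^>]+src="([^"]+)"', html)
```

With something like this, a ripper shipped from a separate repo or a local file would only need to be imported and registered; the core would call `registry.find(url)` and never reference a site by name.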
Pages that host image content generally look like one of the following:
1. Images are embedded directly in the page (action: download the embedded images)
2. Thumbnails link to full-size images (action: download the images at the links)
3. Thumbnails link to another page like 1 (action: load the linked page and then use action 1)
4. Thumbnails link to another gallery page like 2 or 3 (action: load the page and then use action 2 or 3)
5. Thumbnails link through an ad-wall which redirects to an image or a page like 1 or 2 (I’m not sure if we currently have any rippers which automate getting through the ad wall)
The sites we are interested in making rippers for are either one of the above, or a social style of website where users aggregate content by linking to pages like the above.
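To make the first few page styles concrete, here is a minimal sketch that sorts the images on a page into "embedded" (style 1), "thumbnail links to an image" (style 2), and "thumbnail links to another page" (styles 3 and 4). It uses only the Python standard library. The class name, the extension list, and the rule "a link target ending in an image extension is a full-size image" are all assumptions for illustration, not how RipMe works today.

```python
from html.parser import HTMLParser
from typing import List, Optional
from urllib.parse import urljoin

# Assumed heuristic: a link whose path ends in one of these is a full-size image.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".webp")


class ImageLinkParser(HTMLParser):
    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.embedded: List[str] = []    # style 1: <img> not wrapped in a link
        self.linked: List[str] = []      # style 2: thumbnail links to an image
        self.page_links: List[str] = []  # styles 3/4: thumbnail links to a page
        self._current_href: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            self._current_href = attrs.get("href")
        elif tag == "img" and attrs.get("src"):
            if self._current_href is None:
                self.embedded.append(urljoin(self.base_url, attrs["src"]))
            else:
                target = urljoin(self.base_url, self._current_href)
                # Ignore any query string when checking the extension.
                if target.lower().split("?")[0].endswith(IMAGE_EXTENSIONS):
                    self.linked.append(target)
                else:
                    self.page_links.append(target)

    def handle_endtag(self, tag):
        if tag == "a":
            self._current_href = None


def collect_images(html: str, base_url: str) -> ImageLinkParser:
    parser = ImageLinkParser(base_url)
    parser.feed(html)
    parser.close()
    return parser
```

A page where `embedded` or `linked` is non-empty can be ripped directly; entries in `page_links` would need one more page load each before the same logic applies.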
For sites formatted like 1 and 2 (example: 4chan threads are formatted like 2), AND where all content is on a single page (whether the content is embedded or linked), there are already many tools which download content from any arbitrary website. No specific ripper logic is really needed in that case, and requiring it actually restricts the usefulness of the ripper significantly. Here’s a recommendation for a download manager that can deal with that kind of website (Firefox only, unfortunately, but since RipMe users are already using an external program to do our image downloads, I’m sure that’s okay): http://www.downthemall.net/
For me, that covers a lot of sites I’m interested in that don’t already have rippers, and also covers a lot of sites that do already have rippers. In that case, the rippers are probably redundant.
For sites like 1 and 2 where the content isn’t all on a single page, we need to supply some logic to navigate from page to page. Once we get to a page where they apply, the generic techniques for 1 and 2 can be applied automatically.
Sometimes it is possible to construct the URL of the image from the thumbnail in a gallery in style 3. The e-hentai ripper is an example of this. Following a technique like that saves us from loading a ton of additional pages, which saves time and keeps the ripper from getting blocked because it made too many requests in a short period of time (DDoS detection or REST API rate limiting).
One place a program like this helps a lot is for sites like Tumblr and Instagram that deliberately make it difficult for a user to download the content, either by blocking right-clicking or by obscuring the image in the web page in a way that makes it difficult or impossible to right-click and save. But, because those images are downloaded into the page, it is possible for us to get those links and download them to save on the user’s computer. The location of the URL on the page can usually be extracted with some simple HTML-traversal logic. This is the sort of automation we strive to allow with RipMe.
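Where the extra page loads can't be avoided, one simple way to stay under a site's request limits is to enforce a minimum interval between requests. The sketch below is only an illustration: the `Throttle` class and its parameters are made up, and the right interval would depend on the site.

```python
import time
from typing import Callable, Optional


class Throttle:
    """Enforce a minimum interval (in seconds) between consecutive requests."""

    def __init__(
        self,
        min_interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock  # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last: Optional[float] = None

    def wait(self) -> float:
        """Block until the next request is allowed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
            if delay > 0:
                self._sleep(delay)
        # Record the moment the request is actually allowed to go out.
        self._last = now + delay
        return delay
```

A ripper would call `throttle.wait()` right before each page request, so a burst of queued pages gets spread out instead of hitting the site all at once.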
I think the biggest use-case for this application is mostly for websites that host community-generated content in large or even indefinitely-sized albums, especially when that content is spread out over many pages: Reddit (subreddits, user profiles), Imgur (mainly because of heavy use in Reddit), Tumblr, Instagram.
Those are just some thoughts. There’s likely to be more.
Summary of action items:
To reduce Ripper maintenance, enable automatic detection of page styles 1 and 2 and do the right thing in those cases. Then, remove rippers with only that basic logic. Possibly, add a whitelist of URL patterns known to be page styles 1 and 2, so that the user never needs to know there’s no longer a dedicated Ripper for those pages.
Page styles 3 and 4 could be automatically detected and ripped, but we should be careful to add delays so that the Ripper doesn’t get blocked for requesting too many pages at once. Rippers that use the gallery to deduce the actual image URLs should be kept. This style of ripper logic would likely be easy to encode as a simple RegEx like `s/(.*foobar\.com.*)\/thumb(\/.*\.jpg)/\1\2/`, which removes the `/thumb` segment from the path.
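For illustration, the same substitution can be written with Python's `re` module. The `foobar.com` pattern comes from the example above; the function name and the sample URLs are made up.

```python
import re

# Same rule as the sed-style expression above: drop "/thumb" from the path.
THUMB_RULE = re.compile(r"(.*foobar\.com.*)/thumb(/.*\.jpg)")


def full_size_url(thumbnail_url: str) -> str:
    """Rewrite a thumbnail URL to the full-size URL; leave other URLs unchanged."""
    return THUMB_RULE.sub(r"\1\2", thumbnail_url, count=1)


print(full_size_url("http://foobar.com/albums/42/thumb/img001.jpg"))
# http://foobar.com/albums/42/img001.jpg
```

Because a rule like this is just a pattern and a replacement string, it could plausibly live in a data file rather than in compiled code.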
Page style 5 would be difficult to detect automatically without trying to navigate the pages but we might be able to add logic to automatically click through different kinds of ad-walls like adf.ly. Once we get to the other side of the ad-wall we could try to automatically detect the type of page and do the right thing.
After that, any Rippers which meaningfully improve performance or reliability could be added to one of the Ripper galleries (either the separate repo maintained by the RipMe project maintainers, or a third party ripper repo).
Even for the well-known gallery types, it’s still nice to automatically detect an album name from the page. I think currently we never try to detect an album name and instead leave it to the ripper. It’s still good to let the ripper decide if it wants to, but as long as we’re going ahead with automatic detection, detecting the title from the page would be nice as well.
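One possible way to detect an album name, sketched with the standard library only: take the page's `<title>`, collapse whitespace, and replace characters that are awkward in directory names. The function name, the fallback value, and the choice of `<title>` as the source are assumptions; a real implementation might prefer a heading or a site-specific element.

```python
import re
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the page's <title> element."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def album_name_from_page(html: str, fallback: str = "untitled") -> str:
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace, then replace characters unsafe in folder names.
    name = re.sub(r"\s+", " ", parser.title).strip()
    name = re.sub(r'[\\/:*?"<>|]', "_", name)
    return name or fallback
```

A ripper that knows a better name could still override this; the detected title would only be the default.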
- Created 6 years ago
Top GitHub Comments
Additionally, I realized that if we release at least some rippers via separate repos, especially as non-compiled scripting or spec code, we don’t have to re-release (and force users to re-install) a new version of RipMe for every small fix or update to the rippers. We just download the new ripper definitions and continue with the same version of the software.
I’d propose moving to version 2.0 if we refactor the API this way, and make sure that we use SemVer with respect to the ripper definition interface to ensure compatibility of rippers with a particular version of RipMe.
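As a sketch of what that compatibility check could look like: each ripper definition declares the interface version it was written against, and RipMe loads it only if the major version matches and RipMe's own interface version is at least as new. This is a simplified reading of SemVer (plain `MAJOR.MINOR.PATCH` only, no pre-release tags), and the function names are hypothetical.

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, int, int]:
    """Parse a plain MAJOR.MINOR.PATCH string into a tuple of integers."""
    major, minor, patch = (int(part) for part in text.split("."))
    return major, minor, patch


def is_compatible(host_api: str, ripper_requires: str) -> bool:
    """True if a ripper written for `ripper_requires` can run on `host_api`."""
    host = parse_version(host_api)
    required = parse_version(ripper_requires)
    # Same major version, and the host is at least as new in minor/patch.
    return host[0] == required[0] and host[1:] >= required[1:]
```

Under this rule a ripper written for interface 2.1.0 would load on 2.3.0, but not on 2.0.5 or 3.0.0.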
@ravenstorm767 - There’s nothing to worry about. The program would work the same way as before, with some new features, as described above. Everything would either just work or, to support new websites, would work like installing a plug-in.
I do hear you that making the project more complicated for the users is a non-goal. I’ll reconsider some of what I’ve proposed to ensure that the project stays simple and easy to use.
The main motivation here is to reorganize the code to make things a bit easier to maintain. As you’ve probably noticed, there’s a lot of interest in a large number of websites for this program to support, and a serious lack of man-hours to support this project. The less that needs maintaining in the core project, the easier it will be on the maintainers.
If we could make a large number of websites “just work” without specific support, that would be a huge step forward in reducing the cost to maintain this project, with no loss of features.
Until I get started, I won’t know how feasible these changes will be. Now that the project has an additional maintainer, it will be much easier to keep the project healthy.