[FEATURE] Allow for custom HttpClient implementations
Is your feature request related to a problem? Please describe.
I’m always frustrated when I have to let skrape{it} open an OkHttp3 client (via KoHttp) when I already use ktor clients in the rest of my code for other purposes.
Describe the solution you’d like
Provide an interface (similar to the existing `it.skrape.core.fetcher.Fetcher`) that the user has to implement in order to use their own HTTP client implementation.
Describe alternatives you’ve considered
Fetching the HTML code manually and loading it into skrape{it} as a raw String. This works, but feels like a dirty workaround.
Additional context
When using the web-scraping functionality of skrape{it} in a bigger project, I already use a (custom configured) ktor client with a connection pool and other fancy stuff. It feels wrong to fire up OkHttp3 only to fetch two website HTML documents.
There already exists a `Fetcher` interface, and I believe that changing the signature from `fun fetch(): Result` to `fun fetch(request: Request): Result` would already be enough to allow for custom client implementations. Perhaps this will also require decoupling some configuration-specific values, like SSL verification and timeouts, into a second `Configuration` interface, because most clients only need that kind of information once upon creation and not upon every single request.
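A minimal sketch of what this split could look like. Only the names `Fetcher`, `Request`, `Result` and `Configuration` come from the proposal above; their fields, the `FunctionFetcher` adapter and the `transport` function are hypothetical stand-ins, not skrape{it}'s real definitions.

```kotlin
// Simplified stand-ins for the library's request/response types (assumed fields).
data class Request(val url: String, val headers: Map<String, String> = emptyMap())

data class Result(val statusCode: Int, val body: String)

// Settings a client usually needs once, at construction time (assumed fields).
data class Configuration(val timeoutMillis: Long = 5_000, val sslRelaxed: Boolean = false)

// Proposed signature: the request is handed over on every call.
interface Fetcher {
    fun fetch(request: Request): Result
}

// Hypothetical adapter: delegates to any function that can perform the call,
// for example a thin wrapper around an already configured ktor client.
class FunctionFetcher(
    private val config: Configuration,
    private val transport: (url: String, timeoutMillis: Long) -> Result
) : Fetcher {
    override fun fetch(request: Request): Result =
        transport(request.url, config.timeoutMillis)
}

fun main() {
    // The lambda stands in for a real HTTP call.
    val fetcher: Fetcher = FunctionFetcher(Configuration(timeoutMillis = 1_000)) { url, _ ->
        Result(statusCode = 200, body = "<html><body>stub for $url</body></html>")
    }
    println(fetcher.fetch(Request("https://example.com")).body)
}
```

With this shape, the configuration is consumed once by the adapter's constructor, while `fetch` only sees per-request data.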
Then, lastly, the user has to be able to “override” the client engine with any custom-built adapter in the same `skrape {}` block where things like `url` and `mode` are currently configured. It is unclear how this mechanism will (or won’t) supersede the `mode` setting.
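To illustrate the idea, here is a rough sketch of a builder block that accepts a custom fetcher next to `url`. The `Scraper` class, the `fetcher` property, `StubHttpFetcher` and this `skrape` function are all assumptions made for the example; they only imitate the shape of the real DSL and leave out `mode` entirely.

```kotlin
data class Request(val url: String)
data class Result(val statusCode: Int, val body: String)

interface Fetcher {
    fun fetch(request: Request): Result
}

// Stand-in for a built-in default client (hypothetical).
object StubHttpFetcher : Fetcher {
    override fun fetch(request: Request) = Result(200, "default client: ${request.url}")
}

// Hypothetical receiver of the builder block.
class Scraper {
    var url: String = ""
    var fetcher: Fetcher = StubHttpFetcher // used unless the caller overrides it
    fun scrape(): Result = fetcher.fetch(Request(url))
}

// Simplified imitation of a skrape {} entry point.
fun skrape(init: Scraper.() -> Unit): Result = Scraper().apply(init).scrape()

fun main() {
    val result = skrape {
        url = "https://example.com"
        // Override the client with a custom adapter.
        fetcher = object : Fetcher {
            override fun fetch(request: Request) = Result(200, "custom client: ${request.url}")
        }
    }
    println(result.body) // custom client: https://example.com
}
```

In this sketch the default fetcher plays the role that `mode` plays today, which is one possible answer to the open question of how the two settings would interact.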
Issue Analytics
- Created 3 years ago
- Comments: 5 (3 by maintainers)
What I could also imagine, and think would be a nice solution, is to pull the clients completely out of the skrape{it} core and just leave an interface that can then be implemented by users. We could deliver the implementations we already have (okhttp and html-unit) as optional dependencies. The longer I think about it, the more this is my preferred way: now that all the jsoup-related code is nicely separated, pulling out the non-native Kotlin parts (which are basically only the HTTP client implementations) should make it possible for skrape{it} to become a multiplatform library someday, which would be really great.
I think we could/should just kill the `mode` setting in that case, because the only thing the mode setting does is switch between the OkHttp client and the HtmlUnit client. Another feasible option seems to be a third mode option called CUSTOM or something similar. The best usability would be if users could pass the client implementation they want (either their own or pre-configured ones like the `HttpFetcher` or the `BrowserFetcher`). I would still like to ship the `HttpFetcher` and `BrowserFetcher` for people who don’t have the use case of implementing the HTTP client themselves, and to allow usage that is as easy and smooth as possible. I think the JS execution support that comes with the `BrowserFetcher` (HtmlUnit) in particular makes skrape{it} unique, but maybe it would need a more fitting name.
Because this is a really fundamental decision regarding the design and usage of the library, I added this issue to milestone 1.0.0. If we shipped the final 1.0.0 version without this feature, it could imply breaking changes later, which we should avoid after the first final release.
If this one and both issues regarding the matchers (which are basically just about replacing strikt in our src/main packages) are done (I will do it as soon as possible), we are good to go for the first final 1.0.0 release 💯 This makes me really happy 😃