"Accept-Encoding" is preventing content from being automatically decompressed
I’ve found a case where I need to specify the “Accept-Encoding” header in order to correctly access the content I’m attempting to scrape (without the header, the site presents a bot-detection captcha).
Example of the Lua script I’m passing to the execute API:
function main(splash)
  local url = splash.args.url
  splash:set_custom_headers({
    ["Connection"] = "keep-alive",
    ["Accept"] = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    ["Accept-Encoding"] = "gzip, deflate, sdch",
    ["Accept-Language"] = "en-US,en;q=0.8",
  })
  assert(splash:go(url))
  assert(splash:wait(3.0))
  return {
    html = splash:html(),
  }
end
An underlying issue appears to prevent the content from being decompressed automatically: https://github.com/scrapinghub/splash/blob/master/splash/proxy_server.py#L90
Is there a workaround to force decompression when the Accept-Encoding header is set?
Issue Analytics
- State:
- Created 8 years ago
- Reactions:4
- Comments:20 (3 by maintainers)
Top GitHub Comments
Why should this be a priority to fix?
There are websites where Splash with the ACCEPT_ENCODING = “identity” setting doesn’t return anything; the browser just hangs. I’m not sure if this is because the web server doesn’t have an internal cache and takes longer than 30 seconds (the default timeout) to render, or if it’s an active crawl-avoidance technique.
Common browsers default to an Accept-Encoding that is a mix of br, gzip, and deflate. Not sending the same values makes our proxy more “visible” as something other than a standard browser. Safari: br, gzip, deflate; Chrome: gzip, deflate, br; Firefox: gzip, deflate, br.
Not using compressed documents slows transfer times and reduces the overall throughput of the system. This is particularly true if you’re using a proxy, as there is a “double transfer” of the HTML: from the proxy to Splash, and then from Splash to the client.
It seems that what IS being transferred to the end client cannot be decompressed/inflated using (at least in the Ruby case) any of the standard libraries for inflating compressed HTML pages. I tried to fix this by taking what Splash delivers and handling it myself. But the resulting string is not compatible with any GZip or Brotli decompressor I can find in my native language (Ruby, and yes, I tried more than one). My actual client uses Faraday as a wrapper for HTTP requests, and the GZip that Faraday uses has no problems handling these sites directly, outside of Splash.
The net result is that I’m somewhat stuck. Help!
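One way to narrow down a payload like the one described above is to check whether the bytes still look like a compressed stream at all. The following is a minimal Python sketch, not part of Splash or of the original thread: the function name sniff_and_decode is hypothetical, and it only covers gzip and zlib-wrapped deflate (Brotli is not in the Python standard library, so it is left out).

```python
import gzip
import zlib
from typing import Tuple


def sniff_and_decode(body: bytes) -> Tuple[str, bytes]:
    """Guess the compression of `body`; return (label, decoded_bytes).

    Hypothetical helper for diagnosis only; it is not a Splash API.
    """
    # gzip streams start with the magic bytes 0x1f 0x8b.
    if body[:2] == b"\x1f\x8b":
        try:
            return "gzip", gzip.decompress(body)
        except (OSError, EOFError, zlib.error):
            return "corrupt-gzip", body
    # A zlib header has compression method 8 in the low nibble of the first
    # byte, and the first two bytes read as a big-endian integer are a
    # multiple of 31.
    if (
        len(body) >= 2
        and body[0] & 0x0F == 8
        and ((body[0] << 8) | body[1]) % 31 == 0
    ):
        try:
            return "zlib", zlib.decompress(body)
        except zlib.error:
            return "corrupt-zlib", body
    # Nothing recognizable: treat the body as uncompressed (or as damaged
    # beyond recognition).
    return "identity", body


if __name__ == "__main__":
    page = b"<html><body>hello</body></html>"
    print(sniff_and_decode(gzip.compress(page))[0])
    print(sniff_and_decode(page)[0])
```

If the compressed bytes were decoded as text somewhere along the way and then re-encoded, the magic bytes are destroyed and the sketch reports "identity" even though the body is unreadable. That would be consistent with no decompressor accepting the string, but it is only an assumption here; the thread does not confirm that this is what happens.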
@StasDeep And it’s actually a bug (or rather a feature) in Qt. See here.