
"Accept-Encoding" is preventing content from being automatically decompressed

See original GitHub issue

I’ve found a case where I need to specify the “Accept-Encoding” header in order to correctly access the content I’m attempting to scrape (without the header, the site presents a bot-detection captcha).

Example of the Lua script I’m passing to the execute API:

function main(splash)
  local url = splash.args.url
  -- Send browser-like headers so the site serves real content
  splash:set_custom_headers({
     ["Connection"] = "keep-alive",
     ["Accept"] = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
     ["Accept-Encoding"] = "gzip, deflate, sdch",
     ["Accept-Language"] = "en-US,en;q=0.8",
  })
  assert(splash:go(url))
  assert(splash:wait(3.0))
  return {
    html = splash:html(),
  }
end

It appears there is an underlying issue preventing the content from being automatically decompressed: https://github.com/scrapinghub/splash/blob/master/splash/proxy_server.py#L90

Is there a workaround to force decompression when the Accept-Encoding header is present?
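For illustration, here is a sketch (in Python, not part of Splash’s API) of the client-side fallback people usually reach for: inspecting the response’s Content-Encoding and decompressing the raw body yourself. The standard library covers gzip and deflate; sdch and br would need third-party packages.

```python
import gzip
import zlib


def decode_body(body: bytes, content_encoding: str) -> bytes:
    """Decompress a raw HTTP body according to its Content-Encoding.

    Handles gzip and deflate (both zlib-wrapped and raw streams) with
    the standard library; anything else is returned unchanged.
    """
    encoding = content_encoding.strip().lower()
    if encoding == "gzip":
        return gzip.decompress(body)
    if encoding == "deflate":
        try:
            return zlib.decompress(body)       # zlib-wrapped deflate
        except zlib.error:
            return zlib.decompress(body, -15)  # raw deflate stream
    return body  # identity, or an encoding we cannot handle here


# Example: a gzip-compressed page round-trips through the helper.
page = b"<html><body>hello</body></html>"
compressed = gzip.compress(page)
decoded = decode_body(compressed, "gzip")
assert decoded == page
```

This only helps if the body actually reaches the client still compressed and intact; if Splash (or an intermediary) has already partially decoded or truncated it, no client-side decompressor will recover it.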

Issue Analytics

  • State: open
  • Created: 8 years ago
  • Reactions: 4
  • Comments: 20 (3 by maintainers)

Top GitHub Comments

6 reactions
wflanagan commented, May 22, 2018

Why should this be a priority to fix?

  1. There are websites where Splash with the ACCEPT_ENCODING = “identity” setting doesn’t return anything; the browser just hangs. I’m not sure whether this is because the web server has no internal cache and takes longer than 30 seconds (the default timeout) to render, or whether it’s an active crawl-avoidance technique.

  2. Common browsers default to an Accept-Encoding that mixes br, gzip, and deflate. Not sending the same values increases the “visibility” of our proxy as a non-standard browser. Safari: br, gzip, deflate; Chrome: gzip, deflate, br; Firefox: gzip, deflate, br.

  3. Not using compressed documents slows transfer times and reduces the overall throughput of the system. This is particularly true if you’re using a proxy, as there is a “double transfer” of the HTML: from the proxy to Splash, and then from Splash to the client.

  4. It seems that what IS being transferred to the end client is not decompressible/inflatable using (at least in the Ruby case) any of the standard libraries for inflating compressed HTML pages. I tried to fix this by taking what Splash delivers and handling it myself, but the resulting string is not compatible with any gzip or Brotli decompressor I can find in my language (Ruby, and yes, I tried more than one). My actual client uses Faraday as a wrapper for HTTP requests, and the gzip handling in Faraday has no problem dealing with these sites directly, outside of Splash.

Net is that I’m somewhat stuck. Help!
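As an aside on point 4 (a Python sketch rather than Ruby, and only a guess at the cause): a common reason a “standard” inflater rejects a body is the zlib window-bits setting. With the default wbits, zlib expects a zlib wrapper and fails on gzip data; wbits=47 (32 + 15) auto-detects both formats.

```python
import gzip
import zlib

# Simulate a response body that arrived still gzip-compressed.
body = gzip.compress(b"<html><body>compressed page</body></html>")

# Default wbits=15 expects a zlib header, so a gzip stream is rejected
# with "incorrect header check".
try:
    zlib.decompress(body)
    raise AssertionError("expected zlib.error")
except zlib.error:
    pass

# wbits=47 (32 + zlib.MAX_WBITS) auto-detects zlib vs. gzip wrappers.
html = zlib.decompress(body, wbits=47)
```

The rough Ruby equivalent is Zlib::Inflate.new(32 + Zlib::MAX_WBITS). If that trick does not help either, the body is likely corrupted or truncated rather than merely mislabeled.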

1 reaction
blablacio commented, Aug 14, 2019

@StasDeep And it’s actually a bug (or rather a feature) in Qt. See here.


