Gzip encoding
Bug Report
Prerequisites
- [Y] Can you reproduce the problem in a MWE?
- [Y] Are you running the latest version of AngleSharp?
- [?] Did you check the FAQs to see if that helps you?
- [Y] Are you reporting to the correct repository? (There are multiple AngleSharp libraries, e.g., AngleSharp.Css for CSS support.)
- [Y] Did you perform a search in the issues?
For more information, see the CONTRIBUTING guide.
Description
I’m seeing the same issue as #416: the downloaded page is gzip-encoded and AngleSharp does not decompress it.
Steps to Reproduce
Does not work:
var config = Configuration.Default.WithLocaleBasedEncoding().WithDefaultLoader();
var address = "https://www.powerball.com";
var context = BrowsingContext.New(config);
var document = await context.OpenAsync(address);
Expected behavior:
The document should contain the decompressed plain-text content.
Actual behavior:
The document contains “garbled” gzipped content.
Environment details:
Windows 11 x64, .NET 7.0.302, AngleSharp 1.0.3
Possible Solution
Works:
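// Requires: using System.Net; using System.Net.Http; using AngleSharp;
// Let HttpClient decompress gzip/deflate responses before AngleSharp sees them.
HttpClientHandler handler = new HttpClientHandler()
{
    AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate
};
var http = new HttpClient(handler);
var body = await http.GetStringAsync("https://www.powerball.com");
// Parse the already-decompressed string; no loader is needed here.
var config = Configuration.Default;
var context = BrowsingContext.New(config);
var document = await context.OpenAsync(req => req.Content(body));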
I’m working around this by downloading the page with HttpClient and passing the body to AngleSharp. Is this the recommended approach? I searched for “org:AngleSharp gzip” and didn’t find any recommendations or FAQ guidance. I assume decompression should happen automatically out of the box, so maybe I’m missing something.
Issue Analytics
- State:
- Created 3 months ago
- Comments: 5 (4 by maintainers)
Top GitHub Comments
By the way, I could finally reproduce this. The server sometimes returns Brotli (“br”) compressed responses, even though the Accept-Encoding header tells it that only “deflate” and “gzip” are supported. In such a case we’ll now throw an exception. The response would have been gibberish anyway, and this way one can react. It should be a rare case though; this is certainly a problem on the web server.
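For anyone who wants to check this for themselves, here is a minimal sketch (not part of AngleSharp) that requests the page with plain HttpClient, advertises only gzip and deflate, and prints what the server claims to have sent. The class name ContentEncodingCheck and the helper IsAdvertised are hypothetical names made up for this illustration. Automatic decompression is switched off on purpose so that the Content-Encoding header is left on the response.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

public static class ContentEncodingCheck
{
    // Hypothetical helper: true if the encoding is one the request advertised.
    public static bool IsAdvertised(string encoding) =>
        encoding.Equals("gzip", StringComparison.OrdinalIgnoreCase) ||
        encoding.Equals("deflate", StringComparison.OrdinalIgnoreCase);

    public static async Task Main()
    {
        // No automatic decompression, so the Content-Encoding header stays visible.
        var handler = new HttpClientHandler
        {
            AutomaticDecompression = DecompressionMethods.None
        };
        using var http = new HttpClient(handler);

        using var request = new HttpRequestMessage(HttpMethod.Get, "https://www.powerball.com");
        // Advertise only gzip and deflate, as in the scenario described above.
        request.Headers.AcceptEncoding.ParseAdd("gzip");
        request.Headers.AcceptEncoding.ParseAdd("deflate");

        using var response = await http.SendAsync(request, HttpCompletionOption.ResponseHeadersRead);

        // Report each encoding the server declared and whether the client asked for it.
        foreach (var encoding in response.Content.Headers.ContentEncoding)
        {
            var note = IsAdvertised(encoding) ? "advertised" : "NOT advertised by the client";
            Console.WriteLine($"Content-Encoding: {encoding} ({note})");
        }
        if (response.Content.Headers.ContentEncoding.Count == 0)
        {
            Console.WriteLine("Content-Encoding: (none)");
        }
    }
}
```

Since the server only misbehaves sometimes, a single run may not show “br”; repeating the request a few times is probably necessary.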
Yes, the requester coming with AngleSharp is not using the HttpClient and should only be used in simple cases. If you heavily rely on IO then you should use AngleSharp.Io.
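A rough sketch of what switching to AngleSharp.Io might look like follows. It assumes the AngleSharp.Io NuGet package is installed and that its WithRequesters() configuration extension is available as shown in that package's README; please verify against the version you use. Whether this alone resolves the compression problem for this particular site has not been checked here.

```csharp
using System;
using System.Threading.Tasks;
using AngleSharp;
using AngleSharp.Io;

public static class Program
{
    public static async Task Main()
    {
        var config = Configuration.Default
            .WithRequesters()      // assumption: extension from the AngleSharp.Io package (HttpClient-based requesters)
            .WithDefaultLoader();  // from AngleSharp itself

        var context = BrowsingContext.New(config);
        var document = await context.OpenAsync("https://www.powerball.com");

        // Print the title as a quick sanity check that real HTML was parsed.
        Console.WriteLine(document.Title);
    }
}
```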
I’ll see whether this is a general problem with the requester (not being able to process gzip) or something specific to the page. If it’s a general problem, then we need to drop “gzip” from the accepted encodings.