Errors While Decoding Response Text Using mitmdump
Problem Description
When I visit websites with Arabic characters through mitmdump with a small addon script and read the response text, I get the following error:
Traceback (most recent call last):
  File "main.py", line 36, in response
    response_text = flow.response.text
  File "c:\users\evead-61\appdata\local\programs\python\python38\lib\site-packages\mitmproxy\net\http\message.py", line 232, in get_text
    return cast(str, encoding.decode(content, enc))
  File "c:\users\evead-61\appdata\local\programs\python\python38\lib\site-packages\mitmproxy\net\http\encoding.py", line 76, in decode
    raise ValueError("{} when decoding {} with {}: {}".format(
ValueError: UnicodeDecodeError when decoding b'GIF89a\x with 'UTF-8': UnicodeDecodeError('utf-8', b'GIF89a\x01\x00\x01\x00\xf0\x00\x00\x00\x00\x00\x00\x00\x00!\xf9\x04\x01\x00\x00\x00\x00,\x00\x00\x00\x00\x01\x00\x01\x00\x00\x02\x02D\x01\x00;
Steps to reproduce the behavior:
- Write a small addon that reads the HTTP response text in the “response()” method
- Assign flow.response.text to a variable
- Run using mitmdump -s main.py --anticomp (assuming your file is called main.py)
- You can try it on this website: chouftv.ma
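A minimal sketch of such an addon, written so it does not crash on undecodable bodies. It relies only on what the traceback above shows: in 6.0.2, flow.response.text raises ValueError when the body cannot be decoded. The helper name extract_text and the use of the standard logging module are assumptions made for illustration, not part of mitmproxy.

```python
import logging
from typing import Optional

log = logging.getLogger(__name__)


def extract_text(flow) -> Optional[str]:
    """Return the decoded body, or None if it cannot be decoded.

    Hypothetical helper: it only wraps the flow.response.text access
    that fails in the traceback above.
    """
    try:
        return flow.response.text
    except ValueError as exc:
        # Binary bodies (e.g. a GIF tracking pixel) end up here.
        log.warning("skipping undecodable body from %s: %s", flow.request.url, exc)
        return None


def response(flow) -> None:
    """mitmproxy event hook, called for each HTTP response."""
    response_text = extract_text(flow)
    if response_text is None:
        return
    # Work with the decoded text here.
    log.info("decoded %d characters from %s", len(response_text), flow.request.url)
```

Saved as main.py, this runs with the same mitmdump -s main.py --anticomp command. It avoids the crash in the addon but does not change how mitmproxy guesses the encoding.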
System Information
Mitmproxy: 6.0.2
Python: 3.8.7
OpenSSL: OpenSSL 1.1.1i 8 Dec 2020
Platform: Windows-10-10.0.17763-SP0
Issue Analytics
- Created 3 years ago
- Comments: 7 (6 by maintainers)
Top GitHub Comments
ouch oof owie my bytes
this comes from the https://collector.githubapp.com/github/page_view tracking pixel
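To illustrate why a tracking pixel triggers the error regardless of the page's language, here is a small sketch using the first bytes of the body quoted in the traceback. The function name decodes_as_utf8 is made up for this example.

```python
# First bytes of the body quoted in the traceback above (a GIF header).
pixel_prefix = b"GIF89a\x01\x00\x01\x00\xf0\x00\x00"


def decodes_as_utf8(data: bytes) -> bool:
    """Return True if data is valid UTF-8, False otherwise."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# 0xf0 starts a multi-byte UTF-8 sequence, but the next byte (0x00)
# is not a continuation byte, so strict decoding fails.
pixel_is_text = decodes_as_utf8(pixel_prefix)

# Arabic text encoded as UTF-8 decodes without any problem.
arabic_is_text = decodes_as_utf8("شوف".encode("utf-8"))
```

So the Arabic content itself is not the culprit; the binary GIF body is.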
So I guess we need to be more intelligent when doing guess_encoding (_get_content_type_charset())? @mhils
I was using
and visiting GitHub will cause (on master)
I personally prefer this option since the problem cannot really be solved. Being able to replace stuff inside binary bodies is neat, e.g. search & replace meta data in images or pdfs. And I guess latin-1 gets that job done and keeping it for backwards compat is nice. I would try to expand our heuristics and add new special cases as we find them. I assume that’s basically what browser vendors do but by now they’ve seen 99.9999% of weird shit.
Now the fun begins. We can fall back to latin-1 for image/* but not for image/svg+xml. Same for audio/*, video/* and application/octet-stream.
If we can agree that this is a valid solution I’ll grab a list of common mime types and improve the guessing we currently have.
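The heuristic proposed above could look roughly like the sketch below. This is not mitmproxy's actual implementation. The names fallback_encoding and replace_in_binary are hypothetical, and the rule set contains only the mime types mentioned in the comment. latin-1 is attractive as a fallback because it maps every byte to exactly one character, so decoding and re-encoding never loses data.

```python
from typing import Optional

# Only the types named in the comment above; a real list would be longer.
BINARY_PREFIXES = ("image/", "audio/", "video/")
BINARY_TYPES = {"application/octet-stream"}
TEXT_EXCEPTIONS = {"image/svg+xml"}  # XML text, despite the image/ prefix


def fallback_encoding(content_type: Optional[str]) -> Optional[str]:
    """Return "latin-1" for known binary mime types, else None.

    None means "no opinion": the caller keeps its usual charset guessing.
    """
    if not content_type:
        return None
    # Drop parameters such as "; charset=utf-8" and normalise case.
    mime = content_type.split(";", 1)[0].strip().lower()
    if mime in TEXT_EXCEPTIONS:
        return None
    if mime in BINARY_TYPES or mime.startswith(BINARY_PREFIXES):
        return "latin-1"
    return None


def replace_in_binary(body: bytes, old: str, new: str) -> bytes:
    """Search & replace inside a binary body via a latin-1 round trip.

    Assumes old and new contain only characters representable in latin-1.
    """
    return body.decode("latin-1").replace(old, new).encode("latin-1")
```

For example, fallback_encoding("image/gif") gives "latin-1" while fallback_encoding("image/svg+xml") gives None, and replace_in_binary leaves every byte outside the replaced span untouched.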