Errors While Decoding Response Text Using mitmdump

See original GitHub issue

Problem Description

When I visit websites with Arabic characters while running mitmdump with a small addon script that reads the response text, I get the following error:

Traceback (most recent call last):
  File "main.py", line 36, in response
    response_text = flow.response.text
  File "c:\users\evead-61\appdata\local\programs\python\python38\lib\site-packages\mitmproxy\net\http\message.py", line 232, in get_text
    return cast(str, encoding.decode(content, enc))
  File "c:\users\evead-61\appdata\local\programs\python\python38\lib\site-packages\mitmproxy\net\http\encoding.py", line 76, in decode
    raise ValueError("{} when decoding {} with {}: {}".format(
ValueError: UnicodeDecodeError when decoding b'GIF89a\x with 'UTF-8': UnicodeDecodeError('utf-8', b'GIF89a\x01\x00\x01\x00\xf0\x00\x00\x00\x00\x00\x00\x00\x00!\xf9\x04\x01\x00\x00\x00\x00,\x00\x00\x00\x00\x01\x00\x01\x00\x00\x02\x02D\x01\x00;

Steps to reproduce the behavior:

1. Write a small addon that defines the response() hook.
2. In the hook, assign flow.response.text to a variable.
3. Run it with mitmdump -s main.py --anticomp (assuming your file is called main.py).
4. Visit a site such as chouftv.ma.
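
The steps above correspond to an addon along these lines. This is only a sketch, not the reporter's actual main.py: the try/except and the `failures` list are additions for diagnosis, so that the offending URL and Content-Type get logged instead of the hook crashing. It relies on the ValueError shown in the traceback above.

```python
# main.py: run with `mitmdump -s main.py --anticomp`

# Hypothetical helper state: (url, content-type) pairs for responses
# whose body could not be decoded with the declared charset.
failures = []


def response(flow):
    """mitmproxy event hook, called once for every completed HTTP response."""
    try:
        # This is the attribute access that raises in the report above.
        response_text = flow.response.text
    except ValueError:
        # Added for diagnosis: record what the server claimed the body was.
        content_type = flow.response.headers.get("content-type", "")
        failures.append((flow.request.url, content_type))
        print(f"undecodable body: {flow.request.url} ({content_type})")
        return

    # .text can be None when the response has no body.
    if response_text is not None:
        print(flow.request.url, len(response_text))
```

With this in place, binary bodies that arrive with a text charset in the header show up as "undecodable body" lines rather than as addon errors.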

System Information

Mitmproxy: 6.0.2
Python: 3.8.7
OpenSSL: OpenSSL 1.1.1i 8 Dec 2020
Platform: Windows-10-10.0.17763-SP0

Issue Analytics

Created 3 years ago
Comments: 7 (6 by maintainers)

Top GitHub Comments

Prinzhorn commented, Jan 27, 2021

ouch oof owie my bytes

content-type: image/gif; charset=utf-8

this comes from the https://collector.githubapp.com/github/page_view tracking pixel

So I guess we need to be more intelligent when doing guess_encoding (_get_content_type_charset())? @mhils

I was using

def response(flow):
    print(flow.request.url)
    print(len(flow.response.text))

and visiting GitHub will cause (on master)

Addon error: Traceback (most recent call last):
  File "/home/alex/Projects/super-top-secret/src/forks/mitmproxy/mitmproxy/net/http/encoding.py", line 67, in decode
    decoded = custom_decode[encoding](encoded)
KeyError: 'utf-8'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/alex/Projects/super-top-secret/src/forks/mitmproxy/mitmproxy/net/http/encoding.py", line 69, in decode
    decoded = codecs.decode(encoded, encoding, errors)  # type: ignore
  File "/usr/lib/python3.8/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 10: invalid start byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/alex/Projects/super-top-secret/src/issues/mitmproxy-4415/main.py", line 3, in response
    print(len(flow.response.text))
  File "/home/alex/Projects/super-top-secret/src/forks/mitmproxy/mitmproxy/net/http/message.py", line 244, in get_text
    return cast(str, encoding.decode(content, enc))
  File "/home/alex/Projects/super-top-secret/src/forks/mitmproxy/mitmproxy/net/http/encoding.py", line 76, in decode
    raise ValueError("{} when decoding {} with {}: {}".format(
ValueError: UnicodeDecodeError when decoding b'GIF89a\x with 'utf-8': UnicodeDecodeError('utf-8', b'GIF89a\x01\x00\x01\x00\x80\xff\x00\xff\xff\xff\x00\x00\x00,\x00\x00\x00\x00\x01\x00\x01\x00\x00\x02\x02D\x01\x00;', 10, 11, 'invalid start byte')
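
For reference, the failure does not depend on mitmproxy at all. Below is a minimal sketch using only the standard library and the pixel bytes from the traceback above. The helper name `try_decode` is made up for illustration. It shows that the body cannot be decoded with the charset the header claims, while latin-1 maps every byte to a character and round-trips losslessly.

```python
# The 1x1 GIF body from the traceback above, served as
# "content-type: image/gif; charset=utf-8".
PIXEL = (
    b"GIF89a\x01\x00\x01\x00\x80\xff\x00\xff\xff\xff\x00\x00\x00,"
    b"\x00\x00\x00\x00\x01\x00\x01\x00\x00\x02\x02D\x01\x00;"
)


def try_decode(data: bytes, charset: str):
    """Return (text, None) on success or (None, error) on failure."""
    try:
        return data.decode(charset), None
    except UnicodeDecodeError as exc:
        return None, exc


# Decoding with the declared charset fails on the first non-UTF-8 byte.
utf8_text, utf8_error = try_decode(PIXEL, "utf-8")
print(utf8_error)

# latin-1 never fails, and encoding the result gives back the same bytes.
latin1_text, latin1_error = try_decode(PIXEL, "latin-1")
print(latin1_text.encode("latin-1") == PIXEL)
```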

Prinzhorn commented, Feb 27, 2021

Do a better job at guessing. TL;DR, this may fix some occasions, but doesn’t solve the problem. What’s the proper encoding of the tracking pixel above? “binary” is not a valid encoding.

I personally prefer this option since the problem cannot really be solved. Being able to replace stuff inside binary bodies is neat, e.g. search & replace meta data in images or pdfs. And I guess latin-1 gets that job done and keeping it for backwards compat is nice. I would try to expand our heuristics and add new special cases as we find them. I assume that’s basically what browser vendors do but by now they’ve seen 99.9999% of weird shit.

Now the fun begins. We can fall back to latin-1 for image/* but not for image/svg+xml. Same for audio/*, video/* and application/octet-stream.

If we can agree that this is a valid solution I’ll grab a list of common mime types and improve the guessing we currently have.
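
The fallback rule described above could be sketched roughly as follows. This is an illustration of the proposal only, not mitmproxy's actual code: the function name `fallback_encoding` and the constants are hypothetical, and the type list contains only the cases named in this comment.

```python
from typing import Optional

# Hypothetical constants: only the types mentioned in the comment above.
BINARY_PREFIXES = ("image/", "audio/", "video/")
BINARY_TYPES = {"application/octet-stream"}
TEXT_EXCEPTIONS = {"image/svg+xml"}  # XML text despite the image/ prefix


def fallback_encoding(content_type: str, declared: Optional[str]) -> Optional[str]:
    """Pick the encoding to decode a body with.

    Binary media types get latin-1, which maps every byte to a character,
    regardless of the charset the header declares. Everything else keeps
    the declared charset.
    """
    # Drop parameters such as "; charset=utf-8" and normalize case.
    mime = content_type.split(";", 1)[0].strip().lower()
    if mime in TEXT_EXCEPTIONS:
        return declared
    if mime.startswith(BINARY_PREFIXES) or mime in BINARY_TYPES:
        return "latin-1"
    return declared


print(fallback_encoding("image/gif; charset=utf-8", "utf-8"))
print(fallback_encoding("image/svg+xml; charset=utf-8", "utf-8"))
```

Checking the exception set before the prefix match is what keeps image/svg+xml on the text path while the rest of image/* falls back to latin-1.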