
`UnicodeDecodeError` if commit messages contain Unicode characters


Description

If I run

cz changelog

and the commit messages contain Unicode characters like 🤦🏻‍♂️ (whose first two code points form the eight-byte UTF-8 sequence \xf0\x9f\xa4\xa6 \xf0\x9f\x8f\xbb), then I get the following traceback:

Traceback (most recent call last):
  File "/.../.venv/bin/cz", line 8, in <module>
    sys.exit(main())
  File "/.../.venv/lib/python3.10/site-packages/commitizen/cli.py", line 389, in main
    args.func(conf, vars(args))()
  File "/.../.venv/lib/python3.10/site-packages/commitizen/commands/changelog.py", line 143, in __call__
    commits = git.get_commits(
  File "/.../.venv/lib/python3.10/site-packages/commitizen/git.py", line 98, in get_commits
    c = cmd.run(command)
  File "/.../.venv/lib/python3.10/site-packages/commitizen/cmd.py", line 32, in run
    stdout.decode(chardet.detect(stdout)["encoding"] or "utf-8"),
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/encodings/cp1254.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 1689: character maps to <undefined>

The result of chardet.detect() here

https://github.com/commitizen-tools/commitizen/blob/2ff9f155435b487057ce5bd8e32e1ab02fd57c94/commitizen/cmd.py#L26

is:

{'encoding': 'Windows-1254', 'confidence': 0.6864215607255395, 'language': 'Turkish'}

That is a surprising encoding prediction with low confidence. Because of it, commitizen picks the incorrect codec, and decoding the bytes then fails. Using decode("utf-8") works fine. Issue https://github.com/chardet/chardet/issues/148 looks related to this.
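The failure can be reproduced without git or chardet. This is only a minimal sketch, assuming Python's built-in cp1254 codec is the one chardet's "Windows-1254" guess resolves to (the traceback above points at encodings/cp1254.py); the variable names are made up for illustration.

```python
# The first two code points of the emoji from the report:
# U+1F926 (face palm) followed by U+1F3FB (light skin tone modifier).
data = "\U0001F926\U0001F3FB".encode("utf-8")
print(data)  # the eight bytes quoted in the report

# Decoding as UTF-8 round-trips cleanly.
assert data.decode("utf-8") == "\U0001F926\U0001F3FB"

# cp1254 has no mapping for byte 0x8f, so strict decoding raises.
failed_byte = None
try:
    data.decode("cp1254")
except UnicodeDecodeError as exc:
    failed_byte = data[exc.start]
    print(f"cannot decode byte {failed_byte:#x} at position {exc.start}")
```

The offending byte is the same 0x8f that the traceback complains about, only at a different offset because the real git output is much longer.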

I think the fix would be to replace these lines of code with something like this:

stdout, stderr = process.communicate()
return_code = process.returncode
try:
    stdout_s = stdout.decode("utf-8")  # Try this one first.
except UnicodeDecodeError:
    result = chardet.detect(stdout)  # Final result of the UniversalDetector's prediction.
    # Consider checking confidence value of the result?
    stdout_s = stdout.decode(result["encoding"])
try:
    stderr_s = stderr.decode("utf-8")  # Try this one first.
except UnicodeDecodeError:
    result = chardet.detect(stderr)  # Final result of the UniversalDetector's prediction.
    # Consider checking confidence value of the result?
    stderr_s = stderr.decode(result["encoding"])
return Command(stdout_s, stderr_s, stdout, stderr, return_code)
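The same idea can be factored into one helper so stdout and stderr share the logic. What follows is a rough sketch, not commitizen's actual code: `decode_output`, the `Detector` alias and the `min_confidence` threshold of 0.9 are all hypothetical names and values. The detector is passed in as a parameter so the sketch runs without chardet installed. It only assumes the callable returns a dict with "encoding" and "confidence" keys, as the chardet.detect() result shown above does.

```python
from typing import Callable, Optional

# Hypothetical alias: any callable shaped like chardet.detect(), i.e. one that
# returns a dict with "encoding" and "confidence" keys.
Detector = Callable[[bytes], dict]


def decode_output(
    raw: bytes,
    detect: Optional[Detector] = None,
    min_confidence: float = 0.9,  # arbitrary illustrative threshold
) -> str:
    """Decode subprocess output, trying UTF-8 before any statistical guess."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        if detect is None:
            raise  # no detector available: surface the UTF-8 error
        result = detect(raw)
        encoding = result.get("encoding")
        confidence = result.get("confidence") or 0.0
        if not encoding or confidence < min_confidence:
            raise  # guess is missing or too weak: surface the UTF-8 error
        return raw.decode(encoding)
```

With chardet installed, this could be called as decode_output(stdout, detect=chardet.detect). Re-raising on a weak guess is one possible policy; falling back to errors="replace" would be another, at the cost of silently altering text.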

Steps to reproduce

Well, I suppose you can add a few commits to a local branch and go crazy with lots of text and funky Unicode characters (emojis with skin tones, flags, etc.), and then attempt to create a changelog.

Current behavior

cz throws an exception.

Desired behavior

cz creates a changelog.

Screenshots

No response

Environment

> cz version
2.29.3
> python --version
Python 3.10.5
> uname -a
Darwin pooh 18.7.0 Darwin Kernel Version 18.7.0: Mon Feb 10 21:08:45 PST 2020; root:xnu-4903.278.28~1/RELEASE_X86_64 x86_64 i386 Darwin

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 17 (15 by maintainers)

Top GitHub Comments

2 reactions
jenstroeger commented, Aug 7, 2022

@gpongelli this issue isn't really a bug and it's not about emojis.

The problem in this issue is a sequence of bytes that contains UTF-8 encoded text but whose encoding is mispredicted as Windows-1254. Based on that misprediction, commitizen picks the incorrect codec to decode the bytes, and that fails.

Thus, my proposed solution is to try to decode the bytes using the UTF-8 codec first because that's the common text encoding across platforms these days. Only if that fails, invoke some statistical analysis (e.g. chardet) to predict the text encoding (see also chardet FAQ).

Python encodes text as UTF-8 by default, but it also provides a large number of other text codecs you should consider when testing. I think, though, that UTF-8 is the common default encoding these days on many platforms.
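To illustrate why the choice of codec matters when testing, here is a small sketch that decodes one UTF-8 byte string with a few of Python's built-in codecs. The sample bytes and the codec list are picked purely for illustration.

```python
# "s" followed by U+201D (RIGHT DOUBLE QUOTATION MARK), encoded as UTF-8.
raw = "s\u201d".encode("utf-8")

results = {}
for codec in ("utf-8", "latin-1", "cp1254"):
    try:
        results[codec] = raw.decode(codec)
    except UnicodeDecodeError:
        results[codec] = None  # this codec rejects the bytes outright

for codec, text in results.items():
    print(codec, repr(text))
```

Only utf-8 returns the original text. latin-1 never raises because it maps every byte to some character, so it silently yields the wrong string, while cp1254 raises on the byte 0x9d. A silent wrong answer is arguably worse than an exception, which is another reason to try UTF-8 first.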

2 reactions
jenstroeger commented, Aug 5, 2022

@KyleKing actually…

I'm not sure if the change would work, but it might be good to add a test case for STDOUT like: bytes([0x73, 0xe2, 0x80, 0x9d]).

That byte sequence encodes two characters:

>>> bytes([0x73, 0xe2, 0x80, 0x9d]).decode()
's”'

where the bytes e2 80 9d are the UTF-8 encoding of ” (U+201D, or “RIGHT DOUBLE QUOTATION MARK”). Injecting U+FFFD whenever a character can't be decoded, by using the replace error handler, may have undesired consequences:

>>> bytes([0x73, 0xe2, 0x80]).decode(encoding="utf8", errors="replace")
'sļæ½'

which may be confusing to peopleā€”my personal preference would be failure.

Judging from your stacktrace, I see a

  File "/usr/local/Cellar/python@3.9/3.9.13_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/encodings/cp1254.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
           │                     │     │      └ '\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$...
           │                     │     └ 'strict'
           │                     └ 
           └ 
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 32110: character maps to <undefined>

which indicates that decode() tries to interpret your bytes as a Windows-1254 encoded string (note the encodings/cp1254.py where the exception originates), and that fails because it's a UTF-8 encoded string:

>>> bytes([0x73, 0xe2, 0x80, 0x9d]).decode(encoding="utf-8")
's”'
>>> bytes([0x73, 0xe2, 0x80, 0x9d]).decode(encoding="windows-1254")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/encodings/cp1254.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 3: character maps to <undefined>

Take a look at https://github.com/chardet/chardet/issues/148 for some more details, and at PR #545 which addresses this very issue (i.e. mispredicting a bytes object as Windows 1254 encoded string, instead of a UTF-8 encoded string).
