`UnicodeDecodeError` if commit messages contain Unicode characters
Description
If I run
cz changelog
and the commit messages contain Unicode characters like 🤦🏻 (an eight-byte UTF-8 sequence: \xf0\x9f\xa4\xa6 \xf0\x9f\x8f\xbb), then I get the following traceback:
Traceback (most recent call last):
  File "/.../.venv/bin/cz", line 8, in <module>
    sys.exit(main())
  File "/.../.venv/lib/python3.10/site-packages/commitizen/cli.py", line 389, in main
    args.func(conf, vars(args))()
  File "/.../.venv/lib/python3.10/site-packages/commitizen/commands/changelog.py", line 143, in __call__
    commits = git.get_commits(
  File "/.../.venv/lib/python3.10/site-packages/commitizen/git.py", line 98, in get_commits
    c = cmd.run(command)
  File "/.../.venv/lib/python3.10/site-packages/commitizen/cmd.py", line 32, in run
    stdout.decode(chardet.detect(stdout)["encoding"] or "utf-8"),
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/encodings/cp1254.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 1689: character maps to <undefined>
The result of chardet.detect() here is:
{'encoding': 'Windows-1254', 'confidence': 0.6864215607255395, 'language': 'Turkish'}
An interesting character-encoding prediction with low confidence: it picks the incorrect codec, and decoding the bytes then fails. Decoding with decode("utf-8") directly works fine. Issue https://github.com/chardet/chardet/issues/148 looks related to this.
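The mismatch can be reproduced in plain Python without chardet at all; the byte values below are the ones quoted above, and position 6 (byte 0x8f) is unmapped in the Windows-1254 code page:

```python
# The problematic commit-message bytes from above: UTF-8 for the emoji.
data = "🤦🏻".encode("utf-8")
assert data == b"\xf0\x9f\xa4\xa6\xf0\x9f\x8f\xbb"

# Decoding as UTF-8 works fine.
assert data.decode("utf-8") == "🤦🏻"

# Decoding with the mispredicted Windows-1254 codec fails, because
# byte 0x8f is not mapped in that code page.
try:
    data.decode("cp1254")
except UnicodeDecodeError as exc:
    print(exc)  # 'charmap' codec can't decode byte 0x8f ...
```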
I think the fix would be something like the following, replacing these lines of code:

stdout, stderr = process.communicate()
return_code = process.returncode
try:
    stdout_s = stdout.decode("utf-8")  # Try this one first.
except UnicodeDecodeError:
    result = chardet.detect(stdout)  # Final result of the UniversalDetector's prediction.
    # Consider checking the confidence value of the result?
    stdout_s = stdout.decode(result["encoding"])
try:
    stderr_s = stderr.decode("utf-8")  # Try this one first.
except UnicodeDecodeError:
    result = chardet.detect(stderr)  # Final result of the UniversalDetector's prediction.
    # Consider checking the confidence value of the result?
    stderr_s = stderr.decode(result["encoding"])
return Command(stdout_s, stderr_s, stdout, stderr, return_code)
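The duplicated try/except above could also be factored into a small helper. This is only a sketch, not actual commitizen code: decode_output and the injected detect callable are hypothetical names, chosen so the fallback logic is testable without the chardet dependency (in commitizen, detect would be a wrapper around chardet.detect(raw)["encoding"]):

```python
from typing import Callable, Optional

def decode_output(raw: bytes, detect: Callable[[bytes], Optional[str]]) -> str:
    """Decode subprocess output: UTF-8 first, statistical detection only as a fallback."""
    try:
        # UTF-8 is the common default encoding on modern platforms.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Only when UTF-8 genuinely fails, ask the injected detector for a guess.
        encoding = detect(raw) or "utf-8"
        return raw.decode(encoding)

# Valid UTF-8 never invokes the detector; bytes that are valid cp1252
# but invalid UTF-8 fall through to it.
print(decode_output("🤦🏻 fix".encode("utf-8"), lambda b: None))
print(decode_output("häagen".encode("cp1252"), lambda b: "cp1252"))
```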
Steps to reproduce
Well, I suppose you can add a few commits to a local branch, go crazy with lots of text and funky Unicode characters (emojis with skin tones, flags, etc.), and then attempt to create a changelog.
Current behavior
cz throws an exception.
Desired behavior
cz creates a changelog.
Environment
> cz version
2.29.3
> python --version
Python 3.10.5
> uname -a
Darwin pooh 18.7.0 Darwin Kernel Version 18.7.0: Mon Feb 10 21:08:45 PST 2020; root:xnu-4903.278.28~1/RELEASE_X86_64 x86_64 i386 Darwin
Issue Analytics
- Created a year ago
- Comments: 17 (15 by maintainers)
Top GitHub Comments
@gpongelli this issue isn't really a bug, and it's not about emojis.
The problem in this issue is a sequence of bytes containing UTF-8 encoded text whose encoding is mispredicted as Windows-1254. Based on that misprediction, commitizen picks the incorrect codec to decode/interpret the bytes, and that fails.
Thus, my proposed solution is to try to decode the bytes using the UTF-8 codec first, because that is the common text encoding across platforms these days. Only if that fails, invoke some statistical analysis (e.g. chardet) to predict the text encoding (see also the chardet FAQ).
Python encodes text as UTF-8 by default, but it also provides a large number of other text codecs you should consider when testing. I think, though, that UTF-8 is the common default encoding these days on many platforms.
@KyleKing actually…
That byte sequence encodes two characters; the bytes e2 80 9d are the UTF-8 encoding of ” (U+201D, RIGHT DOUBLE QUOTATION MARK). Injecting U+FFFD whenever a character can't be decoded, as the replace error handler does, may have undesired consequences which may be confusing to people; my personal preference would be failure.
Judging from your stack trace, decode() tries to interpret your bytes as a Windows-1254 encoded string (note the encodings/cp1254.py where the exception originates), and that fails because it is actually a UTF-8 encoded string. Take a look at https://github.com/chardet/chardet/issues/148 for some more details, and at PR #545, which addresses this very issue (i.e. mispredicting a bytes object as a Windows-1254 encoded string instead of a UTF-8 encoded string).
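The two behaviors discussed above, strict failure versus errors="replace" injecting U+FFFD, can be seen directly in Python; the truncated byte string below is a made-up example, not from the issue:

```python
# e2 80 9d is the UTF-8 encoding of U+201D (RIGHT DOUBLE QUOTATION MARK).
assert b"\xe2\x80\x9d".decode("utf-8") == "\u201d"

# Strict decoding of malformed input fails loudly (the preference stated above) ...
try:
    b"\xe2\x80".decode("utf-8")
except UnicodeDecodeError:
    print("strict decoding fails")

# ... while errors="replace" silently substitutes U+FFFD, which can hide corruption.
assert b"\xe2\x80".decode("utf-8", errors="replace") == "\ufffd"
```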