UTF-8 – CP1252 encoding issue in exported HTML report
System Information
- OS: Windows 10
- Python version: 3.8.10
- Python environment: conda
- Using jupyter: true
- Datapane version: 0.11.11
Bug / Issue
When displaying a pandas dataframe in DataPane as a Table (not DataTable, which does work correctly), euro sign characters (€) display as â¬:
This doesn’t happen inside JupyterLab, or when exporting the original dataframe to HTML using df.to_html(). I am calling report.save() rather than upload, as I want to generate local HTML reports.
In #9 you mention it could be an issue with Windows’ default encoding not being UTF-8. Are there any steps I should take to fix this?
Thank you!
Top GitHub Comments
@inigohidalgo thanks for the detailed investigation - that really helped!
There were 2 different issues here, one in dp.Table and one in dp.File. The dp.File one actually turned out to be a Windows / Python compatibility issue, where Python defaults to using the locale encoding when reading files if one is not specified. This is usually OK, but many tools on Windows now create UTF-8 files by default, including Notepad, so you can have a case where the file is UTF-8 but Python is reading it using windows-1252 encoding, resulting in the issue both here and in #115. We’ve fixed that now by attempting to autodetect UTF-8 files on Windows and decoding accordingly. (There is actually a PEP to change the default read behaviour, see https://discuss.python.org/t/pep-597-use-utf-8-for-default-text-file-encoding/1819)
We’ve just released a fix to both of these that will be pushed to pip in a few days.
Thanks again!
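For anyone hitting the same locale-default problem in their own code, here is a minimal sketch of the "try UTF-8 first, then fall back to the locale encoding" idea mentioned above. The helper name read_text_prefer_utf8 is hypothetical, and this is not Datapane's actual fix, only one way such autodetection could look.

```python
import locale
import tempfile
from pathlib import Path


def read_text_prefer_utf8(path):
    """Hypothetical helper: try UTF-8 first, then fall back to the locale encoding."""
    data = Path(path).read_bytes()
    try:
        # "utf-8-sig" also strips a leading BOM, which some Windows editors write.
        return data.decode("utf-8-sig")
    except UnicodeDecodeError:
        # Not valid UTF-8, so assume the platform default (e.g. cp1252 on Windows).
        return data.decode(locale.getpreferredencoding(False), errors="replace")


# Write a UTF-8 file containing a euro sign, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.md"
    sample.write_text("Price: 10 €", encoding="utf-8")
    text = read_text_prefer_utf8(sample)

# ascii() keeps the output printable on consoles that cannot show "€".
print(ascii(text))
```

When you control the call site, passing encoding="utf-8" to open() directly is simpler and avoids any guessing.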
Some further info. In #115 I saw OP was having this issue with text too whereas text worked fine for me, but he was importing a md file while I was sending the text through directly from Python as strings. Turns out I have the same issue. The following code leads to the output I attach below.
The only text correctly displayed is the one where I manually specified the UTF-8 encoding. This is because, from what I gather, Python uses locale.getpreferredencoding() to decode by default, which on my machine is cp1252, so maybe JS is doing the same?
The HTML tables being generated when saving as temp files are UTF-8 encoded, and the code is then reading the bytes in the line I referenced in this comment.
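The default-encoding behaviour described above can be checked on any machine with a short script. This is only a sketch: the file name table.html and its contents are made up for illustration, and the result of the second read depends on the local locale.

```python
import locale
import tempfile
from pathlib import Path

# Encoding Python uses for open() when none is given ("cp1252" on many Windows setups).
default_encoding = locale.getpreferredencoding(False)
print("default encoding:", default_encoding)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "table.html"  # illustrative file name
    # Write the bytes as UTF-8, as the temp HTML tables reportedly are.
    path.write_bytes("<td>€</td>".encode("utf-8"))

    # Explicit encoding: always round-trips correctly.
    with open(path, encoding="utf-8") as f:
        explicit = f.read()

    # Default encoding: the result depends on the machine's locale.
    with open(path, errors="replace") as f:
        implicit = f.read()

print(ascii(explicit))
print(ascii(implicit))
```

If the two printed values differ, the default encoding on that machine is not UTF-8 and any read without an explicit encoding is at risk.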
The way the characters are displaying seems to 100% be due to decoding UTF-8 encoded text using the cp1252 character map, since when I read the bytes myself and decode using cp1252 this is what happens, note the last line:
Which as you can see is the same as the wrongly displayed text in the generated report.
And matches up with the mismatched character from this list:
https://www.i18nqa.com/debug/utf8-debug.html
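The mismatch described above is easy to reproduce in isolation. A minimal sketch, using only built-in str/bytes methods, that misreads the UTF-8 bytes of the euro sign as cp1252 and then reverses the mistake:

```python
euro_bytes = "€".encode("utf-8")        # the three bytes E2 82 AC
garbled = euro_bytes.decode("cp1252")   # UTF-8 bytes misread as cp1252
repaired = garbled.encode("cp1252").decode("utf-8")

print(euro_bytes)
print(ascii(garbled))   # '\xe2\u201a\xac', i.e. the characters â ‚ ¬
print(ascii(repaired))  # '\u20ac'
```

The round trip in the third line only works while the garbled text has not been altered further, so it is a diagnostic aid rather than a general repair.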
The objects are being passed through in the HTML and this is where I get lost and can’t be of any further help, but my suggestion would be to check the default encoding assumed by JavaScript, or whatever part of your stack is reading the _E element being sent through. From what I see in the HTML template the charset is defined as UTF-8, but I don’t know if this applies when reading the b64 encoded binaries.