UTF-8 – CP1252 encoding issue in exported HTML report

System Information

OS: Windows 10
Python version: 3.8.10
Python environment: conda
Using jupyter: true
Datapane version: 0.11.11

Bug / Issue

When displaying a pandas dataframe in Datapane as a Table (not DataTable, which does work correctly), euro sign characters (€) display as â‚¬.

This doesn’t happen inside JupyterLab, or when exporting the original dataframe to HTML using df.to_html(). I am calling report.save() rather than upload because I want to generate local HTML reports.

In #9 you mention it could be an issue with Windows’ default encoding not being UTF-8. Are there any steps I should take to fix this?

Thank you!

Issue Analytics

State:
Created 2 years ago
Comments: 6 (3 by maintainers)

Top GitHub Comments

2 reactions

mands commented, Aug 18, 2021

@inigohidalgo thanks for the detailed investigation - that really helped!

There were two different issues here:

1. We were not handling UTF-8 encoding correctly in the browser, as per your investigation, which caused the encoding issues you were seeing with dp.Table.
2. The issue with dp.File turned out to be a Windows / Python compatibility issue: Python defaults to the locale encoding when reading files if none is specified. This is usually OK, but many tools on Windows, including Notepad, now create UTF-8 files by default, so a file can be UTF-8 while Python reads it as windows-1252, which results in the issue both here and in #115. We’ve fixed that now by attempting to autodetect UTF-8 files on Windows and decoding accordingly. (There is actually a PEP to change the default read behaviour; see https://discuss.python.org/t/pep-597-use-utf-8-for-default-text-file-encoding/1819)
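For anyone wondering what "try UTF-8 first, then fall back" can look like in practice, here is a minimal sketch. It is not Datapane's actual fix: the function name decode_text and its fallback parameter are hypothetical, and the approach simply assumes that bytes which decode cleanly as UTF-8 were meant to be UTF-8.

```python
import locale
from typing import Optional


def decode_text(raw: bytes, fallback: Optional[str] = None) -> str:
    """Hypothetical helper: decode bytes as UTF-8 when possible.

    Falls back to the given encoding (or the locale's preferred
    encoding) only when the bytes are not valid UTF-8.
    """
    try:
        # "utf-8-sig" also strips a leading byte order mark, which some
        # Windows editors write at the start of UTF-8 files.
        return raw.decode("utf-8-sig")
    except UnicodeDecodeError:
        encoding = fallback or locale.getpreferredencoding(False)
        return raw.decode(encoding, errors="replace")


utf8_bytes = "5 \u20ac".encode("utf-8")    # euro sign stored as UTF-8
legacy_bytes = "5 \u20ac".encode("cp1252")  # euro sign stored as cp1252

print(decode_text(utf8_bytes))
print(decode_text(legacy_bytes, fallback="cp1252"))
```

Trying UTF-8 first works reasonably well because text written in a legacy single-byte encoding rarely happens to form valid UTF-8 sequences once it contains non-ASCII characters.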

We’ve just released a fix to both of these that will be pushed to pip in a few days.

Thanks again!

1 reaction

inigohidalgo commented, Aug 11, 2021

Some further info. In #115 I saw OP was having this issue with text too whereas text worked fine for me, but he was importing a md file while I was sending the text through directly from Python as strings. Turns out I have the same issue. The following code leads to the output I attach below.

The only text displayed correctly is the one where I manually specified UTF-8 encoding. From what I gather, this is because Python uses locale.getpreferredencoding() to decode by default, which on my machine is cp1252. Maybe JS is doing the same?
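The behaviour described here can be sketched without Datapane at all. The snippet below is only an illustration: the helper name read_both_ways and the file name sample.md are made up, and cp1252 is passed explicitly to stand in for a Windows locale default so that the result is the same on any machine.

```python
import locale
import tempfile
from pathlib import Path


def read_both_ways(text: str, wrong_encoding: str = "cp1252"):
    """Write text as UTF-8, then read it back with two encodings."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "sample.md"  # hypothetical file name
        path.write_text(text, encoding="utf-8")
        # Explicit encoding: matches how the file was written.
        explicit = path.read_text(encoding="utf-8")
        # Simulates what an implicit locale default of cp1252 would do.
        mismatched = path.read_text(encoding=wrong_encoding)
    return explicit, mismatched


# What open() and read_text() would use when no encoding is given.
print("locale default:", locale.getpreferredencoding(False))

explicit, mismatched = read_both_ways("Price: 5 \u20ac")
print(explicit)
print(mismatched)
```

Passing encoding="utf-8" on every read and write removes the dependency on the machine's locale.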

The html tables being generated when saving as temp files are UTF-8 encoded, and the code is then reading the bytes in the line I referenced in this comment.

The way the characters are displayed seems to be entirely due to decoding UTF-8 encoded text using the cp1252 character map: when I read the bytes myself and decode them using cp1252, I get the same output as the wrongly displayed text in the generated report.

It also matches the mismatched character in this list:

https://www.i18nqa.com/debug/utf8-debug.html
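The mapping can also be checked at the byte level. This is a small sketch using only the standard library; the helper name mojibake is hypothetical. The euro sign is three bytes in UTF-8, and cp1252 turns each of those bytes into its own character.

```python
def mojibake(text: str, wrong: str = "cp1252") -> str:
    """Encode as UTF-8, then decode with the wrong single-byte codec."""
    return text.encode("utf-8").decode(wrong)


euro = "\u20ac"  # the euro sign
print(euro.encode("utf-8"))  # the three UTF-8 bytes: E2 82 AC

garbled = mojibake(euro)
print(garbled)
print([f"U+{ord(c):04X}" for c in garbled])

# The damage is reversible as long as every byte survived the round trip.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired == euro)
```

The reverse step only works when all of the original bytes map to defined cp1252 characters, so it is a diagnostic aid rather than a general repair.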

The objects are being passed through in the HTML and this is where I get lost and can’t be of any further help, but my suggestion would be to check the default encoding assumed by JavaScript, or whatever part of your stack is reading the _E element being sent through. From what I see in the HTML template the charset is defined as UTF-8, but I don’t know if this applies when reading the b64 encoded binaries.
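On the base64 question, one general point may help: base64 only transports bytes and carries no charset information, so whatever decodes the payload has to choose an encoding itself. The sketch below shows this in Python with a made-up HTML fragment; it says nothing about how Datapane's front end actually reads the payload.

```python
import base64

# Hypothetical table cell, encoded the way an embedded asset might be.
html = "<td>5 \u20ac</td>"
payload = base64.b64encode(html.encode("utf-8")).decode("ascii")

# Decoding base64 returns raw bytes, not text.
raw = base64.b64decode(payload)

as_utf8 = raw.decode("utf-8")     # same codec that produced the bytes
as_cp1252 = raw.decode("cp1252")  # wrong codec: each byte becomes a char

print(as_utf8)
print(as_cp1252)
```

A charset declared for the surrounding HTML document describes how that document is decoded, so it is plausible that a payload decoded separately in script needs its own explicit UTF-8 decoding step.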
