UTF-8 – CP1252 encoding issue in exported HTML report

System Information

  • OS: Windows 10
  • Python version: 3.8.10
  • Python environment: conda
  • Using jupyter: true
  • Datapane version: 0.11.11

Bug / Issue

When displaying a pandas dataframe in Datapane as a Table (not DataTable, which does work correctly), euro sign characters (€) display as â‚¬.

This doesn’t happen inside JupyterLab, or when exporting the original dataframe to HTML using df.to_html(). I am calling report.save() rather than upload, as I want to generate local HTML reports.

In #9 you mention it could be an issue with Windows’ default encoding not being UTF-8. Are there any steps I should take to fix this?

Thank you!

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

2 reactions
mands commented, Aug 18, 2021

@inigohidalgo thanks for the detailed investigation - that really helped!

There were two different issues here:

  1. We were not handling UTF-8 encoding correctly in the browser, as per your investigation, resulting in the encoding issues you were seeing with dp.Table.
  2. The issue with dp.File actually turned out to be a Windows / Python compatibility issue: Python defaults to the locale encoding when reading a file if no encoding is specified. This is usually OK, but many tools on Windows, including Notepad, now create UTF-8 files by default, so a file can be UTF-8 while Python reads it as windows-1252, resulting in the issue both here and in #115. We’ve fixed that now by attempting to autodetect UTF-8 files on Windows and decoding accordingly. (There is actually a PEP to change the default read behaviour; see https://discuss.python.org/t/pep-597-use-utf-8-for-default-text-file-encoding/1819)
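The second point can be sketched in Python roughly as follows. This is only an illustration of the "try UTF-8 first, then fall back to the locale encoding" idea, not Datapane's actual fix; the helper name `read_text_prefer_utf8` and the sample file are hypothetical.

```python
import locale
import tempfile
from pathlib import Path


def read_text_prefer_utf8(path: Path) -> str:
    """Hypothetical helper: decode as UTF-8 if possible, else use the locale encoding."""
    raw = path.read_bytes()
    try:
        # "utf-8-sig" also strips a leading byte order mark if one is present.
        return raw.decode("utf-8-sig")
    except UnicodeDecodeError:
        # Not valid UTF-8, so fall back to what Python would have used by default.
        return raw.decode(locale.getpreferredencoding(False), errors="replace")


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "note.md"
    # Simulate a tool that writes UTF-8 regardless of the Windows locale.
    sample.write_bytes("Price: 5 €".encode("utf-8"))
    text = read_text_prefer_utf8(sample)

print(text)
```

Trying a strict UTF-8 decode first works reasonably well as a detection step because text written in a legacy single-byte encoding rarely happens to be valid UTF-8 once it contains non-ASCII characters.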

We’ve just released a fix for both of these, which will be pushed to pip in a few days.

Thanks again!

1 reaction
inigohidalgo commented, Aug 11, 2021

Some further info. In #115 I saw OP was having this issue with text too, whereas text worked fine for me, but he was importing an md file while I was sending the text through directly from Python as strings. It turns out I have the same issue. The following code leads to the output I attach below.

[image: the code and its output]

The only text correctly displayed is the one where I manually specified the UTF-8 encoding. This is because, from what I gather, Python uses locale.getpreferredencoding() to decode by default, which on my machine is cp1252, so maybe JS is doing the same?
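A minimal sketch of that check, assuming nothing beyond the standard library: it prints the encoding Python would fall back to on the current machine and then reads a file with the encoding stated explicitly. The file name and sample string are made up for illustration.

```python
import locale
import tempfile
from pathlib import Path

# The encoding used for text files when no encoding argument is given.
# On a Western European Windows install this is typically "cp1252".
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "text.md"
    # Passing encoding= on both write and read removes the dependency on the locale.
    path.write_text("10 €", encoding="utf-8")
    explicit = path.read_text(encoding="utf-8")

print(explicit)
```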

The html tables being generated when saving as temp files are UTF-8 encoded, and the code is then reading the bytes in the line I referenced in this comment.

The way the characters are displaying seems to be 100% due to decoding UTF-8 encoded text using the cp1252 character map: when I read the bytes myself and decode them using cp1252, this is what happens (note the last line):

[image: the bytes decoded with cp1252]

Which as you can see is the same as the wrongly displayed text in the generated report.

[image: the wrongly displayed text in the generated report]

And matches up with the mismatched character from this list:

https://www.i18nqa.com/debug/utf8-debug.html
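The mismatch described above can be reproduced with a few lines of plain Python. This is a standalone illustration using only the euro sign, not the code from the original screenshots.

```python
# The euro sign is three bytes in UTF-8.
euro_bytes = "€".encode("utf-8")
print(euro_bytes)

# Decoding those bytes with cp1252 turns each byte into its own character.
mojibake = euro_bytes.decode("cp1252")
print(mojibake)

# The damage is reversible as long as every character maps back to a cp1252 byte.
repaired = mojibake.encode("cp1252").decode("utf-8")
print(repaired)
```

The round trip in the last step is only a diagnostic aid. A few byte values have no cp1252 mapping in Python, so text containing them cannot be decoded or repaired this way.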

The objects are being passed through in the HTML and this is where I get lost and can’t be of any further help, but my suggestion would be to check the default encoding assumed by JavaScript, or whatever part of your stack is reading the _E element being sent through. From what I see in the HTML template the charset is defined as UTF-8, but I don’t know if this applies when reading the b64 encoded binaries.
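As a rough illustration of the base64 question, written in Python rather than JavaScript and not a reproduction of Datapane's frontend code: base64 only transports bytes, so whoever decodes the payload still has to choose a text encoding, and a charset declared elsewhere on the page does not make that choice for them. The HTML fragment below is a made-up example.

```python
import base64

html_fragment = "<td>5 €</td>"

# Sender side: text -> UTF-8 bytes -> base64 (ASCII-safe for embedding).
payload = base64.b64encode(html_fragment.encode("utf-8")).decode("ascii")

# Receiver side: base64 -> bytes. No encoding information survives this step.
raw = base64.b64decode(payload)

correct = raw.decode("utf-8")   # matches what the sender encoded
wrong = raw.decode("cp1252")    # same bytes, wrong character map

print(correct)
print(wrong)
```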
