Browsers attempting to autodetect encoding leads to Unicode rendering issues in some replayed extractor outputs
Describe the bug
When archiving Danish-language pages from https://politiken.dk/, Unicode characters are rendered incorrectly.
Steps to reproduce
- Archive this page: https://politiken.dk/oekonomi/arbejdsmarked/art5521310/Rengøringsassistenter-blev-tvunget-op-i-ilmarch-med-749-kmt
- Compare the archived version with a live-version in a browser: https://politiken.dk/oekonomi/arbejdsmarked/art5521310/Rengøringsassistenter-blev-tvunget-op-i-ilmarch-med-749-kmt
- The three Danish characters æ, ø and å are rendered incorrectly in the archived version.
Screenshots or log output
archive add 'https://politiken.dk/oekonomi/arbejdsmarked/art5521310/Reng%C3%B8ringsassistenter-blev-tvunget-op-i-ilmarch-med-749-kmt'
[i] [2020-08-14 11:34:08] ArchiveBox v0.4.13: archivebox add https://politiken.dk/oekonomi/arbejdsmarked/art5521310/Reng%C3%B8ringsassistenter-blev-tvunget-op-i-ilmarch-med-749-kmt < /dev/stdin
> /data
[+] [2020-08-14 11:34:09] Adding 1 links to index (crawl depth=0)...
> Saved verbatim input to sources/1597404849-import.txt
> Parsed 1 URLs from input (Plain Text)
> Found 1 new URLs not already in index
[*] [2020-08-14 11:34:09] Writing 2 links to main index...
√ /data/index.sqlite3
√ /data/index.json
√ /data/index.html
[▶] [2020-08-14 11:34:09] Collecting content for 1 Snapshots in archive...
[+] [2020-08-14 11:34:09] "politiken.dk/oekonomi/arbejdsmarked/art5521310/Reng%C3%B8ringsassistenter-blev-tvunget-op-i-ilmarch-med-749-kmt"
https://politiken.dk/oekonomi/arbejdsmarked/art5521310/Reng%C3%B8ringsassistenter-blev-tvunget-op-i-ilmarch-med-749-kmt
> ./archive/1597404849
> title
> favicon
> wget
> singlefile
> pdf
Failed:
Exception Failed to chmod: output.pdf does not exist (did the previous step fail?)
Run to see full output:
cd /data/archive/1597404849;
chromium --headless --no-sandbox --disable-gpu --disable-dev-shm-usage --disable-software-rasterizer "--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36" --window-size=1440,2000 --timeout=60000 --print-to-pdf https://politiken.dk/oekonomi/arbejdsmarked/art5521310/Reng%C3%B8ringsassistenter-blev-tvunget-op-i-ilmarch-med-749-kmt
> screenshot
> dom
> media
> archive_org
[√] [2020-08-14 11:34:38] Update of 1 pages complete (29.46 sec)
- 0 links skipped
- 0 links updated
- 1 links had errors
Hint: To view your archive index, open:
/data/index.html
Or run the built-in webserver:
archivebox server
[*] [2020-08-14 11:34:38] Writing 2 links to main index...
√ /data/index.sqlite3
√ /data/index.json
√ /data/index.html
Software versions
- Using the docker image based on commit: https://github.com/pirate/ArchiveBox/tree/aa085cdb60d835c0c4fc07a4983c328a39cc9292
- ArchiveBox version: v0.4.13
Issue Analytics
- State:
- Created 3 years ago
- Comments:11 (7 by maintainers)
Top GitHub Comments
@MartinMSPedersen no, we’re still thinking about how to solve this by either storing and replaying headers or converting the encoding on-disk to UTF-8.
We’re currently stuck on reproducing the issue reliably, as it only happens when visiting the pages directly, but not when they’re iframed. Our suspicion is that this is a subtle behavior of Chrome’s automatic encoding detection, and our solution will involve nudging Chrome in the right direction or finding out why it autodetects differently depending on whether the content is iframed.
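The two options mentioned above (converting saved files to UTF-8, or giving the browser an explicit hint) could be sketched roughly as follows. This is only an illustration and not ArchiveBox's actual implementation: the helper names `to_utf8` and `ensure_meta_charset` are hypothetical, and the sketch assumes the original charset is already known (for example from the response's Content-Type header).

```python
import re

# Matches an existing charset declaration, either <meta charset="...">
# or <meta http-equiv="Content-Type" content="...; charset=...">.
META_RE = re.compile(r"<meta[^>]+charset\s*=", re.IGNORECASE)
# Matches an opening <head> tag (with or without attributes), but not <header>.
HEAD_RE = re.compile(r"<head(\s[^>]*)?>", re.IGNORECASE)


def to_utf8(raw: bytes, source_charset: str) -> bytes:
    """Hypothetical helper: transcode saved bytes from a known charset to UTF-8."""
    return raw.decode(source_charset).encode("utf-8")


def ensure_meta_charset(html: str, charset: str = "utf-8") -> str:
    """Hypothetical helper: add a <meta charset> hint if the page declares none."""
    if META_RE.search(html):
        return html  # the page already declares an encoding; leave it alone
    tag = f'<meta charset="{charset}">'
    match = HEAD_RE.search(html)
    if match:
        # Insert right after the opening <head> tag so the hint appears early.
        return html[:match.end()] + tag + html[match.end():]
    # No <head> found: prepend the hint to the document.
    return tag + html
```

If a file is transcoded and already contains a declaration naming the old charset, that declaration would also need updating; the sketch does not handle that case.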
My current versions:
In my case it looks like these outputs are using the wrong encoding for the sixcolors.com URL (as viewed in Safari on macOS):
These show expected output:
All three of the bad HTML documents show document.characterSet as "windows-1252". The rest show "UTF-8".
I think what might be happening is that the original page has the encoding information set in a response header (my example URL definitely includes content-type: text/html; charset=UTF-8), but when the HTML body is saved locally without headers, that explicit encoding information is lost. At that point it’s up to the renderer to guess the encoding, and some are getting it wrong.
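To illustrate what a wrong guess looks like, here is a minimal sketch that decodes UTF-8 bytes as windows-1252, the encoding reported by document.characterSet for the bad outputs. The function name `misdecode` is made up for this example; it only mimics the misinterpretation and is not what any browser literally runs.

```python
def misdecode(text: str) -> str:
    """Encode text as UTF-8, then decode the bytes as windows-1252 (cp1252),
    imitating a renderer that guessed the wrong encoding."""
    raw = text.encode("utf-8")
    return raw.decode("cp1252")


if __name__ == "__main__":
    # Each Danish letter is two bytes in UTF-8, so each one turns into
    # two unrelated windows-1252 characters.
    print(misdecode("æ ø å"))  # prints: Ã¦ Ã¸ Ã¥
```

Plain ASCII text survives this round trip unchanged, which is why only the characters æ, ø and å look wrong in the archived page.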