Browsers attempting to autodetect encoding leads to Unicode rendering issues in some replayed extractor outputs

Describe the bug

When archiving Danish-language pages from https://politiken.dk/, Unicode characters are rendered incorrectly.

Steps to reproduce

  1. Archive this page: https://politiken.dk/oekonomi/arbejdsmarked/art5521310/Rengøringsassistenter-blev-tvunget-op-i-ilmarch-med-749-kmt
  2. Compare the archived version with the live version in a browser: https://politiken.dk/oekonomi/arbejdsmarked/art5521310/Rengøringsassistenter-blev-tvunget-op-i-ilmarch-med-749-kmt
  3. The three Danish characters æ, ø and å are rendered incorrectly in the archived version.

Screenshots or log output

archive add 'https://politiken.dk/oekonomi/arbejdsmarked/art5521310/Reng%C3%B8ringsassistenter-blev-tvunget-op-i-ilmarch-med-749-kmt'
[i] [2020-08-14 11:34:08] ArchiveBox v0.4.13: archivebox add https://politiken.dk/oekonomi/arbejdsmarked/art5521310/Reng%C3%B8ringsassistenter-blev-tvunget-op-i-ilmarch-med-749-kmt < /dev/stdin
    > /data

[+] [2020-08-14 11:34:09] Adding 1 links to index (crawl depth=0)...
    > Saved verbatim input to sources/1597404849-import.txt
    > Parsed 1 URLs from input (Plain Text)
    > Found 1 new URLs not already in index

[*] [2020-08-14 11:34:09] Writing 2 links to main index...
    √ /data/index.sqlite3
    √ /data/index.json
    √ /data/index.html

[▶] [2020-08-14 11:34:09] Collecting content for 1 Snapshots in archive...

[+] [2020-08-14 11:34:09] "politiken.dk/oekonomi/arbejdsmarked/art5521310/Reng%C3%B8ringsassistenter-blev-tvunget-op-i-ilmarch-med-749-kmt"
    https://politiken.dk/oekonomi/arbejdsmarked/art5521310/Reng%C3%B8ringsassistenter-blev-tvunget-op-i-ilmarch-med-749-kmt
    > ./archive/1597404849
      > title
      > favicon
      > wget
      > singlefile
      > pdf
        Failed:
            Exception Failed to chmod: output.pdf does not exist (did the previous step fail?)
        Run to see full output:
            cd /data/archive/1597404849;
            chromium --headless --no-sandbox --disable-gpu --disable-dev-shm-usage --disable-software-rasterizer "--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36" --window-size=1440,2000 --timeout=60000 --print-to-pdf https://politiken.dk/oekonomi/arbejdsmarked/art5521310/Reng%C3%B8ringsassistenter-blev-tvunget-op-i-ilmarch-med-749-kmt

      > screenshot
      > dom
      > media
      > archive_org

[√] [2020-08-14 11:34:38] Update of 1 pages complete (29.46 sec)
    - 0 links skipped
    - 0 links updated
    - 1 links had errors

    Hint: To view your archive index, open:
        /data/index.html
    Or run the built-in webserver:
        archivebox server

[*] [2020-08-14 11:34:38] Writing 2 links to main index...
    √ /data/index.sqlite3
    √ /data/index.json
    √ /data/index.html

Software versions

Issue Analytics

  • State: open
  • Created 3 years ago
  • Comments: 11 (7 by maintainers)

Top GitHub Comments

1 reaction
pirate commented, Sep 22, 2020

@MartinMSPedersen no, we’re still thinking about how to solve this by either storing and replaying headers or converting the encoding on-disk to UTF-8.

We’re currently stuck on reproducing the issue reliably, as it only happens when visiting the pages directly, not when they’re iframed. Our suspicion is that this is a subtle behavior of Chrome’s automatic encoding detection, and our solution will involve either nudging Chrome in the right direction or finding out why it autodetects differently depending on whether the content is iframed.
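
For illustration only, the "convert the encoding on-disk to UTF-8" option could look roughly like the Python sketch below. This is not ArchiveBox code: the function names are hypothetical, and it assumes the original `Content-Type` header value was recorded at archive time. Prepending a UTF-8 BOM is an extra assumption of this sketch, chosen because browsers generally treat a BOM as an encoding declaration, so the saved file no longer depends on autodetection.

```python
import codecs
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset parameter of a Content-Type header value.

    Hypothetical helper; falls back to `default` when no charset is given.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset() or default


def to_self_describing_utf8(raw, content_type):
    """Decode `raw` using the header's charset and re-encode as UTF-8 with a BOM.

    Hypothetical helper; undecodable bytes are replaced rather than raising.
    """
    text = raw.decode(charset_from_content_type(content_type), errors="replace")
    # Avoid writing a second BOM if the decoded text already starts with one.
    if text.startswith("\ufeff"):
        text = text[1:]
    return codecs.BOM_UTF8 + text.encode("utf-8")
```

For example, `to_self_describing_utf8(body_bytes, "text/html; charset=ISO-8859-1")` would return UTF-8 bytes that start with the BOM, where `body_bytes` stands in for the saved response body.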

1 reaction
rfletcher commented, Sep 10, 2020

My current versions:

  • ArchiveBox 0.4.21
  • Safari 13.1.2 (latest)
  • macOS 10.15.6 (latest)

In my case it looks like these outputs are using the wrong encoding for the sixcolors.com URL (as viewed in Safari on macOS):

  • ❌ Wget WARC
  • ❌ Chrome HTML
  • ❌ Readability

These show expected output:

  • ✅ Chrome SingleFile
  • ✅ Archive.org
  • ✅ Original
  • ✅ Chrome PDF
  • ✅ Chrome screenshot

All three of the bad HTML documents show document.characterSet as "windows-1252". The rest show "UTF-8".

I think what might be happening is that the original page declares its encoding in a response header (my example URL definitely includes content-type: text/html; charset=UTF-8), but when the HTML body is saved locally without headers, that explicit encoding information is lost. At that point it’s up to the renderer to guess the encoding, and some renderers guess wrong.
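
The effect of such a wrong guess can be reproduced in a few lines of Python. This is only a sketch of the suspected mechanism, using the three Danish characters from the report: it encodes them as UTF-8 (what the server sent) and then decodes the same bytes as windows-1252 (what `document.characterSet` reported for the bad outputs).

```python
# The characters from the bug report, as the server sends them (UTF-8 bytes).
original = "æ ø å"
raw = original.encode("utf-8")

# What a renderer shows if it guesses windows-1252 instead of UTF-8.
misread = raw.decode("windows-1252")
print(misread)  # → Ã¦ Ã¸ Ã¥

# The bytes themselves are intact: undoing the wrong guess recovers the text.
recovered = misread.encode("windows-1252").decode("utf-8")
print(recovered == original)  # → True
```

Because the bytes on disk are unchanged, the garbling happens at render time only, which fits with the SingleFile, PDF and screenshot outputs looking correct.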
