Chinese filename decode


File: 中文测试.zip

The zip file contains 中文测试.md. When I pass decodeStrings: true, the result is: [image]

When I pass decodeStrings: false, the error The "path" argument must be of type string is thrown.

Issue Analytics

  • State: open
  • Created 5 years ago
  • Comments: 12 (6 by maintainers)

Top GitHub Comments

5 reactions
thejoshwolfe commented, Oct 5, 2018

I did some research into Info-ZIP’s charset detection code, and in the absence of General Purpose Bit 11, Info-ZIP uses a different charset depending on the operating system. It will only use CP437 as required by the spec on some platforms, presumably DOS. However, on Linux and Mac, Info-ZIP will simply always use UTF-8 for decoding file paths, because UTF-8 is the “native” charset on those platforms, whatever that means. This suggests it’s safe for yauzl to drop support for CP437 and just use UTF-8 in all situations as well. 🤔
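For reference, General Purpose Bit 11 is the ZIP header flag (mask 0x0800) that marks an entry's name and comment as UTF-8. A minimal sketch of testing it follows. The helper name isFlaggedUtf8 is made up for illustration, and it assumes the flag word is available as a plain number (yauzl appears to expose it as entry.generalPurposeBitFlag, but check the README for your version).

```typescript
// Bit 11 of the general purpose bit flag: name and comment are UTF-8.
const UTF8_NAME_FLAG = 0x0800;

// Hypothetical helper: true when the archive itself declares UTF-8 names.
// When this returns false, the encoding has to be assumed or guessed.
function isFlaggedUtf8(generalPurposeBitFlag: number): boolean {
    return (generalPurposeBitFlag & UTF8_NAME_FLAG) !== 0;
}
```

Archives like the one in this issue are the case where the flag is unset even though the names are UTF-8, which is why a fallback policy matters.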

2 reactions
rossj commented, May 6, 2018

@imcuttle I have a need to handle similar not-so-standard .zip files in my application, and I wanted to share my heuristic solution.

If you only need to deal with this file and similar files that are always UTF-8 (even if they don’t indicate this), you can use the decodeStrings: false option and convert the file names to strings yourself. Your The "path" argument must be of type string error is likely coming from other code downstream that expects a string, so you probably need to do the Buffer -> string conversion before that point.
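A minimal sketch of that Buffer -> string conversion, assuming the names really are UTF-8. The helper name decodeUtf8Name is hypothetical. It takes a Uint8Array, so a Node Buffer (which is a Uint8Array subclass) can be passed in directly, and it uses the standard TextDecoder in fatal mode so that non-UTF-8 bytes throw instead of being silently replaced.

```typescript
// Hypothetical helper: decode a raw file-name buffer as strict UTF-8.
// Throws a TypeError if the bytes are not valid UTF-8.
function decodeUtf8Name(raw: Uint8Array): string {
    return new TextDecoder("utf-8", { fatal: true }).decode(raw);
}

// Example bytes: the UTF-8 encoding of "中文.md".
const rawName = new Uint8Array([0xe4, 0xb8, 0xad, 0xe6, 0x96, 0x87, 0x2e, 0x6d, 0x64]);
const decodedName = decodeUtf8Name(rawName);
```

With decodeStrings: false, the idea would be to run each entry's fileName through a helper like this before handing it to anything that expects a path string.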

In my case, it is a bit more complicated, as I need to simultaneously handle zip files that are UTF-8 (with and without the proper bit being set), as well as files that are CP437 encoded. My solution is to use decodeStrings: false, collect all of the ZipEntries and fileName Buffers, and then to inspect these name Buffers to try and guess the proper encoding.

Specifically, I use the code in this gist to get some information on the name Buffers, followed by this logic:

const aggs = checkStringBufs(entries.map(entry => entry.fileName as Buffer));

let encoding: string;
if (aggs.allAsciiChar) {
    // utf8 is backwards compatible with ascii
    encoding = 'utf8';
} else if (aggs.all7Bit) {
    // Hmmm, no high bits but some control chars, probably cp437
    encoding = 'cp437';
} else if (aggs.validUtf8) {
    // Some high bits set, but seems to be UTF-8
    encoding = 'utf8';
} else {
    // Some high bits set, but not UTF-8!
    encoding = 'cp437';
}

This has been working well for the .zip files that I deal with.
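The gist itself is not reproduced here, so the following is only a guess at what such a helper could look like, not the gist's actual code. The name checkStringBufsSketch and the exact meaning of each flag are assumptions inferred from the comments in the logic above: allAsciiChar means every byte is printable ASCII, all7Bit means no byte has the high bit set, and validUtf8 means every buffer decodes as strict UTF-8.

```typescript
// Assumed shape of the aggregate result used by the encoding heuristic.
interface NameStats {
    allAsciiChar: boolean; // every byte is printable ASCII (0x20-0x7E)
    all7Bit: boolean;      // every byte is below 0x80
    validUtf8: boolean;    // every buffer decodes as strict UTF-8
}

// Strict UTF-8 check using the standard TextDecoder in fatal mode.
function isValidUtf8(buf: Uint8Array): boolean {
    try {
        new TextDecoder("utf-8", { fatal: true }).decode(buf);
        return true;
    } catch (err) {
        return false;
    }
}

// Hypothetical stand-in for the gist's checkStringBufs.
// Node Buffers are Uint8Arrays, so fileName Buffers can be passed as-is.
function checkStringBufsSketch(bufs: Uint8Array[]): NameStats {
    const stats: NameStats = { allAsciiChar: true, all7Bit: true, validUtf8: true };
    for (let i = 0; i < bufs.length; i++) {
        const buf = bufs[i];
        for (let j = 0; j < buf.length; j++) {
            const byte = buf[j];
            if (byte >= 0x80) stats.all7Bit = false;
            if (byte < 0x20 || byte > 0x7e) stats.allAsciiChar = false;
        }
        if (!isValidUtf8(buf)) stats.validUtf8 = false;
    }
    return stats;
}
```

Because the flags are aggregated over all names in the archive, one encoding is chosen per zip file rather than per entry, which matches how the logic above picks a single encoding value.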
