Chinese filename decode

File: 中文测试.zip

The zip file contains 中文测试.md. When I pass decodeStrings: true, the result is

When I pass decodeStrings: false, the error 'The "path" argument must be of type string' is thrown.

Issue Analytics

Created 5 years ago
Comments: 12 (6 by maintainers)

Top GitHub Comments

5 reactions

thejoshwolfe commented, Oct 5, 2018

I did some research into Info-ZIP’s charset detection code, and in the absence of General Purpose Bit 11, Info-ZIP uses a different charset depending on the operating system. It will only use CP437 as required by the spec on some platforms, presumably DOS. However, on Linux and Mac, Info-ZIP will simply always use UTF-8 for decoding file paths, because UTF-8 is the “native” charset on those platforms, whatever that means. This suggests it’s safe for yauzl to drop support for CP437 and just use UTF-8 in all situations as well. 🤔
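To make the role of General Purpose Bit 11 concrete, here is a minimal sketch of the spec-mandated choice between UTF-8 and CP437. The helper names hasUtf8Flag and specEncoding are hypothetical and are not part of yauzl or Info-ZIP; the only fact assumed is that bit 11 (0x0800) of the general purpose bit flag marks the name as UTF-8.

```typescript
// Bit 11 of the general purpose bit flag is the "language encoding" flag:
// when set, the file name and comment are UTF-8.
const UTF8_FLAG = 0x0800;

// Hypothetical helper: report whether the archive declared UTF-8 names.
function hasUtf8Flag(generalPurposeBitFlag: number): boolean {
    return (generalPurposeBitFlag & UTF8_FLAG) !== 0;
}

// Hypothetical helper: the encoding the spec prescribes for a given flag value.
// Without bit 11 the spec falls back to CP437, which is the fallback the
// comment above suggests replacing with UTF-8.
function specEncoding(generalPurposeBitFlag: number): 'utf8' | 'cp437' {
    return hasUtf8Flag(generalPurposeBitFlag) ? 'utf8' : 'cp437';
}
```

With yauzl, the flag value would come from entry.generalPurposeBitFlag. An archive like the one in this issue stores UTF-8 bytes without setting the bit, which is why the spec-mandated CP437 fallback garbles the name.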

2 reactions

rossj commented, May 6, 2018

@imcuttle I have a need to handle similar not-so-standard .zip files in my application, and I wanted to share my heuristic solution.

If you only need to deal with this file and similar files that are always UTF-8 (even if they don’t indicate this), you can use the decodeStrings: false option and convert the file names to strings yourself. Your 'The "path" argument must be of type string' error is likely coming from some other code downstream that expects a string. You probably need to do the Buffer -> string conversion before that point.
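A minimal sketch of that conversion, assuming the names really are UTF-8. With decodeStrings: false, yauzl returns entry.fileName as a Buffer, so it has to become a string before it reaches path functions. The helpers entryNameToString and outputPathFor, and the 'out' directory, are hypothetical names used only for illustration.

```typescript
import { Buffer } from 'node:buffer';
import * as path from 'node:path';

// Hypothetical helper (not part of yauzl): normalise an entry name to a string.
// Assumes the raw bytes are UTF-8, as in the file from this issue.
function entryNameToString(fileName: Buffer | string): string {
    return typeof fileName === 'string' ? fileName : fileName.toString('utf8');
}

// Hypothetical helper: build an output path; 'out' is a placeholder directory.
function outputPathFor(fileName: Buffer | string, outDir = 'out'): string {
    // path.join only accepts strings, so convert before calling it.
    return path.join(outDir, entryNameToString(fileName));
}
```

Calling path.join directly with the Buffer is the kind of downstream call that would raise the "path" argument error.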

In my case, it is a bit more complicated, as I need to simultaneously handle zip files that are UTF-8 (with and without the proper bit being set), as well as files that are CP437 encoded. My solution is to use decodeStrings: false, collect all of the ZipEntries and fileName Buffers, and then to inspect these name Buffers to try and guess the proper encoding.

Specifically, I use the code in this gist to get some information on the name Buffers, followed by this logic:

const aggs = checkStringBufs(entries.map(entry => entry.fileName as Buffer));

let encoding: string;
if (aggs.allAsciiChar) {
    // utf8 is backwards compatible with ascii
    encoding = 'utf8';
} else if (aggs.all7Bit) {
    // Hmmm, no high bits but some control chars, probably cp437
    encoding = 'cp437';
} else if (aggs.validUtf8) {
    // Some high bits set, but seems to be UTF-8
    encoding = 'utf8';
} else {
    // Some high bits set, but not UTF-8!
    encoding = 'cp437';
}

This has been working well for the .zip files that I deal with.
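The gist itself is not reproduced here, so the following is only a guess at what checkStringBufs might compute, inferred from the field names and comments in the logic above. The NameStats interface and the exact definitions (printable ASCII as 0x20 to 0x7E, validity checked with a strict TextDecoder) are assumptions, not the gist's actual code.

```typescript
import { Buffer } from 'node:buffer';
import { TextDecoder } from 'node:util';

// Assumed shape of the aggregate result; the field names come from the logic above.
interface NameStats {
    allAsciiChar: boolean; // every byte is printable ASCII (0x20-0x7E)
    all7Bit: boolean;      // no byte has the high bit set
    validUtf8: boolean;    // every buffer decodes as strict UTF-8
}

function checkStringBufs(bufs: Buffer[]): NameStats {
    const stats: NameStats = { allAsciiChar: true, all7Bit: true, validUtf8: true };
    for (const buf of bufs) {
        buf.forEach((byte) => {
            if (byte >= 0x80) stats.all7Bit = false;
            if (byte < 0x20 || byte > 0x7e) stats.allAsciiChar = false;
        });
        try {
            // fatal: true makes decode() throw on malformed UTF-8
            // instead of substituting U+FFFD.
            new TextDecoder('utf-8', { fatal: true }).decode(buf);
        } catch {
            stats.validUtf8 = false;
        }
    }
    return stats;
}
```

Note that Node's Buffer has no built-in CP437 decoder, so acting on a 'cp437' result needs a third-party library such as iconv-lite.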
