unzip Russian files
Issue Description
What can’t you do right now?
In Russia, file names inside zip files are often encoded with cp866. fflate currently decodes such file names incorrectly. The best I can do is
new TextDecoder('cp866').decode(strToU8(file.name))
but it produces correct characters interleaved with some gibberish.
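One plausible explanation for the interleaved gibberish, assuming the library decoded the non-UTF-8 name byte for byte as Latin-1 (an assumption, not something confirmed in this issue): re-encoding that string as UTF-8 turns every byte above 0x7F into two bytes, and only some of them survive the cp866 decode. Under that assumption, mapping each character back to a single byte recovers the raw name. The helpers `latin1ToBytes` and `decodeCp866Name` below are hypothetical names, and `TextDecoder('cp866')` needs a runtime that ships this legacy encoding.

```javascript
// Hypothetical helper: undo a byte-for-byte (Latin-1) decode.
// Assumes every character of `name` has a code point <= 0xFF.
function latin1ToBytes(name) {
  const bytes = new Uint8Array(name.length);
  for (let i = 0; i < name.length; i++) {
    bytes[i] = name.charCodeAt(i) & 0xff;
  }
  return bytes;
}

// Hypothetical helper: reinterpret the recovered bytes as cp866.
function decodeCp866Name(name) {
  return new TextDecoder('cp866').decode(latin1ToBytes(name));
}

// The cp866 bytes of "привет", as they would look after a Latin-1 decode.
const mangled = String.fromCharCode(0xaf, 0xe0, 0xa8, 0xa2, 0xa5, 0xe2);
console.log(decodeCp866Name(mangled));
```

If the Latin-1 assumption holds, passing `true` as the second (`latin1`) argument of fflate's `strToU8` should give the same bytes as `latin1ToBytes`.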
An optimal solution
Either provide the raw name in UnzipFile
{
  name: string, // as it is decoded now
  rawName: {
    bytes: Uint8Array,
    isUTF8: boolean
  },
  ondata: AsyncFlateStreamHandler,
  ...
}
or make it possible to provide a fallback encoding for entries that are not marked as UTF-8.
const unzip = new Unzip();
unzip.setFallbackEncoding('cp866');
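A minimal sketch of what such a fallback could do internally, assuming the raw bytes and the UTF-8 flag are available in the shape of the rawName proposal. `decodeEntryName` and its parameters are hypothetical and not part of fflate.

```javascript
// Hypothetical: decode a zip entry name from its raw bytes.
// `rawName` follows the proposed shape { bytes: Uint8Array, isUTF8: boolean }.
function decodeEntryName(rawName, fallbackEncoding) {
  // Entries flagged as UTF-8 ignore the fallback entirely.
  const label = rawName.isUTF8 ? 'utf-8' : fallbackEncoding;
  return new TextDecoder(label).decode(rawName.bytes);
}

// The cp866 bytes of "привет", not flagged as UTF-8.
const cp866Name = {
  bytes: Uint8Array.of(0xaf, 0xe0, 0xa8, 0xa2, 0xa5, 0xe2),
  isUTF8: false
};
// A name that is flagged as UTF-8 and must be left alone.
const utf8Name = {
  bytes: new TextEncoder().encode('привет.txt'),
  isUTF8: true
};

console.log(decodeEntryName(cp866Name, 'cp866'));
console.log(decodeEntryName(utf8Name, 'cp866'));
```

Keeping the decision per entry means archives that mix flagged and unflagged names still decode consistently.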
(How) is this done by other libraries?
jszip also fails to decode it correctly.
Ubuntu has
unzip -O cp866
starting from some version; before that version, I believe it had a hack that used cp866 automatically when it saw a Russian locale in the OS.
A browser equivalent of that hack would be checking
navigator.language == 'ru-RU'
if you are willing to use that approach.
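A rough sketch of that locale check, assuming the only goal is to pick a fallback encoding for Russian users. `fallbackEncodingFor` is a hypothetical helper, and the single `ru` mapping is the only one taken from this issue. It compares the primary language subtag, since browsers may report `ru` as well as `ru-RU`.

```javascript
// Hypothetical: map a language tag such as "ru-RU" to a fallback encoding.
// Returns null when there is no opinion, so the caller keeps its default.
function fallbackEncodingFor(language) {
  const primary = String(language || '').toLowerCase().split('-')[0];
  return primary === 'ru' ? 'cp866' : null;
}

// In a browser this reads the user's preferred language;
// elsewhere the sketch simply passes undefined and gets null back.
const detected = typeof navigator !== 'undefined' ? navigator.language : undefined;
console.log(fallbackEncodingFor(detected));
```

This stays a guess: a Russian-speaking user can open archives created elsewhere, so an explicit override is still worth keeping.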
https://github.com/vlm/zip-fix-filename-encoding/blob/master/src/runzip.c might help a little bit: it tries to guess the encoding from character frequencies.
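For illustration only, here is a much simpler heuristic in the same spirit. It is not a port of the linked C code: it decodes the bytes with each candidate encoding and keeps the one that yields the largest share of Russian letters. The candidate list and the names `russianLetterShare` and `guessEncoding` are assumptions made for this sketch.

```javascript
// Candidate legacy encodings to try; the list is an illustrative assumption.
const CANDIDATES = ['ibm866', 'windows-1251', 'koi8-r'];

// Share of characters that are Russian letters (А-я, Ё, ё).
function russianLetterShare(text) {
  if (text.length === 0) return 0;
  let hits = 0;
  for (const ch of text) {
    if (/[А-яЁё]/.test(ch)) hits++;
  }
  return hits / text.length;
}

// Return the candidate whose decoding looks most like Russian text.
// On a tie, the earlier candidate in the list wins.
function guessEncoding(bytes) {
  let best = null;
  let bestScore = -1;
  for (const label of CANDIDATES) {
    const score = russianLetterShare(new TextDecoder(label).decode(bytes));
    if (score > bestScore) {
      best = label;
      bestScore = score;
    }
  }
  return best;
}

// The cp866 bytes of "привет".
console.log(guessEncoding(Uint8Array.of(0xaf, 0xe0, 0xa8, 0xa2, 0xa5, 0xe2)));
```

Short names can tie between encodings under this scoring, which is presumably why real detectors weigh individual letter frequencies instead of a plain letter count.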
Also, there are some test files that might be useful: https://github.com/Stuk/jszip/tree/master/test/ref In particular,
local_encoding_in_name.zip
has Russian file names inside; I think they are encoded with cp866, according to the jszip tests.
I was probably wrong about jszip in the first comment: apparently they handle it somehow (or at least they have tests for it), yet for some of my files jszip produces wrong file names. And I definitely have cp866.
I think it is tempting for archiver authors to use one-byte encodings to save a few more bytes, so the problem is not going away any time soon.
But yeah, putting the problem on the user is fine by me too.