Problems when dealing with invalidly-encoded filenames
- Operating System: Debian 9
- Node.js version: 8.9.3
- fs-extra version: 5.0.0
Hi there. I ran into some cases where remove() was unable to remove a directory due to filename encoding issues. I believe there are similar issues with the empty, copy, and move operations (and their sync counterparts - basically anything that relies on fs.readdir / fs.readdirSync).
My issue arose when trying to fs.remove() some directories that were created by an unzip operation. During remove's / rimraf's tree walk, some of the returned directories seemed not to exist (although they did), causing the final unlink operation to fail (since the parent directory had not actually been emptied).
It seems that, in general, names on a file system are just byte sequences, which are not guaranteed to decode to valid strings. As a result, the bytes -> string -> bytes round trip that happens when listing and then operating on items in a directory using Node does not always produce the same file name that was read.
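To make the round trip concrete, here is a minimal sketch using only Buffer. The byte sequence is a made-up example name (not one from my actual directories): "caf" followed by 0xE9, which is "é" in Latin-1 but not valid UTF-8.

```javascript
// A file name as raw bytes: "caf" followed by 0xE9 (invalid as UTF-8).
const original = Buffer.from([0x63, 0x61, 0x66, 0xe9]);

// bytes -> string: the invalid byte is replaced with U+FFFD.
const asString = original.toString('utf8');

// string -> bytes: U+FFFD is encoded as EF BF BD, so the name has changed.
const roundTripped = Buffer.from(asString, 'utf8');

console.log(original.toString('hex'));      // 636166e9
console.log(roundTripped.toString('hex'));  // 636166efbfbd
console.log(original.equals(roundTripped)); // false
```

Any later fs call made with the decoded string looks for the second byte sequence, which does not exist on disk.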
This encoding problem has been a known Node issue for a while, which is why an option was added to return Buffers from fs.readdir. My suggestion is to update the affected methods to use this Buffer option. I'm happy to work on a PR, but I wanted to at least get some feedback and discuss the issue before diving in.
Here are a couple of Node issues relating to the file name encoding problem:
https://github.com/nodejs/node-v0.x-archive/issues/2387
https://github.com/nodejs/node/pull/5616
Thanks!
@rossj @RyanZim, bringing this issue back up, because we face the same problem with fs.cp() in Node.js.

I've been working on a port of Node.js' path methods that work on Buffers:
https://github.com/bcoe/path-buffer
I've made an effort to detect utf8 vs. utf16, so that the appropriate separator is added or removed by methods like join and dirname, but I'm not an expert at string encodings, so it would be good to have someone who's bumped into the issue confirm the logic is sound.

Ah, I was thinking of not filtering out non-UTF8 names and just sending whatever string we get from the UTF8 conversion to the filter function. I'm pretty sure that Buffer.toString() will insert U+FFFD � for invalid UTF-8 sequences instead of failing. Continuing to send these potentially incorrect strings to the filter function is no worse than the current situation, and it allows string-based filtering of all files (regardless of whether they are UTF8 or not) if the user only cares about ASCII, e.g. return src.indexOf('thing') >= 0.
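A small sketch of that idea, assuming the names arrive as Buffers from a Buffer-based readdir. The names and the filter below are made up for illustration: the name is decoded only for the user's filter call, while the original bytes are kept for the file system operations.

```javascript
// Hypothetical user filter that only cares about ASCII text.
const filter = (src) => src.indexOf('thing') >= 0;

// Raw entry names as a Buffer-based readdir might return them.
// The second one is "thing", then 0xE9 (invalid UTF-8 here), then ".txt".
const names = [
  Buffer.from('other.txt'),
  Buffer.from([0x74, 0x68, 0x69, 0x6e, 0x67, 0xe9, 0x2e, 0x74, 0x78, 0x74]),
  Buffer.from('something.txt'),
];

// Decode only for the filter; keep the original bytes for fs calls.
const kept = names.filter((name) => filter(name.toString('utf8')));

console.log(kept.length); // 2
console.log(kept[0].toString('utf8').includes('\uFFFD')); // true
```

The filter sees a lossy string for the second entry, but the ASCII match still works and the Buffer that gets copied or removed is unchanged.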