Reading UTF-8 encoded buffers results in decoding errors
If you pass a UTF-8 encoded buffer to `read` with the options `{type: 'buffer', codepage: 65001}`, the cell values contain Unicode replacement characters in place of non-ASCII characters.
In version 0.16.8, this only happened if the buffer started with a BOM. Now, it always happens.
The reason is that `prn_to_sheet` tries to decode buffers twice:
```js
function prn_to_sheet(d, opts) {
	var str = "", bytes = opts.type == 'string' ? [0,0,0,0] : firstbyte(d, opts);
	switch(opts.type) {
		case 'base64': str = Base64.decode(d); break;
		case 'binary': str = d; break;
		case 'buffer':
			if(opts.codepage == 65001) str = d.toString('utf8'); // TODO: test if buf
			else if(opts.codepage && typeof cptable !== 'undefined') str = cptable.utils.decode(opts.codepage, d);
			else str = has_buf && Buffer.isBuffer(d) ? d.toString('binary') : a2s(d);
			break;
		case 'array': str = cc2str(d); break;
		case 'string': str = d; break;
		default: throw new Error("Unrecognized type " + opts.type);
	}
	if(bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF) str = utf8read(str.slice(3));
	else if(opts.type != 'string' && opts.codepage == 65001) str = utf8read(str);
	else if((opts.type == 'binary') && typeof cptable !== 'undefined' && opts.codepage) str = cptable.utils.decode(opts.codepage, cptable.utils.encode(28591,str));
	if(str.slice(0,19) == "socialcalc:version:") return ETH.to_sheet(opts.type == 'string' ? str : utf8read(str), opts);
	return prn_to_sheet_str(str, opts);
}
```
The switch takes the `buffer` case, which calls `.toString('utf8')`, so `str` already holds the properly decoded string. The function should simply go on to call `prn_to_sheet_str`. Instead, it calls `utf8read`, which reinterprets the already-decoded string and effectively corrupts it. I guess `utf8read` is meant for `opts.type === 'binary'`, but it is also applied to `buffer` input, and additionally whenever the data starts with a BOM.
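The corruption can be reproduced without xlsx.js at all. The sketch below models the double decode: `utf8read` effectively treats each code unit of an already-decoded string as a raw byte and runs UTF-8 decoding again (here simulated with `Buffer.from(str, 'latin1')`, a simplified stand-in for the actual `utf8read`):

```javascript
// A UTF-8 buffer containing a non-ASCII character ('é' = 0xC3 0xA9).
const buf = Buffer.from('héllo', 'utf8');

// Step 1: the 'buffer' case decodes correctly.
const str = buf.toString('utf8'); // 'héllo'

// Step 2 (the bug): decode the decoded string again. 'é' (U+00E9) maps
// to the lone byte 0xE9, which is not a valid UTF-8 sequence, so the
// decoder emits the replacement character U+FFFD.
const secondPass = Buffer.from(str, 'latin1').toString('utf8');

console.log(str);        // héllo
console.log(secondPass); // h�llo
```

This is exactly the symptom from the report: non-ASCII characters turn into U+FFFD while ASCII characters survive, because ASCII is a fixed point of UTF-8 decoding.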
Issue Analytics
- State:
- Created 2 years ago
- Reactions:1
- Comments:5 (5 by maintainers)
I've been trying to figure this one out for a few hours now. Based on your description, a temporary workaround for my case was adding `opts.type != 'buffer'` to line 7904 of xlsx.js. https://github.com/SheetJS/sheetjs/commit/92d8a38ef62bdaee55a5d5d109ca22515fe215be
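A minimal sketch of the guard that workaround describes, extracted into a hypothetical helper (not part of xlsx.js; the option names match the snippet above):

```javascript
// Hypothetical helper modeling the patched condition: skip the second
// UTF-8 decode when the 'buffer' case has already produced a decoded
// string via d.toString('utf8').
function shouldUtf8Read(opts) {
  // original: opts.type != 'string' && opts.codepage == 65001
  // patched:  also exclude 'buffer'
  return opts.type != 'string' && opts.type != 'buffer' && opts.codepage == 65001;
}

console.log(shouldUtf8Read({ type: 'buffer', codepage: 65001 })); // false
console.log(shouldUtf8Read({ type: 'binary', codepage: 65001 })); // true
```

With this guard, `binary` input (where `str` still holds raw bytes as characters) is decoded once, while `buffer` input is left alone.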