Reading UTF-8 encoded buffers results in decoding errors
If you pass a UTF-8 encoded buffer to `read` with the options `{type: 'buffer', codepage: 65001}`, the cell values contain Unicode replacement characters in place of non-ASCII characters.
In version 0.16.8, this only happened if the buffer started with a BOM. Now, it always happens.
The reason is that `prn_to_sheet` tries to decode buffers twice:
```js
function prn_to_sheet(d, opts) {
	var str = "", bytes = opts.type == 'string' ? [0,0,0,0] : firstbyte(d, opts);
	switch(opts.type) {
		case 'base64': str = Base64.decode(d); break;
		case 'binary': str = d; break;
		case 'buffer':
			if(opts.codepage == 65001) str = d.toString('utf8'); // TODO: test if buf
			else if(opts.codepage && typeof cptable !== 'undefined') str = cptable.utils.decode(opts.codepage, d);
			else str = has_buf && Buffer.isBuffer(d) ? d.toString('binary') : a2s(d);
			break;
		case 'array': str = cc2str(d); break;
		case 'string': str = d; break;
		default: throw new Error("Unrecognized type " + opts.type);
	}
	if(bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF) str = utf8read(str.slice(3));
	else if(opts.type != 'string' && opts.codepage == 65001) str = utf8read(str);
	else if((opts.type == 'binary') && typeof cptable !== 'undefined' && opts.codepage) str = cptable.utils.decode(opts.codepage, cptable.utils.encode(28591,str));
	if(str.slice(0,19) == "socialcalc:version:") return ETH.to_sheet(opts.type == 'string' ? str : utf8read(str), opts);
	return prn_to_sheet_str(str, opts);
}
```
The switch takes the `buffer` case, which calls `.toString('utf8')`, so `str` already holds the properly decoded string. The function should simply go on to call `prn_to_sheet_str`. Instead, it calls `utf8read`, which reinterprets the already-decoded string and effectively corrupts it. I guess `utf8read` is meant for `opts.type === 'binary'`, but it is also applied to `buffer` input, and additionally whenever the data starts with a BOM.
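The corruption can be reproduced without xlsx.js at all. The sketch below models the double decode: `utf8read` effectively treats each code unit of an already-decoded string as a raw byte and runs UTF-8 decoding again (here simulated with `Buffer.from(str, 'latin1')`, a simplified stand-in for the actual `utf8read`):

```javascript
// A UTF-8 buffer containing a non-ASCII character ('é' = 0xC3 0xA9).
const buf = Buffer.from('héllo', 'utf8');

// Step 1: the 'buffer' case decodes correctly.
const str = buf.toString('utf8'); // 'héllo'

// Step 2 (the bug): decode the decoded string again. 'é' (U+00E9) maps
// to the lone byte 0xE9, which is not a valid UTF-8 sequence, so the
// decoder emits the replacement character U+FFFD.
const secondPass = Buffer.from(str, 'latin1').toString('utf8');

console.log(str);        // héllo
console.log(secondPass); // h�llo
```

This is exactly the symptom from the report: non-ASCII characters turn into U+FFFD while ASCII characters survive, because ASCII is a fixed point of UTF-8 decoding.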
Issue Analytics
- State:
- Created 2 years ago
- Reactions:1
- Comments:5 (5 by maintainers)
I've been trying to figure this one out for a few hours now. Based on your description, a temporary workaround for my case was adding `opts.type != 'buffer'` to line 7904 of xlsx.js. https://github.com/SheetJS/sheetjs/commit/92d8a38ef62bdaee55a5d5d109ca22515fe215be
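A minimal sketch of the guard that workaround describes, extracted into a hypothetical helper (not part of xlsx.js; the option names match the snippet above):

```javascript
// Hypothetical helper modeling the patched condition: skip the second
// UTF-8 decode when the 'buffer' case has already produced a decoded
// string via d.toString('utf8').
function shouldUtf8Read(opts) {
  // original: opts.type != 'string' && opts.codepage == 65001
  // patched:  also exclude 'buffer'
  return opts.type != 'string' && opts.type != 'buffer' && opts.codepage == 65001;
}

console.log(shouldUtf8Read({ type: 'buffer', codepage: 65001 })); // false
console.log(shouldUtf8Read({ type: 'binary', codepage: 65001 })); // true
```

With this guard, `binary` input (where `str` still holds raw bytes as characters) is decoded once, while `buffer` input is left alone.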