Reading UTF-8 encoded buffers results in decoding errors

If you pass a UTF-8 encoded buffer to read with the options {type: 'buffer', codepage: 65001}, the cell values contain Unicode replacement characters (U+FFFD) in place of non-ASCII characters.

In version 0.16.8, this only happened if the buffer started with a BOM. Now, it always happens.

The reason is that prn_to_sheet tries to decode buffers twice:

function prn_to_sheet(d, opts) {
	var str = "", bytes = opts.type == 'string' ? [0,0,0,0] : firstbyte(d, opts);
	switch(opts.type) {
		case 'base64': str = Base64.decode(d); break;
		case 'binary': str = d; break;
		case 'buffer':
			if(opts.codepage == 65001) str = d.toString('utf8'); // TODO: test if buf
			else if(opts.codepage && typeof cptable !== 'undefined') str = cptable.utils.decode(opts.codepage, d);
			else str = has_buf && Buffer.isBuffer(d) ? d.toString('binary') : a2s(d);
			break;
		case 'array': str = cc2str(d); break;
		case 'string': str = d; break;
		default: throw new Error("Unrecognized type " + opts.type);
	}
	if(bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF) str = utf8read(str.slice(3));
	else if(opts.type != 'string' && opts.codepage == 65001) str = utf8read(str);
	else if((opts.type == 'binary') && typeof cptable !== 'undefined' && opts.codepage)  str = cptable.utils.decode(opts.codepage, cptable.utils.encode(28591,str));
	if(str.slice(0,19) == "socialcalc:version:") return ETH.to_sheet(opts.type == 'string' ? str : utf8read(str), opts);
	return prn_to_sheet_str(str, opts);
}

The switch takes the buffer case, which calls .toString('utf8'), so str already contains the correctly decoded string. At that point the function should simply go on to call prn_to_sheet_str.

Instead, it calls utf8read, which appears to expect a string of raw bytes (one character per byte) and decodes it as UTF-8 a second time, which corrupts the already-decoded string. I guess utf8read is meant for opts.type === 'binary', but it is also applied to buffers, and to any input that starts with a BOM.
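
The effect can be reproduced without the library. The sketch below assumes Node.js and uses a hypothetical decodeBinaryString helper as a stand-in for a binary-string decoder; it is not the library's utf8read, only an approximation of what such a function does. The sample text is made up for illustration.

```javascript
// Hypothetical stand-in for a "binary string -> UTF-8" decoder.
// It treats every character of its input as one raw byte.
function decodeBinaryString(binStr) {
  return Buffer.from(binStr, 'latin1').toString('utf8');
}

const buf = Buffer.from('héllo', 'utf8');

// Correct: decode the UTF-8 bytes exactly once.
const once = buf.toString('utf8');

// Double decoding: the decoded 'é' (U+00E9) is read back as the lone byte 0xE9,
// which is not valid UTF-8 on its own, so it turns into U+FFFD.
const twice = decodeBinaryString(once);

// Intended use: the input really is a binary string, one character per byte.
const viaBinary = decodeBinaryString(buf.toString('binary'));

console.log(JSON.stringify({ once, twice, viaBinary }));
```

Only the double-decoded value loses the non-ASCII character, which matches the replacement characters described in the report.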

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Reactions: 1
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

hugoaboud commented, Dec 16, 2021 (1 reaction)

I’ve been trying to figure this one out for a few hours now. Based on your description, a temporary workaround for my case was adding opts.type != 'buffer' to line 7904 of xlsx.js.

else if(opts.type != 'string' && opts.type != 'buffer' && opts.codepage == 65001) str = utf8read(str);
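
The same idea as a standalone sketch, assuming Node.js and UTF-8 input (codepage 65001) throughout. decodeInput and stripBom are hypothetical helpers written for illustration and are not part of xlsx.js. Unlike the one-line workaround above, this sketch also handles a leading BOM without a second decoding pass.

```javascript
// Hypothetical helper, not the library's code: each input type is decoded exactly once.
function decodeInput(d, opts) {
  switch (opts.type) {
    case 'buffer':
      // toString('utf8') already yields a proper string; no further decoding.
      return stripBom(d.toString('utf8'));
    case 'binary':
      // One character per byte: rebuild the bytes, then decode them as UTF-8.
      return stripBom(Buffer.from(d, 'latin1').toString('utf8'));
    case 'string':
      // Already decoded by the caller.
      return d;
    default:
      throw new Error('Unrecognized type ' + opts.type);
  }
}

// Drop a leading U+FEFF, which is what a UTF-8 BOM decodes to.
function stripBom(s) {
  return s.charCodeAt(0) === 0xFEFF ? s.slice(1) : s;
}

// Usage: the same BOM-prefixed bytes, passed as a buffer and as a binary string.
const sample = Buffer.from('\uFEFFcafé', 'utf8');
const fromBuffer = decodeInput(sample, { type: 'buffer' });
const fromBinary = decodeInput(sample.toString('binary'), { type: 'binary' });
console.log(fromBuffer === fromBinary, fromBuffer);
```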