ReadableStreamStreamer sometimes breaks UTF8-characters to ��
Issue
If I pass a CSV stream that contains special characters to Papa.parse, it sometimes breaks those characters so they show up as e.g. ��.
How to reproduce
See example at: https://repl.it/repls/ArcticTreasuredLine
Press “Run” at the top of the page
What should happen?
The output should contain only ä characters.
What happened?
There are random occurrences of ��.
Root cause
These two lines are responsible for this issue:
- https://github.com/mholt/PapaParse/blob/ae73d2a96639beec58a83326de6bd8e8ca0c02b3/papaparse.js#L863
- https://github.com/mholt/PapaParse/blob/ae73d2a96639beec58a83326de6bd8e8ca0c02b3/papaparse.js#L506
So when papaparse reads a chunk, it directly calls .toString on that chunk.
However, a chunk consists of bytes, and some UTF-8 characters are two bytes long:
- ä consists of two bytes: 11000011 and 10100100
- a (and other “regular” characters) is just one byte: 01100001
Now if a chunk boundary falls in the middle of a multi-byte character like ä, papaparse calls toString on each part of the character separately and produces two replacement characters:
- 11000011 (from the end of the first chunk) becomes �
- 10100100 (from the start of the second chunk) becomes �
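The split described above can be reproduced in Node.js without PapaParse. This is only a minimal sketch; the variable names are illustrative and the chunk boundary is chosen by hand to fall inside ä:

```javascript
// "aäb" encoded as UTF-8 is four bytes: 61 c3 a4 62
const bytes = Buffer.from('a\u00e4b', 'utf8');

// Simulate a chunk boundary that falls between the two bytes of ä.
const firstChunk = bytes.subarray(0, 2); // 61 c3
const secondChunk = bytes.subarray(2);   // a4 62

// Decoding each chunk on its own turns each half of ä into U+FFFD.
const decodedPerChunk =
  firstChunk.toString('utf8') + secondChunk.toString('utf8');

// Joining the bytes first and decoding once keeps ä intact.
const decodedAfterConcat =
  Buffer.concat([firstChunk, secondChunk]).toString('utf8');

console.log(JSON.stringify(decodedPerChunk));
console.log(JSON.stringify(decodedAfterConcat));
```

The first string contains two replacement characters where ä should be; the second is the original text.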
How to fix this issue?
If the received chunks are bytes, the concatenation should be done in bytes too, e.g. using Buffer.concat. Papaparse should not call toString before it has split the stream into lines, so that _partialLine remains a buffer rather than a string.
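A rough sketch of that idea, independent of PapaParse's internals: keep the unfinished line as a Buffer and decode only complete lines. The helper name `createByteLineSplitter` and its `push`/`flush` methods are hypothetical and not part of any library. Splitting on the newline byte is safe because 0x0A never occurs inside a multi-byte UTF-8 sequence.

```javascript
// Hypothetical helper, not PapaParse code: splits incoming byte chunks
// into lines and decodes a line only once all of its bytes have arrived.
function createByteLineSplitter() {
  let partialLine = Buffer.alloc(0); // bytes of the unfinished last line

  return {
    // Accepts a Buffer chunk and returns the complete lines it finishes.
    push(chunk) {
      const data = Buffer.concat([partialLine, chunk]);
      const lines = [];
      let start = 0;
      let newlineIndex;
      while ((newlineIndex = data.indexOf(0x0a, start)) !== -1) {
        lines.push(data.subarray(start, newlineIndex).toString('utf8'));
        start = newlineIndex + 1;
      }
      partialLine = data.subarray(start); // still bytes, not a string
      return lines;
    },
    // Call at end of input to get the trailing line, if any.
    flush() {
      const rest = partialLine.toString('utf8');
      partialLine = Buffer.alloc(0);
      return rest.length > 0 ? [rest] : [];
    },
  };
}
```

A chunk that ends in the middle of ä simply yields no line until the remaining byte and a newline (or the end of input) arrive.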
Issue Analytics
- State:
- Created 4 years ago
- Comments: 5
@nichgalea if you don’t mind the extra memory usage, I believe you can call .text() on the file and pass the result to Papa.parse as a string. So something like:
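A minimal sketch of that suggestion, assuming `file` is a browser File or Blob (its `.text()` method resolves to the fully decoded contents). The function name `parseFileAsText` is hypothetical, and the Papa object is passed in as a parameter only to keep the snippet self-contained:

```javascript
// Hypothetical helper: reads the whole file into memory as one string,
// so no multi-byte character can be split across chunk boundaries.
async function parseFileAsText(file, Papa, config = {}) {
  const text = await file.text(); // the entire file is decoded in one go
  return Papa.parse(text, config);
}
```

The trade-off is the one mentioned above: the whole file is held in memory instead of being streamed.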
Workaround
I implemented a workaround for this issue: use another library to pre-parse the lines as a stream.
I’m using the delimiter-stream NPM package, which seems to have a correct implementation of line parsing as a byte stream: https://github.com/peterhaldbaek/delimiter-stream/blob/043346af778986d63a7ba0f87b94c3df0bc425d4/delimiter-stream.js#L46
Using this library you can write a simple wrapper for your stream, then use that helper function to wrap the stream before passing it to Papa.parse:
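The sketch below shows the shape of such a wrapper. It does not use delimiter-stream; it swaps in Node's built-in `stream.Transform` to do the same byte-level line splitting, so it should be read as an illustration of the approach rather than the original helper. The name `wrapStreamInWholeLines` is hypothetical.

```javascript
const { Transform } = require('stream');

// Hypothetical helper: re-chunks a byte stream so that every emitted chunk
// ends on a newline and therefore never ends inside a multi-byte character.
function wrapStreamInWholeLines(source) {
  let partialLine = Buffer.alloc(0); // bytes after the last newline seen

  const splitter = new Transform({
    transform(chunk, _encoding, callback) {
      const data = Buffer.concat([partialLine, chunk]);
      const lastNewline = data.lastIndexOf(0x0a);
      if (lastNewline === -1) {
        // No complete line yet: keep buffering bytes.
        partialLine = data;
        callback();
        return;
      }
      partialLine = data.subarray(lastNewline + 1);
      // Emit everything up to and including the last newline.
      callback(null, data.subarray(0, lastNewline + 1));
    },
    flush(callback) {
      // End of input: emit whatever is left as the final line.
      if (partialLine.length > 0) {
        this.push(partialLine);
      }
      callback();
    },
  });

  // pipe() does not forward errors, so pass them on explicitly.
  source.on('error', (err) => splitter.destroy(err));
  return source.pipe(splitter);
}
```

The wrapped stream can then be given to Papa.parse in place of the original one, for example `Papa.parse(wrapStreamInWholeLines(fs.createReadStream(csvPath)), config)`, where `csvPath` and `config` are placeholders.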