ReadableStreamStreamer sometimes breaks UTF8-characters to ��

Issue

If I pass a CSV stream to Papa.parse that contains special characters, it sometimes breaks the special characters so they show up as e.g. ��.

How to reproduce

See example at: https://repl.it/repls/ArcticTreasuredLine

Press “Run” at the top of the page

What should happen?

The output should contain only ä characters.

What happened?

There are random occurrences of �� in the output.

Root cause

These two lines are responsible for this issue:

  • https://github.com/mholt/PapaParse/blob/ae73d2a96639beec58a83326de6bd8e8ca0c02b3/papaparse.js#L863
  • https://github.com/mholt/PapaParse/blob/ae73d2a96639beec58a83326de6bd8e8ca0c02b3/papaparse.js#L506

So when PapaParse reads a chunk, it immediately calls .toString() on that chunk.

However, a chunk consists of bytes, and some UTF-8 characters are encoded as more than one byte:

  • ä consists of two bytes: 11000011 and 10100100
  • a (and other “regular” characters) is just one byte 01100001

Now if a chunk boundary falls in the middle of a multi-byte character like ä, PapaParse calls toString on each half separately and produces two replacement characters:

  • 11000011 (from end of first chunk) transforms to �
  • 10100100 (from start of second chunk) transforms to �
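
The effect can be reproduced without PapaParse. The following is a minimal Node.js sketch that simulates a chunk boundary between the two bytes of ä; the variable names are illustrative only and do not come from the PapaParse source.

```javascript
// "ä" is encoded in UTF-8 as the two bytes 0xC3 (11000011) and 0xA4 (10100100).
const whole = Buffer.from('ä', 'utf8');

// Simulate a chunk boundary that falls between the two bytes.
const firstChunk = whole.subarray(0, 1);
const secondChunk = whole.subarray(1);

// Decoding each chunk on its own turns each half into U+FFFD (the replacement character).
const brokenText = firstChunk.toString('utf8') + secondChunk.toString('utf8');

// Joining the bytes first and decoding once keeps the character intact.
const intactText = Buffer.concat([firstChunk, secondChunk]).toString('utf8');

console.log(JSON.stringify(brokenText));
console.log(JSON.stringify(intactText));
```

The only difference between the two results is whether decoding happens before or after the bytes are joined.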

How to fix this issue?

If chunks are received as bytes, they should also be concatenated as bytes, e.g. using Buffer.concat. PapaParse should not call toString before it has split the stream into lines, so that _partialLine remains a buffer rather than a string.
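
As a rough sketch of that idea (not the actual PapaParse code or a proposed patch), the pending partial line can be kept as a Buffer and decoded only up to the last newline byte. The helper name createByteLineSplitter and its callback are hypothetical and exist only for this illustration.

```javascript
// Hypothetical helper, not part of PapaParse: the partial line stays a Buffer.
function createByteLineSplitter(onLines) {
  let pending = Buffer.alloc(0);
  return {
    write(chunk) {
      // Concatenate as bytes, never as strings.
      pending = Buffer.concat([pending, chunk]);
      // The byte 0x0A ("\n") never occurs inside a multi-byte UTF-8 sequence,
      // so cutting after the last newline cannot split a character.
      const lastNewline = pending.lastIndexOf(0x0a);
      if (lastNewline === -1) return; // no complete line yet
      const complete = pending.subarray(0, lastNewline + 1);
      pending = pending.subarray(lastNewline + 1);
      onLines(complete.toString('utf8'));
    },
    end() {
      // Flush whatever is left once the stream has ended.
      if (pending.length > 0) onLines(pending.toString('utf8'));
      pending = Buffer.alloc(0);
    },
  };
}

// Usage: split the input in the middle of "ä" and feed the halves separately.
const decoded = [];
const splitter = createByteLineSplitter(text => decoded.push(text));
const bytes = Buffer.from('a,ä\nb,ö\n', 'utf8');
splitter.write(bytes.subarray(0, 3)); // ends with the first byte of "ä"
splitter.write(bytes.subarray(3));    // starts with the second byte of "ä"
splitter.end();
console.log(decoded.join(''));
```

This sketch assumes "\n" line endings and UTF-8 input; a real fix would also have to respect the configured newline and encoding options.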

Issue Analytics

  • State: open
  • Created 4 years ago
  • Comments: 5

Top GitHub Comments

jehna commented, Jun 1, 2020

@nichgalea if you don’t mind the extra memory usage, I believe you can call .text() on the file and pass the result to Papa.parse as a string.

So something like:

Papa.parse(await file.text());

jehna commented, Dec 10, 2019

Workaround

I implemented a workaround for this issue: Use another library to pre-parse the lines as a stream.

I’m using the delimiter-stream npm package, which seems to have a correct implementation of line parsing as a byte stream:

https://github.com/peterhaldbaek/delimiter-stream/blob/043346af778986d63a7ba0f87b94c3df0bc425d4/delimiter-stream.js#L46

Using this library you can write a simple wrapper around your stream:

const DelimiterStream = require('delimiter-stream')

const toLineDelimitedStream = input => {
  // Multi-byte UTF-8 characters (such as "ä") can break because the chunk might
  // get split in the middle of the character, and papaparse parses the byte
  // stream incorrectly. We can use `DelimiterStream` to fix this, as it splits
  // the chunks into lines correctly before passing the data to papaparse.
  const output = new DelimiterStream()
  input.pipe(output)
  return output
}

Using this helper function you can wrap the stream before passing it to Papa.parse:

Papa.parse(toLineDelimitedStream(stream), {
   ...
})
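
A dependency-free variation on the same idea is Node's built-in string_decoder module, which holds back an incomplete multi-byte sequence until the remaining bytes arrive. This is a swapped-in technique rather than the workaround described above, and the sketch below only shows the decoder's behaviour in isolation; wiring it in front of Papa.parse is not shown and is left as an assumption.

```javascript
const { StringDecoder } = require('string_decoder');

const decoder = new StringDecoder('utf8');
const source = Buffer.from('ä', 'utf8');

// Feed the two bytes of "ä" separately, as if a chunk boundary split them.
const first = decoder.write(source.subarray(0, 1)); // incomplete sequence is held back
const second = decoder.write(source.subarray(1));   // sequence is now complete
const rest = decoder.end();                         // nothing left to flush

console.log(JSON.stringify([first, second, rest]));
```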