Unicode characters outside the basic multilingual plane can cause an ArgumentException in WriteStringChunked if the write buffer is full or nearly full
Steps to reproduce
In the unit test below, the “does not throw” assertion fails when there are 0, 1, 2 or 3 bytes of space left in the write buffer:
[Test, IssueLink("TODO")]
public void ChunkedStringEncodingSurrogatePairs()
{
    // Fill the buffer so that only a single byte of space remains.
    WriteBuffer.WriteBytes(new byte[WriteBuffer.Size - 1], 0, WriteBuffer.Size - 1);
    Assert.That(WriteBuffer.WriteSpaceLeft, Is.EqualTo(1));
    var charsUsed = 1;
    var completed = true;
    // This string has length two because it contains a character outside the Basic Multilingual Plane.
    // In UTF-16, which is what .NET uses internally for string representation, this cyclone is represented via a surrogate pair.
    var cyclone = "🌀";
    Assert.That(() => WriteBuffer.WriteStringChunked(cyclone, 0, cyclone.Length, true, out charsUsed, out completed), Throws.Nothing);
    Assert.That(charsUsed, Is.EqualTo(0));
    Assert.That(completed, Is.False);
}
The issue
This is very similar to https://github.com/npgsql/npgsql/issues/2849, which I raised and subsequently fixed in https://github.com/npgsql/npgsql/pull/2850. That fix did not account for Unicode characters outside the Basic Multilingual Plane, which are represented in UTF-16 as inseparable surrogate pairs (https://en.wikipedia.org/wiki/Universal_Character_Set_characters#Surrogates).
When the write buffer is full or nearly full, the checks I introduced don’t work when sPtr + charIndex points to one of these characters, because the right-hand side of the following inequality evaluates to 0:
if (WriteSpaceLeft < _textEncoder.GetByteCount(sPtr + charIndex, 1, flush: false))
so we end up calling
_textEncoder.Convert(sPtr + charIndex, charCount, bufPtr + WritePosition, WriteSpaceLeft, flush, out charsUsed, out bytesUsed, out completed);
which then throws because WriteSpaceLeft is less than 4, the number of bytes required to convert the character into UTF-8 (I think… I don’t really understand why the second and fourth lines below return different values…)
var cyclone = "🌀";
Encoding.UTF8.GetEncoder().GetByteCount(cyclone.ToCharArray(), 0, 1, false); // returns 0
Encoding.UTF8.GetEncoder().GetByteCount(cyclone.ToCharArray(), 0, 2, false); // returns 4
Encoding.UTF8.GetByteCount(cyclone.ToCharArray(), 0, 1); // returns 3
Encoding.UTF8.GetByteCount(cyclone.ToCharArray(), 0, 2); // returns 4
I think this can be fixed by changing the count argument in GetByteCount to Math.Min(2, charCount) to account for these surrogate pairs. Honestly, though, I don’t know enough about Unicode and encoding to be confident that’s a correct fix. The pragmatic alternative might be to change the WriteSpaceLeft check to 4 bytes (more?), with the trade-off being that we’d sometimes flush unnecessarily.
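The proposed check can be tried outside Npgsql with a minimal sketch like the one below. The class and method names (SurrogateAwareCheck, NextCharFits) are hypothetical and only illustrate the Math.Min(2, charCount) idea; this is not the code that was merged.

```csharp
using System;
using System.Text;

public static class SurrogateAwareCheck
{
    // Hypothetical helper, not part of Npgsql: reports whether the next one or
    // two chars starting at charIndex would fit into spaceLeft bytes of UTF-8.
    public static bool NextCharFits(string s, int charIndex, int charCount, int spaceLeft)
    {
        if (charCount == 0)
            return true;

        var encoder = Encoding.UTF8.GetEncoder();
        var chars = s.ToCharArray();

        // Measure up to two chars so that a surrogate pair is counted as a
        // whole (4 bytes) instead of as a lone leading surrogate (0 bytes).
        var probe = Math.Min(2, charCount);
        var needed = encoder.GetByteCount(chars, charIndex, probe, flush: false);

        return spaceLeft >= needed;
    }
}
```

With one byte left, NextCharFits("\uD83C\uDF00", 0, 2, 1) reports that the cyclone does not fit, so the caller can flush instead of calling Convert. One caveat of this sketch: two ordinary one-byte characters are also measured together, so it can ask for a flush slightly earlier than strictly necessary.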
Stack trace:
System.ArgumentException: The output byte buffer is too small to contain the encoded data, encoding 'Unicode (UTF-8)' fallback 'System.Text.EncoderExceptionFallback'. (Parameter 'bytes')
at System.Text.Encoding.ThrowBytesOverflow()
at System.Text.Encoding.ThrowBytesOverflow(EncoderNLS encoder, Boolean nothingEncoded)
at System.Text.Encoding.GetBytesWithFallback(ReadOnlySpan`1 chars, Int32 originalCharsLength, Span`1 bytes, Int32 originalBytesLength, EncoderNLS encoder)
at System.Text.Encoding.GetBytesWithFallback(Char* pOriginalChars, Int32 originalCharCount, Byte* pOriginalBytes, Int32 originalByteCount, Int32 charsConsumedSoFar, Int32 bytesWrittenSoFar, EncoderNLS encoder)
at System.Text.Encoding.GetBytes(Char* pChars, Int32 charCount, Byte* pBytes, Int32 byteCount, EncoderNLS encoder)
at System.Text.EncoderNLS.Convert(Char* chars, Int32 charCount, Byte* bytes, Int32 byteCount, Boolean flush, Int32& charsUsed, Int32& bytesUsed, Boolean& completed)
at Npgsql.NpgsqlWriteBuffer.WriteStringChunked(String s, Int32 charIndex, Int32 charCount, Boolean flush, Int32& charsUsed, Boolean& completed)
at Npgsql.NpgsqlWriteBuffer.<WriteString>g__WriteStringLong|58_0(NpgsqlWriteBuffer buffer, Boolean async, String s, Int32 charLen, Int32 byteLen, CancellationToken cancellationToken)
at Npgsql.TypeHandlers.ArrayHandler`1.WriteGeneric(ICollection`1 value, NpgsqlWriteBuffer buf, NpgsqlLengthCache lengthCache, Boolean async, CancellationToken cancellationToken)
at Npgsql.NpgsqlBinaryImporter.Write[T](T value, NpgsqlParameter param, Boolean async, CancellationToken cancellationToken)
at <OUR CODE>
Further technical details
Npgsql version: 5.0.4
PostgreSQL version: 11.10
Operating system: Linux and Windows
Merged this for 5.0.6 and 4.1.10, thanks @chrisdcmoore!
Encoder.GetByteCount returns 0 for the lone leading surrogate because it’s assumed the Encoder will be called repeatedly (for sequential chunks of the input), so it maintains internal state between calls. If you were to call Encoder.GetBytes with that input (i.e., "\uD83C"), it would write 0 bytes to the output, store that leading surrogate in its internal buffer and wait for the trailing surrogate ('\uDF00') to be provided, at which point 4 bytes (F0 9F 8C 80) would be written. If, on the other hand, the next chunk of input text didn’t start with a valid trailing surrogate, 3 bytes (EF BF BD) for the Unicode Replacement Character, U+FFFD, would be written instead. (Note that only GetBytes updates the encoder’s internal state; GetByteCount does not. It’s only valid to call GetByteCount for the immediate next chunk of text that follows the text already passed to GetBytes.)
Encoding.GetByteCount returns 3 because it assumes it’s processing all the input in one go. A leading surrogate by itself is invalid, so it gets replaced with the Unicode Replacement Character, U+FFFD. This is converted to the three UTF-8 bytes EF BF BD.
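The stateful behaviour described above can be observed with a small sketch that feeds the two halves of the surrogate pair to a single Encoder in separate calls. The names EncoderStateDemo and EncodeInTwoChunks are made up for illustration.

```csharp
using System;
using System.Text;

public static class EncoderStateDemo
{
    // Feeds a surrogate pair to one Encoder in two separate calls and returns
    // the number of bytes written by each call plus the bytes produced.
    public static (int First, int Second, byte[] Bytes) EncodeInTwoChunks()
    {
        var encoder = Encoding.UTF8.GetEncoder();
        var chars = "\uD83C\uDF00".ToCharArray(); // the cyclone, U+1F300
        var buffer = new byte[8];

        // Leading surrogate only: the encoder keeps it in its internal state.
        var first = encoder.GetBytes(chars, 0, 1, buffer, 0, flush: false);

        // Trailing surrogate: the pair is now complete and can be encoded.
        var second = encoder.GetBytes(chars, 1, 1, buffer, first, flush: false);

        var bytes = new byte[first + second];
        Array.Copy(buffer, bytes, bytes.Length);
        return (first, second, bytes);
    }
}
```

Per the explanation above, the first call should write 0 bytes and the second should write the 4 bytes F0 9F 8C 80, which is why a GetByteCount probe over a single char cannot be used on its own to decide whether a surrogate pair fits in the remaining buffer space.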