Unicode characters outside the basic multilingual plane can cause an ArgumentException in WriteStringChunked if the write buffer is full or nearly full

Steps to reproduce

The unit test below reproduces the problem: the “does not throw” assertion fails when there are 0, 1, 2 or 3 bytes of space left in the write buffer:

[Test, IssueLink("TODO")]
public void ChunkedStringEncodingSurrogatePairs()
{
    WriteBuffer.WriteBytes(new byte[WriteBuffer.Size - 1], 0, WriteBuffer.Size - 1);
    Assert.That(WriteBuffer.WriteSpaceLeft, Is.EqualTo(1));

    var charsUsed = 1;
    var completed = true;
    // This string has length two because it contains a character outside the Basic Multilingual Plane
    // In UTF-16, which is what .NET uses internally for string representation, this cyclone is represented via a surrogate pair.
    var cyclone = "🌀";

    Assert.That(() => WriteBuffer.WriteStringChunked(cyclone, 0, cyclone.Length, true, out charsUsed, out completed), Throws.Nothing);
    Assert.That(charsUsed, Is.EqualTo(0));
    Assert.That(completed, Is.False);
}

The issue

This is very similar to https://github.com/npgsql/npgsql/issues/2849, which I raised and subsequently fixed in https://github.com/npgsql/npgsql/pull/2850. However, my fix did not account for Unicode characters outside the Basic Multilingual Plane, which are represented in UTF-16 as inseparable surrogate pairs (https://en.wikipedia.org/wiki/Universal_Character_Set_characters#Surrogates).

When the write buffer is full or nearly full, the checks I introduced don’t work when sPtr + charIndex points to the first char of one of these surrogate pairs: the right-hand side of this inequality evaluates to 0

if (WriteSpaceLeft < _textEncoder.GetByteCount(sPtr + charIndex, 1, flush: false))

so we end up calling

_textEncoder.Convert(sPtr + charIndex, charCount, bufPtr + WritePosition, WriteSpaceLeft, flush, out charsUsed, out bytesUsed, out completed);

which then throws, since WriteSpaceLeft is less than 4, the number of bytes required to convert the character into UTF-8 (I think… I don’t really understand why the first and third lines below return different values…)

    var cyclone = "🌀";
    Encoding.UTF8.GetEncoder().GetByteCount(cyclone.ToCharArray(), 0, 1, false); // returns 0
    Encoding.UTF8.GetEncoder().GetByteCount(cyclone.ToCharArray(), 0, 2, false); // returns 4
    Encoding.UTF8.GetByteCount(cyclone.ToCharArray(), 0, 1); // returns 3
    Encoding.UTF8.GetByteCount(cyclone.ToCharArray(), 0, 2); // returns 4
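The overflow can be reproduced outside Npgsql with a bare encoder. The following is only a minimal sketch: it uses managed arrays instead of the unsafe pointers in WriteStringChunked, and the class and method names (ConvertOverflowDemo, ThrowsOnConvert) are made up for illustration.

```csharp
using System;
using System.Text;

static class ConvertOverflowDemo
{
    // Hypothetical helper: tries to encode the cyclone into an output buffer of
    // the given size and reports whether Encoder.Convert threw ArgumentException.
    public static bool ThrowsOnConvert(int bufferSize)
    {
        var chars = "\uD83C\uDF00".ToCharArray(); // the cyclone, as a surrogate pair
        var output = new byte[bufferSize];
        var encoder = Encoding.UTF8.GetEncoder();
        try
        {
            encoder.Convert(chars, 0, chars.Length, output, 0, output.Length, true,
                out var charsUsed, out var bytesUsed, out var completed);
            return false;
        }
        catch (ArgumentException)
        {
            // The output buffer cannot hold even one complete encoded character.
            return true;
        }
    }

    static void Main()
    {
        Console.WriteLine(ThrowsOnConvert(3)); // True: one byte short of the 4 needed
        Console.WriteLine(ThrowsOnConvert(4)); // False: the pair fits exactly
    }
}
```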

I think this can be fixed by changing the count argument in GetByteCount to Math.Min(2, charCount) to account for these surrogate pairs. Honestly, though, I don’t know enough about Unicode and encoding to be confident that this is a correct fix. The pragmatic alternative might be to change the WriteSpaceLeft check to 4 bytes (more?), with the trade-off being that we’d sometimes flush unnecessarily.
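To make the first option concrete, here is a rough sketch of what such a check could look like. It is an assumption-laden illustration, not Npgsql's code: it works on a managed char[] rather than on pointers, and NeedsFlushFirst and ChunkCheckSketch are hypothetical names.

```csharp
using System;
using System.Text;

static class ChunkCheckSketch
{
    // Hypothetical stand-in for the space check in WriteStringChunked: returns
    // true when the next one or two chars would not fit in the remaining space.
    public static bool NeedsFlushFirst(Encoder encoder, char[] chars, int charIndex,
        int charCount, int writeSpaceLeft)
    {
        // Measure up to two chars so that a surrogate pair is counted as a whole
        // (4 bytes) instead of as a lone leading surrogate (0 bytes).
        var probe = Math.Min(2, charCount);
        return writeSpaceLeft < encoder.GetByteCount(chars, charIndex, probe, false);
    }

    static void Main()
    {
        var encoder = Encoding.UTF8.GetEncoder();
        var cyclone = "\uD83C\uDF00".ToCharArray();
        var ascii = "ab".ToCharArray();

        Console.WriteLine(NeedsFlushFirst(encoder, cyclone, 0, 2, 1)); // True
        Console.WriteLine(NeedsFlushFirst(encoder, cyclone, 0, 2, 4)); // False
        Console.WriteLine(NeedsFlushFirst(encoder, ascii, 0, 2, 1));   // True
    }
}
```

The last call shows a cost of this approach: with one byte left, the single char 'a' would still fit, but measuring two chars at once reports that a flush is needed, so the buffer can occasionally be flushed slightly earlier than strictly necessary.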

Stack trace:
System.ArgumentException: The output byte buffer is too small to contain the encoded data, encoding 'Unicode (UTF-8)' fallback 'System.Text.EncoderExceptionFallback'. (Parameter 'bytes')
	   at System.Text.Encoding.ThrowBytesOverflow()
	   at System.Text.Encoding.ThrowBytesOverflow(EncoderNLS encoder, Boolean nothingEncoded)
	   at System.Text.Encoding.GetBytesWithFallback(ReadOnlySpan`1 chars, Int32 originalCharsLength, Span`1 bytes, Int32 originalBytesLength, EncoderNLS encoder)
	   at System.Text.Encoding.GetBytesWithFallback(Char* pOriginalChars, Int32 originalCharCount, Byte* pOriginalBytes, Int32 originalByteCount, Int32 charsConsumedSoFar, Int32 bytesWrittenSoFar, EncoderNLS encoder)
	   at System.Text.Encoding.GetBytes(Char* pChars, Int32 charCount, Byte* pBytes, Int32 byteCount, EncoderNLS encoder)
	   at System.Text.EncoderNLS.Convert(Char* chars, Int32 charCount, Byte* bytes, Int32 byteCount, Boolean flush, Int32& charsUsed, Int32& bytesUsed, Boolean& completed)
	   at Npgsql.NpgsqlWriteBuffer.WriteStringChunked(String s, Int32 charIndex, Int32 charCount, Boolean flush, Int32& charsUsed, Boolean& completed)
	   at Npgsql.NpgsqlWriteBuffer.<WriteString>g__WriteStringLong|58_0(NpgsqlWriteBuffer buffer, Boolean async, String s, Int32 charLen, Int32 byteLen, CancellationToken cancellationToken)
	   at Npgsql.TypeHandlers.ArrayHandler`1.WriteGeneric(ICollection`1 value, NpgsqlWriteBuffer buf, NpgsqlLengthCache lengthCache, Boolean async, CancellationToken cancellationToken)
	   at Npgsql.NpgsqlBinaryImporter.Write[T](T value, NpgsqlParameter param, Boolean async, CancellationToken cancellationToken)
	   at <OUR CODE>

Further technical details

Npgsql version: 5.0.4
PostgreSQL version: 11.10
Operating system: Linux and Windows

Top GitHub Comments

roji commented, May 29, 2021

Merged this for 5.0.6 and 4.1.10, thanks @chrisdcmoore!

bgrainger commented, May 14, 2021

I don’t really understand why the first and third lines below return different values

Encoding.UTF8.GetEncoder().GetByteCount(cyclone.ToCharArray(), 0, 1, false); // returns 0

This returns 0 because it’s assumed the Encoder will be called repeatedly (for sequential chunks of the input), so it maintains internal state between calls. If you were to call Encoder.GetBytes with that input (i.e., "\uD83C"), it would write 0 bytes to the output, store that leading surrogate in its internal buffer and wait for the trailing surrogate ('\uDF00') to be provided, at which point 4 bytes (F0 9F 8C 80) would be written. If, on the other hand, the next chunk of input text didn’t start with a valid trailing surrogate, 3 bytes (EF BF BD) for the Unicode Replacement Character, U+FFFD, would be written instead. (Note that only GetBytes updates the encoder’s internal state; GetByteCount does not. It’s only valid to call GetByteCount for the immediate next chunk of text that follows the text already passed to GetBytes.)
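A small sketch of that stateful behaviour, feeding the two halves of the cyclone to the same Encoder in separate calls. The names StatefulEncoderDemo and EncodeChunk are made up for this illustration.

```csharp
using System;
using System.Text;

static class StatefulEncoderDemo
{
    // Hypothetical helper: encodes one chunk with flush = false and returns
    // only the bytes that were actually written for it.
    public static byte[] EncodeChunk(Encoder encoder, char[] chunk)
    {
        var output = new byte[8];
        var written = encoder.GetBytes(chunk, 0, chunk.Length, output, 0, false);
        var result = new byte[written];
        Array.Copy(output, result, written);
        return result;
    }

    static void Main()
    {
        var encoder = Encoding.UTF8.GetEncoder();

        // The leading surrogate alone produces no output; the encoder holds on to it.
        var first = EncodeChunk(encoder, new[] { '\uD83C' });
        // The trailing surrogate completes the pair and all four bytes come out.
        var second = EncodeChunk(encoder, new[] { '\uDF00' });

        Console.WriteLine(first.Length);                  // 0
        Console.WriteLine(BitConverter.ToString(second)); // F0-9F-8C-80
    }
}
```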

Encoding.UTF8.GetByteCount(cyclone.ToCharArray(), 0, 1); // returns 3

This returns 3 because Encoding.GetByteCount assumes it’s processing all the input in one go. A leading surrogate by itself is invalid, so it gets replaced with the Unicode Replacement Character, U+FFFD. This is converted to the three UTF-8 bytes EF BF BD.
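The stateless case can be checked the same way. In this sketch (StatelessEncodingDemo and EncodeLoneSurrogate are illustrative names only), the default Encoding.UTF8 instance substitutes U+FFFD for the lone surrogate:

```csharp
using System;
using System.Text;

static class StatelessEncodingDemo
{
    // Hypothetical helper: encodes a lone leading surrogate in a single call.
    public static byte[] EncodeLoneSurrogate()
    {
        var lone = new[] { '\uD83C' };
        // Encoding.UTF8 treats the input as complete, so the unpaired surrogate
        // is replaced with U+FFFD before being encoded.
        return Encoding.UTF8.GetBytes(lone);
    }

    static void Main()
    {
        var bytes = EncodeLoneSurrogate();
        Console.WriteLine(bytes.Length);                 // 3
        Console.WriteLine(BitConverter.ToString(bytes)); // EF-BF-BD
    }
}
```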
