Incorrect hash result from murmur3_32 with String input containing surrogate pairs

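Here is a test program comparing the behavior of Airlift's Murmur3Hash32.hash with Guava's Hashing.murmur3_32(), using both hashBytes and hashString: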

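import com.google.common.hash.Hashing;
import io.airlift.slice.Murmur3Hash32;
import io.airlift.slice.Slices;

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.List;

public class Murmur
{
    public static void main(String[] args)
    {
        for (String string : List.of("plan ascii", "BMP Piękna łąka w 東京都", "surrogate pair 💰")) {
            for (Charset charset : List.of(StandardCharsets.UTF_16, StandardCharsets.UTF_8)) {
                // The first two calls hash pre-encoded bytes; the third lets Guava encode the string itself
                int airliftValue = Murmur3Hash32.hash(Slices.wrappedBuffer(string.getBytes(charset)));
                int guavaValue1 = Hashing.murmur3_32().hashBytes(string.getBytes(charset)).asInt();
                int guavaValue2 = Hashing.murmur3_32().hashString(string, charset).asInt();
                System.out.println("string: " + string + ", charset: " + charset);
                System.out.println("airliftValue = " + airliftValue);
                System.out.println("guavaValue1  = " + guavaValue1);
                System.out.println("guavaValue2  = " + guavaValue2);
                System.out.println();
            }
        }
    }
}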

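Note that the results agree except in the last case (the surrogate pair string hashed as UTF-8), where Guava's hashString returns a different value from hashBytes. I expect them to agree in all cases, especially between the two alternative Guava API methods.

Results when run with Guava 30.1-jre and Airlift Slice 0.40: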

string: plan ascii, charset: UTF-16
airliftValue = -731716445
guavaValue1  = -731716445
guavaValue2  = -731716445

string: plan ascii, charset: UTF-8
airliftValue = -218266838
guavaValue1  = -218266838
guavaValue2  = -218266838

string: BMP Piękna łąka w 東京都, charset: UTF-16
airliftValue = -989030725
guavaValue1  = -989030725
guavaValue2  = -989030725

string: BMP Piękna łąka w 東京都, charset: UTF-8
airliftValue = 103331700
guavaValue1  = 103331700
guavaValue2  = 103331700

string: surrogate pair 💰, charset: UTF-16
airliftValue = 2147098392
guavaValue1  = 2147098392
guavaValue2  = 2147098392

string: surrogate pair 💰, charset: UTF-8
airliftValue = -1114908744
guavaValue1  = -1114908744
guavaValue2  = -2027737699

cc @losipiuk @wendigo @alexjo2144
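
The last input is the only one containing a character outside the Basic Multilingual Plane (BMP). As a small stdlib-only sketch (the class and method names here are hypothetical and not part of Guava or Airlift), the following shows how Java stores U+1F4B0 as two UTF-16 code units while UTF-8 encodes it as a single four-byte sequence. It does not reproduce Guava's internals; it only illustrates why code that walks a String char by char has to treat such characters specially.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of Guava or Airlift.
class SurrogatePairDemo
{
    // Number of Unicode code points, counting a surrogate pair as one.
    static int codePoints(String s)
    {
        return s.codePointCount(0, s.length());
    }

    // UTF-8 encoding of the string as space-separated lowercase hex bytes.
    static String utf8Hex(String s)
    {
        StringBuilder hex = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (hex.length() > 0) {
                hex.append(' ');
            }
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }

    public static void main(String[] args)
    {
        String bmp = "\u6771";            // 東, a BMP character
        String moneyBag = "\uD83D\uDCB0"; // 💰 (U+1F4B0), a surrogate pair in UTF-16

        for (String s : new String[] {bmp, moneyBag}) {
            System.out.println("chars       = " + s.length());
            System.out.println("code points = " + codePoints(s));
            System.out.println("UTF-8 bytes = " + utf8Hex(s));
            System.out.println();
        }
        // 東 prints: chars = 1, code points = 1, UTF-8 bytes = e6 9d b1
        // 💰 prints: chars = 2, code points = 1, UTF-8 bytes = f0 9f 92 b0
    }
}
```

For every BMP character the char count and the code point count are equal, which is consistent with the first two test strings hashing identically through all three calls.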

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 6 (6 by maintainers)

Top GitHub Comments

eamonnmcmanus commented, Sep 2, 2021

Our current plan is to deprecate Hashing.murmur3_32() and introduce a new Hashing.murmur3_32_fixed() that is identical except that it has correct behaviour for UTF-8 strings with non-BMP characters. That way we avoid breaking anyone who was using the current incorrect hash values to form keys to persistent storage.
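
To check which of the two values in the last case is the expected one without depending on either library, one option is to hash the UTF-8 bytes with a plain MurmurHash3 x86_32 routine. The sketch below is a from-scratch version written for this purpose (the class name Murmur3Reference is hypothetical, and this is not Guava's or Airlift's code). It assumes that both libraries implement the standard algorithm with seed 0, which is consistent with their hashBytes results agreeing in the output above.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical reference implementation for illustration; not taken from Guava or Airlift.
class Murmur3Reference
{
    private static final int C1 = 0xcc9e2d51;
    private static final int C2 = 0x1b873593;

    // Standard MurmurHash3 x86_32 over a byte array.
    static int hash32(byte[] data, int seed)
    {
        int h1 = seed;
        int end = data.length & ~3; // index where the last full 4-byte block ends

        // Body: consume 4-byte little-endian blocks.
        for (int i = 0; i < end; i += 4) {
            int k1 = (data[i] & 0xff)
                    | ((data[i + 1] & 0xff) << 8)
                    | ((data[i + 2] & 0xff) << 16)
                    | ((data[i + 3] & 0xff) << 24);
            h1 ^= mixK1(k1);
            h1 = Integer.rotateLeft(h1, 13);
            h1 = h1 * 5 + 0xe6546b64;
        }

        // Tail: the remaining 1 to 3 bytes, also little-endian.
        int tail = data.length & 3;
        if (tail > 0) {
            int k1 = 0;
            for (int j = tail - 1; j >= 0; j--) {
                k1 = (k1 << 8) | (data[end + j] & 0xff);
            }
            h1 ^= mixK1(k1);
        }

        // Finalization: mix in the length, then avalanche.
        h1 ^= data.length;
        h1 ^= h1 >>> 16;
        h1 *= 0x85ebca6b;
        h1 ^= h1 >>> 13;
        h1 *= 0xc2b2ae35;
        h1 ^= h1 >>> 16;
        return h1;
    }

    private static int mixK1(int k1)
    {
        k1 *= C1;
        k1 = Integer.rotateLeft(k1, 15);
        k1 *= C2;
        return k1;
    }

    public static void main(String[] args)
    {
        // Same string as the failing case; the emoji is written with escapes to avoid source-encoding issues.
        String input = "surrogate pair \uD83D\uDCB0";
        System.out.println(hash32(input.getBytes(StandardCharsets.UTF_8), 0));
    }
}
```

If the seed-0 assumption holds, the printed value should match the airliftValue and guavaValue1 reported above for the UTF-8 surrogate pair case, not guavaValue2.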

findepi commented, Jul 16, 2021

cc @rdblue, as Iceberg uses the allegedly affected code path when bucketing string/varchar data. See https://github.com/trinodb/trino/pull/8104/files#r655190551 for context. Kudos to @wendigo for finding this.
