Incorrect hash result from murmur3_32 with String input containing surrogate pairs
Here is a test program comparing the behavior of:
- Guava’s Hashing.murmur3_32 with hashBytes(str.getBytes(encoding))
- Guava’s Hashing.murmur3_32 with hashString(str, encoding)
- Airlift’s Murmur3Hash32.hash(str.getBytes(encoding)) (https://github.com/airlift/slice/blob/d3a025291fd8d6a062e44f3823ee49196783ab9c/src/main/java/io/airlift/slice/Murmur3Hash32.java)
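import com.google.common.hash.Hashing;
import io.airlift.slice.Murmur3Hash32;
import io.airlift.slice.Slices;

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.List;

public class Murmur
{
    public static void main(String[] args)
    {
        for (String string : List.of("plan ascii", "BMP Piękna łąka w 東京都", "surrogate pair 💰")) {
            for (Charset charset : List.of(StandardCharsets.UTF_16, StandardCharsets.UTF_8)) {
                // Hash the encoded bytes with Airlift and Guava, then let Guava encode the string itself
                int airliftValue = Murmur3Hash32.hash(Slices.wrappedBuffer(string.getBytes(charset)));
                int guavaValue1 = Hashing.murmur3_32().hashBytes(string.getBytes(charset)).asInt();
                int guavaValue2 = Hashing.murmur3_32().hashString(string, charset).asInt();
                // Header line identifying the case, as it appears in the results
                System.out.println("string: " + string + ", charset: " + charset);
                System.out.println("airliftValue = " + airliftValue);
                System.out.println("guavaValue1 = " + guavaValue1);
                System.out.println("guavaValue2 = " + guavaValue2);
                System.out.println();
            }
        }
    }
}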
Results when run with Guava 30.1-jre and Airlift Slice 0.40 follow.
Note that the results agree except for the last case. I expect them to agree in all cases, especially between the two alternative Guava API methods.
string: plan ascii, charset: UTF-16
airliftValue = -731716445
guavaValue1 = -731716445
guavaValue2 = -731716445
string: plan ascii, charset: UTF-8
airliftValue = -218266838
guavaValue1 = -218266838
guavaValue2 = -218266838
string: BMP Piękna łąka w 東京都, charset: UTF-16
airliftValue = -989030725
guavaValue1 = -989030725
guavaValue2 = -989030725
string: BMP Piękna łąka w 東京都, charset: UTF-8
airliftValue = 103331700
guavaValue1 = 103331700
guavaValue2 = 103331700
string: surrogate pair 💰, charset: UTF-16
airliftValue = 2147098392
guavaValue1 = 2147098392
guavaValue2 = 2147098392
string: surrogate pair 💰, charset: UTF-8
airliftValue = -1114908744
guavaValue1 = -1114908744
guavaValue2 = -2027737699
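For context on why only the third string is affected: 💰 is U+1F4B0, which lies outside the BMP, so a Java String stores it as two chars (a surrogate pair), while UTF-8 encodes it as a single 4-byte sequence. The following JDK-only sketch illustrates that; the class name SurrogatePairDemo and the hex helper are hypothetical and not part of the original report, and the sketch does not claim to show where Guava's hashString goes wrong.

```java
import java.nio.charset.StandardCharsets;

public class SurrogatePairDemo
{
    // Formats bytes as space-separated uppercase hex pairs.
    static String hex(byte[] bytes)
    {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args)
    {
        // U+1F4B0 written as its two UTF-16 code units (high and low surrogate)
        String moneyBag = "\uD83D\uDCB0";

        // Number of UTF-16 code units versus number of Unicode code points
        System.out.println("chars       = " + moneyBag.length());
        System.out.println("code points = " + moneyBag.codePointCount(0, moneyBag.length()));

        // The JDK encoder emits one 4-byte UTF-8 sequence for the whole pair
        System.out.println("UTF-8 bytes = " + hex(moneyBag.getBytes(StandardCharsets.UTF_8)));
    }
}
```

Any code that walks a String char by char and encodes to UTF-8 on the fly has to combine the two surrogates into one code point first, which is the path the first two test strings never exercise.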
Top GitHub Comments
Our current plan is to deprecate Hashing.murmur3_32() and introduce a new Hashing.murmur3_32_fixed() that is identical except that it has correct behaviour for UTF-8 strings with non-BMP characters. That way we avoid breaking anyone who was using the current incorrect hash values to form keys to persistent storage.

cc @rdblue as Iceberg uses the allegedly affected code path when bucketing string/varchar data. See https://github.com/trinodb/trino/pull/8104/files#r655190551 for context. Kudos to @wendigo for finding this.
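One way to check which value is the expected one for UTF-8 input, independently of both libraries, is to let the JDK encode the string and hash the resulting bytes with a plain 32-bit MurmurHash3. Below is a minimal sketch of that algorithm with seed 0 assumed; the class and method names (Murmur3Reference, murmur3_32, mixK1) are hypothetical, and this is not Guava's or Airlift's code.

```java
import java.nio.charset.StandardCharsets;

public class Murmur3Reference
{
    private static final int C1 = 0xcc9e2d51;
    private static final int C2 = 0x1b873593;

    // Scrambles one 32-bit block before it is mixed into the running hash.
    static int mixK1(int k1)
    {
        k1 *= C1;
        k1 = Integer.rotateLeft(k1, 15);
        k1 *= C2;
        return k1;
    }

    // 32-bit MurmurHash3 over a byte array, reading blocks as little-endian ints.
    static int murmur3_32(byte[] data, int seed)
    {
        int h1 = seed;
        int blocks = data.length / 4;

        // Body: process all complete 4-byte blocks
        for (int i = 0; i < blocks; i++) {
            int o = i * 4;
            int k1 = (data[o] & 0xFF)
                    | (data[o + 1] & 0xFF) << 8
                    | (data[o + 2] & 0xFF) << 16
                    | (data[o + 3] & 0xFF) << 24;
            h1 ^= mixK1(k1);
            h1 = Integer.rotateLeft(h1, 13);
            h1 = h1 * 5 + 0xe6546b64;
        }

        // Tail: fold in the remaining 1 to 3 bytes, if any
        int tail = blocks * 4;
        int rem = data.length & 3;
        int k1 = 0;
        if (rem >= 3) {
            k1 ^= (data[tail + 2] & 0xFF) << 16;
        }
        if (rem >= 2) {
            k1 ^= (data[tail + 1] & 0xFF) << 8;
        }
        if (rem >= 1) {
            k1 ^= (data[tail] & 0xFF);
            h1 ^= mixK1(k1);
        }

        // Finalization: mix in the length and avalanche the bits
        h1 ^= data.length;
        h1 ^= h1 >>> 16;
        h1 *= 0x85ebca6b;
        h1 ^= h1 >>> 13;
        h1 *= 0xc2b2ae35;
        h1 ^= h1 >>> 16;
        return h1;
    }

    public static void main(String[] args)
    {
        // Third test string from the report, with U+1F4B0 written as a surrogate pair
        String string = "surrogate pair \uD83D\uDCB0";
        byte[] utf8 = string.getBytes(StandardCharsets.UTF_8);
        System.out.println("referenceValue = " + murmur3_32(utf8, 0));
    }
}
```

The printed value can be compared with the UTF-8 figures for the surrogate-pair case in the results to see which of the two Guava code paths it agrees with.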