Use `md5_binary()` instead of `md5()` for `hash` function on Snowflake
Describe the bug
Reportedly, Snowflake cannot properly micro-partition the 32-character hex string generated by `md5()`, and performs significantly better when joins use the binary value produced by `md5_binary()`.
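For context, a minimal illustration of the two Snowflake functions (the input literal is arbitrary): `md5()` yields a hex string, while `md5_binary()` yields the same digest as raw bytes.

```sql
-- md5() returns a 32-character hex-encoded VARCHAR;
-- md5_binary() returns the same 128-bit digest as a 16-byte BINARY value.
select
    md5('some surrogate key')        as hash_string,
    md5_binary('some surrogate key') as hash_binary;
```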
Steps to reproduce
Join a few large tables on surrogate keys in Snowflake.
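A hypothetical reproduction might look like the following; the table and column names are illustrative, not from the original report:

```sql
-- Join two large tables on an md5-based surrogate key (VARCHAR(32) hex strings).
-- Table and column names are made up for illustration.
select
    o.order_sk,
    o.order_date,
    c.customer_name
from analytics.fct_orders as o
join analytics.dim_customers as c
    on o.customer_sk = c.customer_sk;
```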
Expected results
Performance comparable to a join on an integer id, or at the very least as good as a join across multiple columns (for instance, an id and a timestamp).
Actual results
Performs worse than joining on multiple columns, which give Snowflake more information to use for micro-partition pruning.
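One way to dig into this (my suggestion, not part of the original report) is to compare Snowflake's clustering metadata for the string and binary forms of the key; `SYSTEM$CLUSTERING_INFORMATION` is a standard Snowflake function, while the table and column names are assumed:

```sql
-- Inspect how well micro-partitions separate values of each key representation.
select system$clustering_information('analytics.fct_orders', '(customer_sk)');
select system$clustering_information('analytics.fct_orders', '(customer_sk_binary)');
```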
System information
Which database are you using dbt with?
- snowflake
Why not make this optional via an argument? That way new implementations could use the binary key and older models wouldn't break. The old string version could be deprecated at some point by switching the default to binary.
yep yep - i have some ideas around that - i want to test it on a big data set first and make sure it's enough of an improvement to warrant it at all. open to any suggestions on that front if something good comes to mind.
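A minimal sketch of the optional-argument idea discussed above, assuming the Snowflake implementation of the `hash` macro currently compiles to `md5(cast(... as varchar))`; the dispatch name, argument name, and cast here are assumptions rather than the actual dbt code:

```sql
-- Hypothetical Snowflake override of the hash macro with an opt-in binary mode.
-- Default stays on the string form so existing models keep compiling unchanged.
{% macro snowflake__hash(field, binary=false) -%}
    {%- if binary -%}
        md5_binary(cast({{ field }} as varchar))
    {%- else -%}
        md5(cast({{ field }} as varchar))
    {%- endif -%}
{%- endmacro %}
```

In this sketch a model would opt in with something like `{{ hash('customer_id', binary=true) }}`, and the default could later be flipped to binary once the string form is deprecated.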