Use `md5_binary()` instead of `md5()` for `hash` function on Snowflake

Describe the bug

Reportedly, Snowflake cannot properly micro-partition the UUID-like hex string generated by `md5()`, and joins perform significantly better on keys generated with `md5_binary()`.
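To make the difference between the two key types concrete, here is a minimal Python sketch using the standard `hashlib` module. It only illustrates the representations: Snowflake documents `md5()` as returning a 32-character hex-encoded string, and `md5_binary()` returns the same 128-bit digest as raw binary. The input value `key_input` is a made-up example, and the sketch says nothing about how Snowflake partitions or joins either type.

```python
import hashlib

# Hypothetical surrogate-key input (an id and a date joined by a separator).
key_input = "42|2020-12-04"

digest = hashlib.md5(key_input.encode("utf-8"))

# Comparable to md5(): the digest rendered as a 32-character hex string.
hex_key = digest.hexdigest()

# Comparable to md5_binary(): the same digest as 16 raw bytes.
bin_key = digest.digest()

# Both values encode the same 128 bits; only the representation differs.
assert bytes.fromhex(hex_key) == bin_key

print(len(hex_key), len(bin_key))
```

The binary form carries the same information in half the length, which is one plausible reason it could be cheaper to store and compare, though the issue itself does not confirm the cause.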

Steps to reproduce

Join a few large tables on surrogate keys in Snowflake.

Expected results

The join performs on par with a join on an integer id, or at the very least as well as a join across multiple columns (for instance, an id and a timestamp).

Actual results

The join performs worse than a join on multiple columns, which give Snowflake information on partitioning.

System information

Which database are you using dbt with?

  • snowflake

Issue Analytics

  • State: open
  • Created 3 years ago
  • Reactions: 1
  • Comments:7 (3 by maintainers)

Top GitHub Comments

1 reaction
clausherther commented, Dec 5, 2020

Why not make this optional via an argument? That way new implementations could use the binary key and older models wouldn’t break. The old string version could be deprecated at some point by switching the default to binary.
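The opt-in idea in this comment can be sketched roughly as follows. This is an illustration in Python, not the package's actual macro: the helper name `hash_expression`, its `binary` parameter, and the `cast(... as varchar)` wrapper are all assumptions made for the example. The default stays on the string hash so existing models keep producing the same keys.

```python
def hash_expression(field: str, binary: bool = False) -> str:
    """Build a SQL hash expression for a column (hypothetical helper).

    binary=False keeps the existing md5() string key, so older models
    are unaffected; binary=True opts in to md5_binary().
    """
    func = "md5_binary" if binary else "md5"
    # Casting to varchar first is an assumption for this sketch.
    return f"{func}(cast({field} as varchar))"


print(hash_expression("order_id"))
print(hash_expression("order_id", binary=True))
```

Deprecating the string version later would then amount to flipping the default of `binary`, as the comment suggests.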

1 reaction
ghost commented, Dec 4, 2020

yep yep - i have some ideas around that – i want to test it on a big data set first and make sure it’s enough of an improvement to warrant it at all first though. open to any suggestions on that front if something good comes to mind.
