Create zero_pad_timestamp_ms() macro to work around Snowflake and Redshift producing different results in surrogate_key()
Describe the bug
Because of internal mechanics, Redshift and Snowflake can produce different surrogate keys for the same data set. This is very confusing if you are migrating between them or use both databases. The main issue is the casting of timestamp fields to varchar, which Redshift implements in a really strange way. For example:
Input:
select
cast(TIMESTAMP '2021-03-19 17:07:10.123321' as varchar),
cast(TIMESTAMP '2021-03-19 17:07:10.123000' as varchar),
cast(TIMESTAMP '2021-03-19 17:07:10.100000' as varchar),
cast(TIMESTAMP '2021-03-19 17:07:10.000000' as varchar);
Output Redshift:
2021-03-19 17:07:10.123321,
2021-03-19 17:07:10.123,
2021-03-19 17:07:10.10,
2021-03-19 17:07:10
Output Snowflake:
2021-03-19 17:07:10.123321,
2021-03-19 17:07:10.123000,
2021-03-19 17:07:10.100000,
2021-03-19 17:07:10.000000
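To make the consequence concrete, here is a minimal Python sketch of why the two outputs above lead to different keys. It assumes that a surrogate key is an MD5 hash of the stringified fields joined with a separator; the `surrogate_key` function and the `"42"` user id below are hypothetical stand-ins for illustration, not the dbt_utils implementation itself.

```python
import hashlib


def surrogate_key(*fields: str) -> str:
    """Hypothetical stand-in: MD5 over already-stringified fields joined with '-'."""
    return hashlib.md5("-".join(fields).encode("utf-8")).hexdigest()


# The same instant, as each warehouse renders it when cast to varchar
# (values copied from the outputs above).
redshift_text = "2021-03-19 17:07:10.123"
snowflake_text = "2021-03-19 17:07:10.123000"

redshift_key = surrogate_key("42", redshift_text)
snowflake_key = surrogate_key("42", snowflake_text)

# The hashed strings differ, so the keys differ as well.
print(redshift_key == snowflake_key)  # False
```

Any difference in the text representation, even trailing zeros, changes the hash, which is why the keys cannot be compared across the two warehouses.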
Steps To Reproduce
Just use dbt_utils.surrogate_key (https://github.com/dbt-labs/dbt-utils#surrogate_key-source) on any value with a timestamp.
Expected behavior
I think it would be better if surrogate_key produced the same output with any database adapter, independent of the database engine.
System information
Which database are you using dbt with?
- [ ] postgres
- [x] redshift
- [ ] bigquery
- [x] snowflake
- [ ] other (specify: ____________)
Additional context
I’m not sure whether this issue belongs in the bug section or the feature section. I’m also not sure whether it affects many people, so maybe it is not so necessary to fix. However, if you do run into it, it takes an enormous amount of effort to fix.
P.S. I could fix the issue in the Redshift adapter so that it produces the same expected output. Just point me to where to start.
I’ve removed the 1.0 label, because this can be handled at any time without causing backwards-compatibility issues. The new macro zero_pad_timestamp_ms() can be added to any new surrogate key implementation when people know that cross-database compatibility is important to them.
Old:
{{ dbt_utils.surrogate_key(['user_id', 'account_created_at']) }}
New:
{{ dbt_utils.surrogate_key(['user_id', zero_pad_timestamp_ms('account_created_at')]) }}
My only question now is whether this new macro belongs in dbt_utils, or whether we punt it back to the adapters to conform to an expected implementation defined in Core. @jtcohen6 and @dbeatty10, what say you?
it should still return a string, so I guess it can just be
cast({{ column }} as {{ type_string() }})
but yeah basically nothing!
Edit: it doesn’t need to for surrogate_key, which already stringifies things, but I think that other hypothetical consumers should be able to rely on a known datatype coming back
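As a rough illustration of what the padding would have to achieve on Redshift, here is a Python sketch that right-pads the fractional seconds with zeros. The function name `pad_fractional_seconds` and the default width of six digits are assumptions taken from the Snowflake output in the issue, not from any existing macro; the real fix would live in SQL/Jinja inside the adapter or package.

```python
def pad_fractional_seconds(text: str, digits: int = 6) -> str:
    """Right-pad the fractional-seconds part of a timestamp string with zeros.

    Hypothetical helper: `digits=6` is assumed from the Snowflake output.
    """
    whole, _, fraction = text.partition(".")
    return f"{whole}.{fraction.ljust(digits, '0')}"


# Redshift and Snowflake renderings reported in the issue.
redshift_values = [
    "2021-03-19 17:07:10.123321",
    "2021-03-19 17:07:10.123",
    "2021-03-19 17:07:10.10",
    "2021-03-19 17:07:10",
]
snowflake_values = [
    "2021-03-19 17:07:10.123321",
    "2021-03-19 17:07:10.123000",
    "2021-03-19 17:07:10.100000",
    "2021-03-19 17:07:10.000000",
]

padded = [pad_fractional_seconds(value) for value in redshift_values]
print(padded == snowflake_values)  # True
```

Once both warehouses feed the same string into the hash, the resulting surrogate keys match. On Snowflake the equivalent step would be a plain cast to string, as suggested in the comment above.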