Replace information_schema queries with faster alternatives on Snowflake

Describe the feature

When dbt starts, it queries the information_schema once for every schema in the project. This happens even if the run involves a single model (and therefore a single schema).

Each of these queries is taking anywhere from 4-20 seconds, presumably depending on how much load the overall Snowflake system has across accounts.

These queries seem to be running on the main thread and are therefore sequential. We have a project with 9 schemas with a time-to-first-model of close to 90 seconds. As you can imagine, this is a huge productivity drag.

We are contacting Snowflake about speeding up information_schema queries, but this could also be improved if dbt ran these queries in multiple threads and only ran queries for the schemas involved in the given run.
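To illustrate the two ideas above (query only the schemas a run touches, and issue those queries concurrently), here is a minimal Python sketch. It is not dbt's implementation: `list_relations_parallel`, `fake_list_relations`, and the `max_workers` parameter are hypothetical names chosen for this example, and the per-schema query is passed in as a plain callable so the sketch runs without a Snowflake connection.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List


def list_relations_parallel(
    schemas: Iterable[str],
    list_relations: Callable[[str], List[str]],
    max_workers: int = 4,
) -> Dict[str, List[str]]:
    """Run one listing query per distinct schema, using a thread pool.

    `schemas` should contain only the schemas used by the selected models.
    `list_relations` is a hypothetical callable that runs the listing
    query for a single schema and returns the relation names it found.
    """
    unique = sorted(set(schemas))  # never query the same schema twice
    if not unique:
        return {}
    workers = min(max_workers, len(unique))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so zip pairs each schema
        # with its own result.
        results = list(pool.map(list_relations, unique))
    return dict(zip(unique, results))


def fake_list_relations(schema: str) -> List[str]:
    """Stand-in for a real query; returns one made-up relation name."""
    return [schema + ".example_model"]


relations = list_relations_parallel(
    ["dbt_pedro_sol_matching", "dbt_pedro_sol_matching"],
    fake_list_relations,
)
print(relations)
```

With this shape the wall-clock cost approaches that of the slowest single query rather than the sum of all of them, although a single 12-second query would still dominate.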

Also, I believe the show tables or show views commands could be used in this particular case instead of queries to the information schema (these take on the order of 100-200 ms).

Below is one of these queries which took over 12 seconds:

2019-10-29 12:00:19,554 (MainThread): Acquiring new snowflake connection "list_relations_without_caching".
2019-10-29 12:00:19,554 (MainThread): Re-using an available connection from the pool.
2019-10-29 12:00:19,554 (MainThread): Using snowflake connection "list_relations_without_caching".
2019-10-29 12:00:19,554 (MainThread): On list_relations_without_caching: BEGIN
2019-10-29 12:00:20,197 (MainThread): SQL status: SUCCESS 1 in 0.64 seconds
2019-10-29 12:00:20,197 (MainThread): Using snowflake connection "list_relations_without_caching".
2019-10-29 12:00:20,197 (MainThread): On list_relations_without_caching: select
      table_catalog as database,
      table_name as name,
      table_schema as schema,
      case when table_type = 'BASE TABLE' then 'table'
           when table_type = 'VIEW' then 'view'
           when table_type = 'MATERIALIZED VIEW' then 'materializedview'
           else table_type
      end as table_type
    from aibi_analytics_db.information_schema.tables
    where table_schema ilike 'dbt_pedro_sol_matching'
      and table_catalog ilike 'aibi_analytics_db'
2019-10-29 12:00:32,862 (MainThread): SQL status: SUCCESS 19 in 12.66 seconds

Describe alternatives you’ve considered

I inquired whether a macro could be used to override the information schema queries but was told it’s not possible.

Additional context

Snowflake

Who will this benefit?

This will speed up time-to-first-model for Snowflake projects with multiple schemas.

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Reactions: 8
  • Comments: 17 (9 by maintainers)

Top GitHub Comments

1 reaction
pedromachados commented, Nov 1, 2019

@drewbanin Thanks for looking into this.

What if you run show schemas in database <database> first and then do the case-insensitive search in Python to find the correct capitalization of the schema name? Then you can use it to run show tables in schema <database>.<schema> with the correct capitalization.
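A rough sketch of that idea in Python follows. It assumes the schema names have already been fetched with show schemas in database <database> and are passed in as a plain list, and that the database name is already correctly capitalized. The helper names (`quote_identifier`, `resolve_schema_name`, `show_tables_sql`) are made up for this example and are not part of dbt.

```python
from typing import List, Optional


def quote_identifier(name: str) -> str:
    """Double-quote a Snowflake identifier, doubling any embedded quotes."""
    return '"' + name.replace('"', '""') + '"'


def resolve_schema_name(wanted: str, existing: List[str]) -> Optional[str]:
    """Find the stored capitalization of `wanted` among `existing` names.

    Returns None when no schema matches. An exact match wins; otherwise,
    several names differing only by case are treated as ambiguous.
    """
    matches = [name for name in existing if name.lower() == wanted.lower()]
    if wanted in matches:
        return wanted
    if len(matches) > 1:
        raise ValueError("ambiguous schema name: " + ", ".join(matches))
    return matches[0] if matches else None


def show_tables_sql(database: str, schema: str) -> str:
    """Build a show tables statement with both identifiers quoted."""
    return "show tables in schema {}.{}".format(
        quote_identifier(database), quote_identifier(schema)
    )


# Names as they might come back from show schemas (illustrative values).
schemas = ["INFORMATION_SCHEMA", "PUBLIC", "DBT_PEDRO_SOL_MATCHING"]
actual = resolve_schema_name("dbt_pedro_sol_matching", schemas)
if actual is not None:
    print(show_tables_sql("AIBI_ANALYTICS_DB", actual))
```

Quoting both identifiers means the statement refers to the exact stored name, which is why the capitalization has to be resolved first.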

If you give me some pointers on where to go in the code I could take a stab at this and create a PR over the next couple of weeks.

1 reaction
drewbanin commented, Nov 1, 2019

Hey @pedromachados - I’m not sure that we’ll want to start with parallelizing these queries - I’d be much more in favor of using show schemas, show tables, etc etc in lieu of information_schema queries! Even if we did parallelize these, if one of them takes 20 seconds to complete, that’s still too slow for us to work with IMO.

I think we discussed this on Slack, but there are some real challenges we’d need to account for in using show... instead of select .. from information_schema.<table>.

For one, the show ... queries only return a maximum of something like 10k objects. If we tried to run show tables in database ..., there’s a super real chance that we’d hit this maximum in even moderately sized warehouses! So, we’d need to use show tables in schema <database>.<schema> which is also challenging because we’d need to quote these identifiers exactly correctly. This is super doable for dbt, but quoting on Snowflake is always a big pain!

For two, show columns returns different data than the results returned from the information schema. This might be tractable for us, but it’s a big change for us to make!

I’m super keen to make this change - going to queue it up for a future release.
