expression_is_true is costly when applied to a large table

Describe the bug

When running an expression_is_true test, I noticed that the test processed ~500 GB of data (on BigQuery), which in my opinion is extremely costly for a simple test.

Because of the way the test is set up (SELECT * in the last statement), the total cost of the test is the same as running SELECT * FROM TABLE_THAT_WE_TEST, which can be quite expensive for long and wide tables.
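
To make the cost argument concrete, here is a rough sketch of the arithmetic, assuming on-demand billing that charges for the bytes in every column a query references. The table layout and column sizes below are hypothetical and chosen only so that the total matches the ~500 GB mentioned above; the expression `a + b = c` is borrowed from the example later in this thread.

```python
# Hypothetical per-column storage sizes in bytes (illustrative numbers only,
# not measurements from the table in this issue).
column_bytes = {
    "id": 8 * 10**9,
    "a": 8 * 10**9,
    "b": 8 * 10**9,
    "c": 8 * 10**9,
    "payload": 468 * 10**9,  # one wide column dominates the table size
}


def scanned_bytes(referenced_columns):
    """Sum the sizes of the columns a query references."""
    return sum(column_bytes[name] for name in referenced_columns)


# SELECT * references every column in the table.
select_star = scanned_bytes(column_bytes)

# A test that only evaluates `a + b = c` needs just those three columns.
expression_only = scanned_bytes(["a", "b", "c"])

print(f"SELECT *: {select_star / 10**9:.0f} GB")
print(f"expression columns only: {expression_only / 10**9:.0f} GB")
```

Under these assumed sizes, the gap between the two numbers comes entirely from the columns the test never needed to read.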

Steps to reproduce

  1. Create a large table.
  2. Run SELECT * FROM LARGE_TABLE on BigQuery and check the cost.
  3. Run an expression_is_true test against LARGE_TABLE and see that it is as costly as the SELECT *.

Expected results

I expect a simple test to be really cheap. I do not want to have to take the cost of a simple column test into account when developing.

Actual results

It's expensive.

System information

Which database are you using dbt with?

  • [ ] postgres
  • [ ] redshift
  • [x] bigquery
  • [ ] snowflake
  • [ ] other (specify: ____________)

Are you interested in contributing the fix?

Sure!

expression_is_true.sql:


{% test expression_is_true(model, expression, column_name=None, condition='1=1') %}
{# T-SQL has no boolean data type so we use 1=1 which returns TRUE #}
{# ref https://stackoverflow.com/a/7170753/3842610 #}
  {{ return(adapter.dispatch('test_expression_is_true', 'dbt_utils')(model, expression, column_name, condition)) }}
{% endtest %}

{% macro default__test_expression_is_true(model, expression, column_name, condition) %}

with meet_condition as (
    select * from {{ model }} where {{ condition }}
)

select
    1 -- Change the * to a fixed single column, there might be some relevant info you could pass here for debugging, but just not ALL columns.
from meet_condition
{% if column_name is none %}
where not({{ expression }})
{%- else %}
where not({{ column_name }} {{ expression }})
{%- endif %}

{% endmacro %}
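
As a sanity check on the proposed change, the sketch below compares the two select lists on a tiny in-memory table. It assumes that the test outcome depends only on how many rows the query returns, so swapping `*` for `1` should not change the result. It uses Python's sqlite3 purely for illustration (not BigQuery or dbt), and the table name, columns, and data are hypothetical.

```python
import sqlite3

# Tiny in-memory stand-in for the model under test (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.execute("create table my_model (a integer, b integer, c integer, payload text)")
conn.executemany(
    "insert into my_model values (?, ?, ?, ?)",
    [(1, 2, 3, "ok"), (2, 2, 5, "bad"), (0, 0, 0, "ok"), (4, 1, 9, "bad")],
)


def failing_rows(select_list):
    """Run the rendered shape of the test query with the given select list."""
    sql = f"""
        with meet_condition as (
            select * from my_model where 1=1
        )
        select {select_list}
        from meet_condition
        where not(a + b = c)
    """
    return conn.execute(sql).fetchall()


star_failures = failing_rows("*")      # original behaviour: every column
constant_failures = failing_rows("1")  # proposed behaviour: a constant

print(len(star_failures), len(constant_failures))
```

Both variants return one row per violation of the expression; only the width of each returned row differs, which is what drives the scan cost on a columnar warehouse.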

Issue Analytics

  • State: closed
  • Created a year ago
  • Reactions: 1
  • Comments: 5 (4 by maintainers)

Top GitHub Comments

joellabes commented, Sep 28, 2022 (1 reaction)

"adding the ability to specify a set of columns to extract as well"

@elyobo my gut reaction here is that this would be something that would make sense to define across multiple tests. So I’d recommend you open an issue in dbt-core. Something like this would be cool:

models:
  - name: my_model
    tests:
      - dbt_utils.expression_is_true:
          expression: a + b = c
          debugging_columns: [a, b, c, d] # horrible name do not use this
    columns:
      - name: id
        tests:
          - not_null:
              debugging_columns: [id, e, f] # See that this is available to any test

And then we’d have something like

{% macro default__get_debugging_columns(expression) %}
    {% if should_store_failures() and debugging_columns is not none %}
      {{ debugging_columns | join(", ") }}
    {% else %}
      * {# Should we default to pulling everything out where that access isn't billed? If not, then we don't need a second BQ-specific version of this #}
    {% endif %}
{% endmacro %}

{% macro bigquery__get_debugging_columns(expression) %}
    {% if should_store_failures() %}
      {% if debugging_columns is not none %}
        {{ debugging_columns | join(", ") }}
      {% else %}
        *
      {% endif %}
    {% else %}
      {{ expression }} {#or maybe just `1` 🤷 #}
    {% endif %}
{% endmacro %}

basdunn commented, Sep 21, 2022 (1 reaction)

BTW, I haven't checked whether the same thing happens in other tests. It might be worth checking.
