expression_is_true is costly when applied to a large table

Describe the bug

When running an expression_is_true test, I noticed that it processed ~500 GB of data (on BigQuery), which in my opinion is extremely costly for a simple test.

Because of the way the test is set up (SELECT * in the last statement), the total cost of the test is the same as running SELECT * FROM TABLE_THAT_WE_TEST, which we know can be quite expensive for long and wide tables.
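To make the scale concrete, here is a rough back-of-the-envelope sketch in Python. It assumes on-demand billing, where a query is charged for the bytes in the columns it reads. The column names, per-row sizes, and row count are hypothetical and only chosen to land near the ~500 GB figure; they are not taken from the actual table.

```python
# Hypothetical per-row column sizes in bytes for a wide table (illustrative only).
COLUMN_BYTES = {"id": 8, "a": 8, "b": 8, "c": 8, "payload": 2000, "notes": 500}
ROW_COUNT = 200_000_000  # hypothetical row count


def bytes_scanned(columns, row_count=ROW_COUNT, sizes=COLUMN_BYTES):
    """Approximate bytes read when only `columns` are referenced by the query."""
    return row_count * sum(sizes[name] for name in columns)


select_star = bytes_scanned(COLUMN_BYTES)      # SELECT * reads every column
needed_only = bytes_scanned(["a", "b", "c"])   # only the columns in the expression

print(f"SELECT *     : {select_star / 1e9:,.1f} GB")
print(f"a, b, c only : {needed_only / 1e9:,.1f} GB")
```

Under these assumptions, almost all of the scanned bytes come from columns the test expression never looks at.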

Steps to reproduce

Create a large table.
Run SELECT * FROM LARGE_TABLE on BigQuery and check the cost.
Run an expression_is_true test against LARGE_TABLE and see that it costs the same as the SELECT *.
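One way to check the cost without paying for the query is a BigQuery dry run. The sketch below assumes the google-cloud-bigquery Python client is installed and that default credentials and a default project are configured; `estimate_bytes` is a hypothetical helper name, not part of dbt or dbt_utils.

```python
def estimate_bytes(sql: str) -> int:
    """Return the bytes BigQuery estimates the query would process, without running it."""
    # Imported lazily so the function can be defined even where the
    # client library is not installed; calling it still requires the library.
    from google.cloud import bigquery

    client = bigquery.Client()  # assumes default project and credentials
    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=job_config)  # dry run: nothing is executed
    return job.total_bytes_processed
```

Calling it once with the plain SELECT * and once with the compiled SQL of the test (found under the project's target directory after a run) should show whether the two estimates match.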

Expected results

I expect a simple test to be really cheap. I do not want to take into account the cost of a simple column test when developing.

Actual results

It's expensive.

Screenshots and log output

If applicable, add screenshots or log output to help explain your problem.

System information

The contents of your packages.yml file:

Which database are you using dbt with?

[ ] postgres
[ ] redshift
[x] bigquery
[ ] snowflake
[ ] other (specify: ____________)

The output of dbt --version:

<output goes here>

Additional context

Add any other context about the problem here. For example, if you think you know which line of code is causing the issue.

Are you interested in contributing the fix?

Sure!

expression_is_true.sql:


{% test expression_is_true(model, expression, column_name=None, condition='1=1') %}
{# T-SQL has no boolean data type so we use 1=1 which returns TRUE #}
{# ref https://stackoverflow.com/a/7170753/3842610 #}
  {{ return(adapter.dispatch('test_expression_is_true', 'dbt_utils')(model, expression, column_name, condition)) }}
{% endtest %}

{% macro default__test_expression_is_true(model, expression, column_name, condition) %}

with meet_condition as (
    select * from {{ model }} where {{ condition }}
)

select
    1 -- Change the * to a fixed single column, there might be some relevant info you could pass here for debugging, but just not ALL columns.
from meet_condition
{% if column_name is none %}
where not({{ expression }})
{%- else %}
where not({{ column_name }} {{ expression }})
{%- endif %}

{% endmacro %}
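A dbt test passes when its query returns zero rows, so the select list should not affect the outcome. The following Python sketch checks that on a toy table, with SQLite standing in for the warehouse. The table contents and the `a + b = c` expression are hypothetical, and SQLite says nothing about BigQuery cost; it only shows that `select 1` and `select *` return the same number of failing rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table my_model (id integer, a integer, b integer, c integer)")
conn.executemany(
    "insert into my_model values (?, ?, ?, ?)",
    [(1, 1, 2, 3), (2, 2, 2, 5), (3, 4, 4, 8), (4, 0, 1, 2)],
)

# Same shape as the macro above, with model, condition and expression filled in.
TEMPLATE = """
with meet_condition as (
    select * from my_model where 1=1
)
select {select_list}
from meet_condition
where not(a + b = c)
"""

failures_star = conn.execute(TEMPLATE.format(select_list="*")).fetchall()
failures_one = conn.execute(TEMPLATE.format(select_list="1")).fetchall()

# Both variants flag the same rows; only the returned columns differ.
print(len(failures_star), len(failures_one))
```

What is lost with `select 1` is the row content, which only matters when failing rows are inspected or stored.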

Top GitHub Comments

joellabes commented, Sep 28, 2022

> adding the ability to specify a set of columns to extract as well

@elyobo my gut reaction here is that this would be something that would make sense to define across multiple tests. So I’d recommend you open an issue in dbt-core. Something like this would be cool:

models:
  - name: my_model
    tests:
      - dbt_utils.expression_is_true:
          expression: a + b = c
          debugging_columns: [a, b, c, d] # horrible name, do not use this
    columns:
      - name: id
        tests:
          - not_null:
              debugging_columns: [id, e, f] # see that this is available to any test

And then we'd have something like:

{% macro default__get_debugging_columns(expression, debugging_columns=none) %}
    {% if should_store_failures() and debugging_columns is not none %}
      {{ debugging_columns | join(", ") }}
    {% else %}
      * {# Should we default to pulling everything out where that access isn't billed? If not, then we don't need a second BQ-specific version of this #}
    {% endif %}
{% endmacro %}

{% macro bigquery__get_debugging_columns(expression, debugging_columns=none) %}
    {% if should_store_failures() %}
      {% if debugging_columns is not none %}
        {{ debugging_columns | join(", ") }}
      {% else %}
        *
      {% endif %}
    {% else %}
      {{ expression }} {# or maybe just `1` 🤷 #}
    {% endif %}
{% endmacro %}
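The branching in the two macros can be restated as one plain-Python function, which makes the four cases easier to compare. This is only a sketch of the proposal above: `get_debugging_columns`, `store_failures`, and `billed_per_column` are hypothetical names, with `billed_per_column=True` standing for the BigQuery variant.

```python
from typing import Optional, Sequence


def get_debugging_columns(
    expression: str,
    debugging_columns: Optional[Sequence[str]],
    store_failures: bool,
    billed_per_column: bool,
) -> str:
    """Return the select list a test query would use (hypothetical helper)."""
    if store_failures:
        # Failing rows will be kept, so return something useful for debugging.
        if debugging_columns is not None:
            return ", ".join(debugging_columns)
        return "*"
    # Failures are not stored, so nobody will inspect the returned columns.
    if billed_per_column:
        return expression  # BigQuery variant: read only what the expression needs
    return "*"  # default variant: narrowing the select list saves nothing


print(get_debugging_columns("a + b = c", ["a", "b", "c", "d"], True, True))
print(get_debugging_columns("a + b = c", None, False, True))
```

The only case where the two variants differ is when failures are not stored, which is exactly the case the issue complains about.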

basdunn commented, Sep 21, 2022

BTW, I haven't checked whether the same thing happens in other tests. It might be valuable to check.