Add test groups

Describe the feature

Create a feature to specify “test groups” as a shorthand for declaring several related tests on a model at once. Currently, Fishtown best practices recommend that you specify unique and not_null tests on the primary key of each of your models; these tests are logically related. This feature would allow you to specify just a primary key test group on a single field, and dbt would automatically generate and compile both a unique and a not_null test for that model on a dbt test invocation. This could be extended to other logically grouped tests on single fields.

Describe alternatives you’ve considered

Alternatives include explicitly calling each test, as it works today! This is still absolutely a viable approach, and has the benefit of forcing analysts to explicitly declare their assumptions about their data.

Additional context

Example:

current state:

version: 2

models:
    - name: cool_data_model
      columns:
          - name: id
            tests:
                - unique
                - not_null

proposed:

version: 2

models:
    - name: cool_data_model
      primary_key: id

Obviously the example is up for debate. It might be worth keeping column-level definitions in there, and maybe something that explicitly says the word test, but generally a single specification could make the definition of these tests more concise.
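To make the proposal concrete, here is a minimal Python sketch of the expansion step, assuming the schema file has already been parsed into a plain dict. The `primary_key` key and the `expand_primary_key` helper are hypothetical; nothing here is an existing dbt API.

```python
import copy


def expand_primary_key(schema: dict) -> dict:
    """Return a copy of a parsed schema dict in which every model-level
    `primary_key` shorthand (hypothetical) becomes explicit column tests."""
    expanded = copy.deepcopy(schema)
    for model in expanded.get("models", []):
        key = model.pop("primary_key", None)
        if key is None:
            continue
        columns = model.setdefault("columns", [])
        # Reuse the column entry if it is already declared, else create it.
        column = next((c for c in columns if c.get("name") == key), None)
        if column is None:
            column = {"name": key}
            columns.append(column)
        tests = column.setdefault("tests", [])
        for test in ("unique", "not_null"):
            if test not in tests:
                tests.append(test)
    return expanded


proposed = {
    "version": 2,
    "models": [{"name": "cool_data_model", "primary_key": "id"}],
}
current = expand_primary_key(proposed)
```

Under these assumptions, `current` has the same shape as the "current state" example, and the input dict is left untouched.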

Who will this benefit?

This is primarily for analytics engineers to clean up the definitions in a schema.yml file.

Are you interested in contributing this feature?

For sure!

Issue Analytics

State:
Created 3 years ago
Reactions: 1
Comments: 5 (3 by maintainers)

Top GitHub Comments

jtcohen6 commented, Nov 12, 2020

Neat idea @dave-connors-3! What you’re proposing sounds along the lines of a built-in YAML anchor, i.e.

models:
  - name: cool_data_model
    columns:
      - name: id
        tests: &primary_key
          - unique
          - not_null
  - name: my_other_model
    columns:
      - name: id
        tests: *primary_key

Without, of course, having to define the initial &primary_key in each file.

I heard a related idea the other day, with a slightly different angle, and I’m now convinced that the two might be connected.

Right now, dbt runs separate queries for each of unique, not_null, etc, even if they’re all defined on the same column in the same table. What if there was a way for dbt to consolidate those tests into a single query?

Essentially, I’m thinking of a genuinely different custom schema test that acts as a “combo” of existing builtin tests:

models:
  - name: cool_data_model
    columns:
      - name: id
        tests:
          - primary_key

Then:

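{% macro test_primary_key(model, column_name) %}

    -- count rows per key value; all nulls fall into a single group
    with potential_dupes as (

        select
            {{ column_name }},
            count(*) as num_rows

        from {{ model }}
        group by 1

    )

    -- number of offending rows; coalesce so an empty result yields 0, not null
    select coalesce(sum(num_rows), 0)
    from potential_dupes
    where {{ column_name }} is null   -- not_null
      or num_rows > 1                 -- unique

{% endmacro %}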

Hypothetically, it would be more efficient in terms of database time and resources. The downside is ambiguity: should the test fail, it could have been because the column was null, or not unique, or null and not unique.
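The ambiguity is easy to see by running the same query shape outside dbt. The sketch below uses Python's built-in sqlite3 module with made-up tables (`clean`, `has_null`, `has_dupes`). It hard-codes the `id` column, groups by name rather than position, and wraps the sum in coalesce so a clean table reports 0 instead of NULL; those are choices made for this illustration.

```python
import sqlite3

# Same shape as the combined test: one pass, one number back.
COMBINED_TEST = """
with potential_dupes as (
    select id, count(*) as num_rows
    from {table}
    group by id
)
select coalesce(sum(num_rows), 0)
from potential_dupes
where id is null      -- not_null
   or num_rows > 1    -- unique
"""


def failing_rows(conn: sqlite3.Connection, table: str) -> int:
    """Run the combined test against one table and return the offending row count."""
    return conn.execute(COMBINED_TEST.format(table=table)).fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    create table clean (id integer);
    insert into clean values (1), (2), (3);
    create table has_null (id integer);
    insert into has_null values (1), (null), (null);
    create table has_dupes (id integer);
    insert into has_dupes values (1), (2), (2);
    """
)

results = {t: failing_rows(conn, t) for t in ("clean", "has_null", "has_dupes")}
print(results)  # {'clean': 0, 'has_null': 2, 'has_dupes': 2}
```

Both failing tables report 2, so the single number says that something is wrong but not which of the two assumptions was violated.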

What do you think?

snajjar commented, Oct 19, 2022

I’m facing the same problem currently. I’d like to add tests to my repo to enforce (for instance) adherence to the dbt style guide, but what I end up doing is adding a lot of tests to EVERY model in dbt (100+ and counting).

Since those are the same tests every time, but have to be invoked from different model yml files, I couldn’t find a solution with YAML anchors. I don’t see a better solution than having the option of defining a test group (@noel I’ll be really explicit on the naming!).
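One way to picture such a group, sketched in Python: define the group once for the whole project and expand it wherever a column's test list names it, so no per-file anchor is needed. `TEST_GROUPS` and `expand_test_groups` are hypothetical names for illustration only, not dbt features.

```python
# Hypothetical project-wide group definitions, declared once.
TEST_GROUPS = {"primary_key": ["unique", "not_null"]}


def expand_test_groups(tests, groups=TEST_GROUPS):
    """Replace group names in a column's test list with their member tests."""
    expanded = []
    for test in tests:
        # Tests that carry arguments are dicts; only plain strings can name a group.
        members = groups.get(test, [test]) if isinstance(test, str) else [test]
        for member in members:
            if member not in expanded:
                expanded.append(member)
    return expanded


# The same group name can be used from any number of model files.
orders_id_tests = expand_test_groups(["primary_key"])
users_id_tests = expand_test_groups(
    ["primary_key", {"accepted_values": {"values": [1, 2]}}]
)
```

Each model file then only names the group, and the member tests stay explicit in one place.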