Skip yaml validation for files that are completely unchanged

Thanks to https://github.com/dbt-labs/dbt/pull/3460, when dbt reads yaml files, it also performs minimal validation and stores their intermediate representation in dictionary form. This enables us to:

Return good error messages early & often
Better separate file loading from manifest construction
In partial parsing: only re-parse nodes affected by specific lines changed in yaml files, rather than re-parsing all resources implicated in the entire yaml file

At the same time, partial parsing runs need to read and validate every yaml file in the project, every time. This should still be really fast, even for big projects, so long as the user has libyaml installed (C bindings for PyYAML). I’m going to make sure we document this.

Would it be possible to have the best of both worlds, by storing a hash of the yaml file, and skip re-reading / validating it entirely if it’s completely unchanged?
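A minimal sketch of the idea proposed above, not dbt's actual implementation: keep a checksum next to each file's parsed result, and only call the (expensive) parse-and-validate step when the checksum differs. The names `checksum`, `load_with_cache`, and the `parse` callback are hypothetical and exist only for illustration. Note that computing a content hash still requires reading the file's bytes, so this sketch skips parsing and validation rather than the read itself.

```python
import hashlib
from typing import Any, Callable, Dict, Tuple

# Hypothetical cache layout: file path -> (checksum, parsed representation).
Cache = Dict[str, Tuple[str, Any]]


def checksum(contents: bytes) -> str:
    """Return a SHA-256 hex digest of the raw file contents."""
    return hashlib.sha256(contents).hexdigest()


def load_with_cache(
    path: str,
    contents: bytes,
    cache: Cache,
    parse: Callable[[bytes], Any],
) -> Tuple[Any, bool]:
    """Return (parsed, cache_hit).

    `parse` stands in for whatever parse-and-validate step is expensive;
    it is only called when the file's checksum has changed.
    """
    digest = checksum(contents)
    cached = cache.get(path)
    if cached is not None and cached[0] == digest:
        # Unchanged file: reuse the stored representation.
        return cached[1], True
    parsed = parse(contents)
    cache[path] = (digest, parsed)
    return parsed, False
```

In a real project the cache would have to be persisted between runs (the comments below suggest partial_parse.msgpack), and `parse` would wrap the YAML loader plus validation.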

Issue Analytics

Created 2 years ago
Comments: 7 (6 by maintainers)

Top GitHub Comments

drewbanin commented, Aug 31, 2021

big +1 here - totally non-scientific but I also dug into the partial parsing perf data and found this same thing - dbt is doing a lot of (slow) work to determine that two .yml files are identical 😃

jtcohen6 commented, Aug 29, 2021

I spent some time today diving into the raw load-time data a bit more, and I’m seeing that in parsing steps where partial parsing “works” (parsed_path_count < path_count), the read_files step comprises a huge portion of the total load time, between 80-90%. File reading by itself can take 2-7 ms per file. These numbers appear to hold across both dbt Cloud and non-Cloud deployment/development environments.

In very large projects, this feels like the big thing standing between our pretty-darn-good v0.20 numbers (3-9 ms/path with partial parsing), versus our end-of-2021 goal of 1 ms / file, <5 s total in a 5k file project.
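To make the gap concrete, here is a back-of-the-envelope calculation using only the figures quoted above (2-7 ms per file read, 3-9 ms per path with partial parsing, and the 1 ms per file goal) applied to a 5k file project. The function name `total_seconds` is just for illustration.

```python
FILES = 5_000  # project size used in the stated goal


def total_seconds(ms_per_file: int, files: int = FILES) -> float:
    """Total wall time in seconds at a fixed per-file cost in milliseconds."""
    return ms_per_file * files / 1000


# File reading alone, at 2-7 ms per file
read_low, read_high = total_seconds(2), total_seconds(7)      # 10.0 s, 35.0 s

# v0.20 partial parsing, at 3-9 ms per path
parse_low, parse_high = total_seconds(3), total_seconds(9)    # 15.0 s, 45.0 s

# End-of-2021 goal, at 1 ms per file
goal = total_seconds(1)                                       # 5.0 s

print(f"read_files only: {read_low}-{read_high} s")
print(f"v0.20 partial parsing: {parse_low}-{parse_high} s")
print(f"goal: {goal} s")
```

At these rates, reading files alone would take roughly 10 to 35 seconds in a 5k file project, which is already well over the 5 second target.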

It feels like some reasonable next steps could be:

Add tracking for whether users have the optimized libyaml installed, to validate whether this is a crucial differentiator. Ensure that dbt Cloud is using the optimized C yaml bindings (cc @leahwicz)
Following the proposal in this issue, store file hashes in partial_parse.msgpack and use that as a shortcut for file reading, even if it means some light refactoring and separation/duplication of validation steps
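For the first step, one possible way to detect the optimized bindings: PyYAML only exposes the C-backed loader classes (such as `CSafeLoader`) when it was built against libyaml, so their presence can serve as the signal. This is a sketch; the function name `yaml_binding_info` and the shape of the returned dictionary are assumptions, not anything dbt defines.

```python
from typing import Dict


def yaml_binding_info() -> Dict[str, bool]:
    """Report whether PyYAML is importable and whether its C bindings exist."""
    try:
        import yaml
    except ImportError:
        # PyYAML is not installed at all.
        return {"pyyaml": False, "libyaml": False}
    # CSafeLoader is only defined when PyYAML was compiled with libyaml.
    return {"pyyaml": True, "libyaml": hasattr(yaml, "CSafeLoader")}


if __name__ == "__main__":
    info = yaml_binding_info()
    if info["libyaml"]:
        print("PyYAML is using the libyaml C bindings")
    elif info["pyyaml"]:
        print("PyYAML is using the pure-Python loader")
    else:
        print("PyYAML is not installed")
```

When the C bindings are available, passing `Loader=yaml.CSafeLoader` to `yaml.load` selects them explicitly.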