Skip yaml validation for files that are completely unchanged
Thanks to https://github.com/dbt-labs/dbt/pull/3460, when dbt reads yaml files, it also performs minimal validation and stores their intermediate representation in dictionary form. This enables us to:
- Return good error messages early & often
- Better separate file loading from manifest construction
- In partial parsing: only re-parse nodes affected by specific lines changed in yaml files, rather than re-parsing all resources implicated in the entire yaml file
At the same time, partial parsing runs still need to read and validate every yaml file in the project, every time. This should still be really fast, even for big projects, as long as the user has libyaml installed (the C bindings for PyYAML). I'm going to make sure we document this.
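As a quick way to check whether the optimized loader is actually in play, the usual PyYAML idiom is to try importing the C loader and fall back to the pure-Python one. This is just the generic PyYAML pattern, not dbt's internal code; the file path below is a placeholder.

```python
import yaml

try:
    # Present only when PyYAML was built against libyaml.
    from yaml import CSafeLoader as Loader
except ImportError:
    # Pure-Python fallback: correct, but noticeably slower on large projects.
    from yaml import SafeLoader as Loader

with open("models/schema.yml") as f:  # placeholder path
    data = yaml.load(f, Loader=Loader)

print(yaml.__with_libyaml__)  # True when the C extension is available
print(type(data))             # yaml documents parse into plain Python dicts/lists
```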
Would it be possible to have the best of both worlds, by storing a hash of each yaml file and skipping re-reading/validating it entirely if it's completely unchanged?
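Something like the following minimal sketch, assuming a simple in-memory cache keyed by file path; `parse_if_changed` and the cache layout are invented for illustration, not dbt's actual internals:

```python
import hashlib
from pathlib import Path

import yaml


def file_checksum(path: Path) -> str:
    """Hash the raw bytes, so any change (even whitespace) invalidates the cache."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def parse_if_changed(path: Path, cache: dict) -> dict:
    """Return the cached result when the file is byte-identical to last run,
    otherwise re-read, re-parse, and update the cache."""
    checksum = file_checksum(path)
    entry = cache.get(str(path))
    if entry is not None and entry["checksum"] == checksum:
        # Completely unchanged: skip yaml reading/validation entirely.
        return entry["parsed"]
    parsed = yaml.safe_load(path.read_text()) or {}
    cache[str(path)] = {"checksum": checksum, "parsed": parsed}
    return parsed
```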
big +1 here - totally non-scientific but I also dug into the partial parsing perf data and found this same thing - dbt is doing a lot of (slow) work to determine that two .yml files are identical 😃
I spent some time today diving into the raw load-time data a bit more, and I'm seeing that in parsing steps where partial parsing "works" (parsed_path_count < path_count), the read_files step comprises a huge portion of the total load time, between 80-90%. File reading by itself can take 2-7 ms per file. These numbers appear to hold across both dbt Cloud and non-Cloud deployment/development environments.
In very large projects, this feels like the big thing standing between our pretty-darn-good v0.20 numbers (3-9 ms/path with partial parsing) and our end-of-2021 goal of 1 ms/file, <5 s total in a 5k-file project.
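For anyone who wants to reproduce rough per-file numbers like these, here is a quick (and equally non-scientific) timing loop; the project directory and glob pattern are placeholders:

```python
import time
from pathlib import Path

import yaml

timings = []
for path in Path("my_project").rglob("*.yml"):  # placeholder project directory
    start = time.perf_counter()
    yaml.safe_load(path.read_text())
    timings.append((time.perf_counter() - start) * 1000)  # milliseconds per file

if timings:
    print(f"{len(timings)} files, avg {sum(timings) / len(timings):.2f} ms/file")
```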
It feels like some reasonable next steps could be:
- Compare parsing performance with and without libyaml installed, to validate whether this is a crucial differentiator. Ensure that dbt Cloud is using the optimized C yaml bindings (cc @leahwicz)
- Store file hashes in partial_parse.msgpack and use them as a shortcut for file reading, even if it means some light refactoring and separation/duplication of validation steps (see the sketch after this list)
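A hypothetical sketch of that second idea, persisting per-file checksums and consulting them before re-reading. The file name target/file_hashes.msgpack and all function names here are invented for illustration; dbt's real partial_parse.msgpack layout is more involved and not shown.

```python
import hashlib
from pathlib import Path

import msgpack  # third-party: pip install msgpack

CACHE_PATH = Path("target/file_hashes.msgpack")  # invented filename


def load_hash_cache() -> dict:
    """Load {path: checksum} from the previous run, or start empty."""
    if CACHE_PATH.exists():
        return msgpack.unpackb(CACHE_PATH.read_bytes(), raw=False)
    return {}


def save_hash_cache(cache: dict) -> None:
    """Persist the checksums so the next invocation can skip unchanged files."""
    CACHE_PATH.parent.mkdir(parents=True, exist_ok=True)
    CACHE_PATH.write_bytes(msgpack.packb(cache, use_bin_type=True))


def unchanged_files(project_dir: str, cache: dict) -> set:
    """Paths whose bytes hash to the same value as on the previous run."""
    return {
        str(p)
        for p in Path(project_dir).rglob("*.yml")
        if cache.get(str(p)) == hashlib.sha256(p.read_bytes()).hexdigest()
    }
```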