Skip yaml validation for files that are completely unchanged


Thanks to https://github.com/dbt-labs/dbt/pull/3460, when dbt reads yaml files, it also performs minimal validation and stores their intermediate representation in dictionary form. This enables us to:

  • Return good error messages early & often
  • Better separate file loading from manifest construction
  • In partial parsing: only re-parse nodes affected by specific lines changed in yaml files, rather than re-parsing all resources implicated in the entire yaml file

At the same time, partial parsing runs need to read and validate every yaml file in the project, every time. This should still be really fast, even for big projects, so long as the user has libyaml installed (C bindings for PyYAML). I’m going to make sure we document this.
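One way to check whether the C bindings are available is sketched below. It relies on PyYAML exposing `CSafeLoader` only when it was built against libyaml. The helper name `pick_safe_loader` is made up for illustration and is not part of dbt.

```python
def pick_safe_loader():
    """Return (loader_class, uses_libyaml).

    Returns (None, False) when PyYAML itself is not installed.
    """
    try:
        import yaml
    except ImportError:
        return None, False
    # PyYAML only defines CSafeLoader when the libyaml C extension is present.
    c_loader = getattr(yaml, "CSafeLoader", None)
    if c_loader is not None:
        return c_loader, True
    # Fall back to the pure-Python loader, which is noticeably slower.
    return yaml.SafeLoader, False


loader, uses_libyaml = pick_safe_loader()
print("libyaml C bindings available:", uses_libyaml)
```

The returned class can be passed as the `Loader` argument to `yaml.load`, so the same call site works with or without libyaml.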

Would it be possible to have the best of both worlds, by storing a hash of the yaml file, and skip re-reading / validating it entirely if it’s completely unchanged?
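A rough sketch of the idea, not dbt's actual implementation: compute a checksum per yaml file, compare it with the checksum saved from the previous run, and only parse and validate the files whose checksum differs. The names `file_checksum` and `split_changed` are hypothetical, and SHA-256 is an assumed choice of hash. Computing the hash still reads the file's bytes, so what gets skipped here is the yaml parsing and validation.

```python
import hashlib
from pathlib import Path


def file_checksum(path: Path) -> str:
    """Hash the raw bytes of a file (SHA-256 is an assumption, any stable hash works)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def split_changed(paths, saved_checksums):
    """Split paths into (changed, unchanged) relative to the previous run.

    saved_checksums maps str(path) -> checksum from the last run.
    Also returns the new mapping so the caller can persist it.
    """
    changed, unchanged, current = [], [], {}
    for path in paths:
        digest = file_checksum(path)
        current[str(path)] = digest
        if saved_checksums.get(str(path)) == digest:
            unchanged.append(path)  # safe to reuse the stored representation
        else:
            changed.append(path)    # new or modified: parse and validate again
    return changed, unchanged, current
```

On the first run `saved_checksums` is empty, so every file counts as changed. On later runs only the files in `changed` would go through yaml loading and validation.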

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 7 (6 by maintainers)

Top GitHub Comments

1 reaction
drewbanin commented, Aug 31, 2021

big +1 here - totally non-scientific but I also dug into the partial parsing perf data and found this same thing - dbt is doing a lot of (slow) work to determine that two .yml files are identical 😃

1 reaction
jtcohen6 commented, Aug 29, 2021

I spent some time today diving into the raw load-time data a bit more, and I’m seeing that in parsing steps where partial parsing “works” (parsed_path_count < path_count), the read_files step comprises a huge portion of the total load time, between 80-90%. File reading by itself can take 2-7 ms per file. These numbers appear to hold across both dbt Cloud and non-Cloud deployment/development environments.

In very large projects, this feels like the big thing standing between our pretty-darn-good v0.20 numbers (3-9 ms/path with partial parsing), versus our end-of-2021 goal of 1 ms / file, <5 s total in a 5k file project.
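To put those per-file figures in context, here is the back-of-the-envelope arithmetic for a 5k file project. It assumes "path" and "file" are interchangeable and that cost scales linearly with file count, both of which are simplifications.

```python
FILES = 5_000  # project size used in the goal above


def total_seconds(ms_per_file: float, files: int = FILES) -> float:
    """Total load time in seconds for a given per-file cost in milliseconds."""
    return ms_per_file * files / 1000


# Per-file figures quoted in the comment above.
print("goal (1 ms/file):       ", total_seconds(1), "s")
print("v0.20 partial parsing:  ", total_seconds(3), "to", total_seconds(9), "s")
print("file reading alone:     ", total_seconds(2), "to", total_seconds(7), "s")
```

Under those assumptions, reading files at 2-7 ms each would already cost 10-35 s in such a project, well above the 5 s target.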

It feels like some reasonable next steps could be:

  • Add tracking for whether users have the optimized libyaml installed, to validate whether this is a crucial differentiator. Ensure that dbt Cloud is using the optimized C yaml bindings (cc @leahwicz)
  • Following the proposal in this issue, store file hashes in partial_parse.msgpack and use that as a shortcut for file reading, even if it means some light refactoring and separation/duplication of validation steps