pipelines: parametrize using environment variables / DVC properties
It would be useful if I could parametrize my pipeline using environment variables, read from a properties file specified via dvc config env my.properties. DVC would load those environment variables when running the command.
For example, I could have this properties file:
DVC_NICKNAME=David
And run:
dvc run -o hello.txt 'echo "Hello ${DVC_NICKNAME}!" > hello.txt'
dvc run -o cheers.txt 'echo "Cheers ${DVC_NICKNAME}!" > cheers.txt'
This would produce files containing “Hello David!” and “Cheers David!”.
Users would just have to make sure to quote the command or use interactive mode #1415.
The DVC file would contain the variable reference:
cmd: echo "Hello ${DVC_NICKNAME}!" > hello.txt
The value would be added to the environment by DVC at DVC startup so it would be handled natively by the shell.
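A minimal sketch of what such startup loading could look like, assuming a simple `KEY=VALUE` properties format (the `load_env_properties` helper and the file format are assumptions, not existing DVC code):

```python
import os

def load_env_properties(path):
    """Read simple KEY=VALUE lines and inject them into os.environ,
    so the shell that runs the stage command sees them natively."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and comments
            key, _, value = line.partition("=")
            os.environ[key.strip()] = value.strip()
```

After this runs, `echo "Hello ${DVC_NICKNAME}!"` would be expanded by the shell as usual, with no further work on DVC's side.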
In order for dvc status to detect that a variable used by a stage changed, we can calculate the internal md5 checksum on the contents with the variable values injected in place of the variable references, so that it would be handled as if the contents of the DVC file changed. This can be done using os.path.expandvars. Unfortunately, that only covers variable references used directly in the shell command; it would not cover cases where the environment variable is used inside a script. The only foolproof way would be to force the user to explicitly request the environment variables to be injected from the properties file, e.g. using dvc run -e DVC_NICKNAME -e DVC_OTHER. That would basically allow adding additional “env dependencies” to stages.
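The checksum idea could be sketched like this (a hypothetical `stage_checksum` helper, not existing DVC code): the command is hashed with references expanded, and any explicitly declared env dependencies are hashed by value, which also covers variables used only inside scripts:

```python
import hashlib
import os

def stage_checksum(cmd, env_deps):
    """Hash the stage command with ${VAR} references expanded, plus the
    current values of explicitly declared env dependencies (dvc run -e VAR).
    A changed variable value then changes the checksum, as if the DVC
    file itself had changed."""
    expanded = os.path.expandvars(cmd)
    env_part = "|".join(f"{k}={os.environ.get(k, '')}" for k in sorted(env_deps))
    return hashlib.md5((expanded + "\n" + env_part).encode()).hexdigest()
```

With `DVC_NICKNAME=David`, changing the value to anything else yields a different checksum even if the command text is unchanged, which is exactly what dvc status needs to notice.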
It would be nice to also inject the variables into dependency paths, so that those can be parametrized as well. This could likewise be done using os.path.expandvars. It would change the DAG dynamically, but AFAIK it should actually magically work without breaking anything, right? As long as you initialize the environment at each DVC startup and call expandvars when reading deps paths.
Issue Analytics
- Created 5 years ago
- Reactions: 19
- Comments: 27 (16 by maintainers)
@shcheklein It can be used in any pipeline when you’re providing the same parameters in different stages. I was solving it by manually specifying the parameter multiple times and I didn’t realize it could be solved using a custom config file provided as a dependency as suggested by @efiop.
The problem is that if the config properties were provided as environment variables, even a global DVC config file would have to break the granularity of caching, since those variables could be used hidden inside bash scripts, so there would be no way to check which variables are actually used.
[edited] So the only benefit would probably be if the variables could also be used in dependencies/outputs. For example, configuring the highest performing model file and using that throughout the pipeline. But not sure it’s worth the effort - currently I’m solving it by just having a special location “models/top.pkl” where I copy it.
We run dvc pipelines inside of gitlab pipelines, and this feature would be extraordinarily helpful for gathering information on what branch the pipeline is running on, etc, and making additional commits after processing.