pipelines: parametrize using environment variables / DVC properties
It would be useful if I could parametrize my pipeline using environment variables, read from a properties file specified via dvc config env my.properties. DVC would load those environment variables when running the command.
For example, I could have this properties file:
DVC_NICKNAME=David
And run:
dvc run -o hello.txt 'echo "Hello ${DVC_NICKNAME}!" > hello.txt'
dvc run -o cheers.txt 'echo "Cheers ${DVC_NICKNAME}!" > cheers.txt'
This would produce files containing “Hello David!” and “Cheers David!”.
Users would just have to make sure to quote the command or use interactive mode #1415.
The DVC file would contain the variable reference:
cmd: echo "Hello ${DVC_NICKNAME}!" > hello.txt
The value would be added to the environment by DVC at DVC startup so it would be handled natively by the shell.
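A minimal sketch of what such startup loading could look like, assuming a simple `KEY=VALUE` properties format (the `load_env_properties` helper and the file format are assumptions, not existing DVC code):

```python
import os

def load_env_properties(path):
    """Read simple KEY=VALUE lines and inject them into os.environ,
    so the shell that runs the stage command sees them natively."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and comments
            key, _, value = line.partition("=")
            os.environ[key.strip()] = value.strip()
```

After this runs, `echo "Hello ${DVC_NICKNAME}!"` would be expanded by the shell as usual, with no further work on DVC's side.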
In order for dvc status to detect that a variable used by a stage changed, we can calculate the internal md5 checksum on the contents with the variable values injected in place of the variable references, so that it would be handled as if the contents of the DVC file changed. This can be done using os.path.expandvars. Unfortunately, that only covers variable references used directly in the shell command; it would not cover cases where the environment variable is used inside a script. The only foolproof way would be to force the user to explicitly request the environment variables to be injected from the properties file, e.g. using dvc run -e DVC_NICKNAME -e DVC_OTHER. That would basically allow adding additional “env dependencies” to stages.
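The checksum idea could be sketched like this (a hypothetical `stage_checksum` helper, not existing DVC code): the command is hashed with references expanded, and any explicitly declared env dependencies are hashed by value, which also covers variables used only inside scripts:

```python
import hashlib
import os

def stage_checksum(cmd, env_deps):
    """Hash the stage command with ${VAR} references expanded, plus the
    current values of explicitly declared env dependencies (dvc run -e VAR).
    A changed variable value then changes the checksum, as if the DVC
    file itself had changed."""
    expanded = os.path.expandvars(cmd)
    env_part = "|".join(f"{k}={os.environ.get(k, '')}" for k in sorted(env_deps))
    return hashlib.md5((expanded + "\n" + env_part).encode()).hexdigest()
```

With `DVC_NICKNAME=David`, changing the value to anything else yields a different checksum even if the command text is unchanged, which is exactly what dvc status needs to notice.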
It would be nice to also inject the variables into dependency paths, so that those can be parametrized as well. This could likewise be done using os.path.expandvars. It would change the DAG dynamically, but AFAIK it should actually magically work without breaking anything, right? As long as you initialize the environment at each DVC startup and call expandvars when reading deps paths.
Issue Analytics
- Created 5 years ago
- Reactions: 19
- Comments: 27 (16 by maintainers)
@shcheklein It can be used in any pipeline when you’re providing the same parameters in different stages. I was solving it by manually specifying the parameter multiple times and I didn’t realize it could be solved using a custom config file provided as a dependency as suggested by @efiop.
The problem is that if the config properties were provided as environment variables, even a global DVC config file would have to break the granularity of caching, since those variables could be used hidden inside bash scripts, so there would be no way to check which variables are actually used.
[edited] So the only benefit would probably be if the variables could also be used in dependencies/outputs. For example, configuring the highest performing model file and using that throughout the pipeline. But not sure it’s worth the effort - currently I’m solving it by just having a special location “models/top.pkl” where I copy it.
We run dvc pipelines inside of gitlab pipelines, and this feature would be extraordinarily helpful for gathering information on what branch the pipeline is running on, etc, and making additional commits after processing.