Hydra composition and use of variables from composed params.yaml in stage breaks commands
Bug Report
Description
I am using Hydra composition to allow me to split up a configuration into separate files, and compose them together as needed for a given experiment. Thanks for adding this feature!
However, I encountered problems with my experiment setup because I am using variables from the composed configuration in the deps and outs keys of stages. The variables of the composed params.yaml file define directory paths of different intermediate pipeline artifacts (e.g., prepared training input data). The reason these are variables is two-fold:
- The paths depend on a dataset name variable to make it easy to switch between datasets.
- The paths depend on a data.store prefix which may either be a relative local path (e.g., data) or an S3 URI prefix (e.g., s3://bucket/key/prefix/). For the latter case, when an S3 URI is used, I have set up cache.s3 in .dvc/config to store prepared training data as an external output / dependency of the main train stage. The idea is that I want to be able to run initial experiments on a local GPU server, and at a later stage run experiments in AWS EC2 (Ray Cluster).
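The setup described above could look roughly like the following sketch. Only the dataset and data.store keys and the train stage name come from the description; the dataset value, the prepared subdirectory, and the stage command are placeholders made up for illustration.

```yaml
# conf/config.yaml (sketch; values are placeholders)
dataset: my_dataset        # hypothetical dataset name
data:
  store: data              # relative local path; could instead be an S3 URI prefix
---
# dvc.yaml (separate file; stage command and directory layout are assumptions)
stages:
  train:
    cmd: python train.py   # hypothetical training command
    deps:
      # resolved from the composed params.yaml, e.g. data/my_dataset/prepared
      - ${data.store}/${dataset}/prepared
```

Switching datasets or moving the data from local disk to S3 would then only require changing the two config values, not the stage definitions.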
Reproduce
- Run dvc config hydra.enabled True
- Create conf/config.yaml with key path: input.txt
- Create dvc.yaml with stages
stages:
  stage_1:
    cmd: echo "Example output path" > ${path}
    outs:
      - ${path}
  stage_2:
    cmd: cat ${path}
    deps:
      - ${path}
- Run dvc status dvc.yaml, which displays the error:
ERROR: failed to parse 'stages.stage_1.cmd' in 'dvc.yaml': Could not find 'path'
- Try dvc exp run stage_2, which also shows the error:
ERROR: failed to parse 'stages.stage_2.cmd' in 'dvc.yaml': Could not find 'path'
CORRECTION: dvc exp run does work; I had failed to enable hydra in the DVC config when running this test. The issue with dvc status when params.yaml does not exist remains, though.
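To make the failure concrete, here is a sketch of what the stages from the reproduction steps would be expected to look like once the path value is available, for example from a params.yaml written by composition. The resolved view is an illustration obtained by plain substitution, not actual DVC output.

```yaml
# params.yaml as composition is expected to write it for the single-key config
# (an assumption; the exact dump format may differ)
path: input.txt
---
# dvc.yaml after substituting ${path} (illustrative resolved view)
stages:
  stage_1:
    cmd: echo "Example output path" > input.txt
    outs:
      - input.txt
  stage_2:
    cmd: cat input.txt
    deps:
      - input.txt
```

Without a params.yaml in the workspace, there is nothing to substitute for ${path}, which matches the "Could not find 'path'" message reported by dvc status.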
Expected
dvc status and dvc exp show produce the expected output rather than a failure to parse the dvc.yaml stages.
Environment information
Output of dvc doctor:
$ dvc doctor
DVC version: 2.30.0 (rpm)
---------------------------------
Platform: Python 3.8.3 on Linux-3.10.0-1160.15.2.el7.x86_64-x86_64-with-glibc2.14
Subprojects:
Supports:
azure (adlfs = None, knack = 0.10.0, azure-identity = None),
gdrive (pydrive2 = 1.14.0),
gs (gcsfs = None),
hdfs (fsspec = None, pyarrow = 9.0.0),
http (aiohttp = None, aiohttp-retry = 2.8.3),
https (aiohttp = None, aiohttp-retry = 2.8.3),
oss (ossfs = 2021.8.0),
s3 (s3fs = None, boto3 = None),
ssh (sshfs = 2022.6.0),
webdav (webdav4 = 0.9.7),
webdavs (webdav4 = 0.9.7),
webhdfs (fsspec = None)
Cache types: hardlink, symlink
Cache directory: xfs on /dev/md124
Caches: local, s3
Remotes: s3, s3
Workspace directory: xfs on /dev/md124
Repo: dvc (subdir), git
Additional Information (if any):
I’ve installed DVC in a Docker image directly from the YUM repository.
RUN wget https://dvc.org/rpm/dvc.repo -O /etc/yum.repos.d/dvc.repo \
&& rpm --import https://dvc.org/rpm/iterative.asc \
&& yum update -y \
&& yum install -y dvc-2.28.0-1 \
&& yum clean all \
&& rm -rf /var/cache/yum
(The DVC version mismatch arises because I used the command above when building the Docker image, but later upgraded DVC with yum inside the container to a more recent version.)
In my actual project, I had produced the params.yaml with a first stage that I had used just to test the Hydra composition:
stages:
  print_params:
    cmd: cat params.yaml
    outs:
      - params.yaml:
          cache: false
          persist: true
After this, the params.yaml existed already and I had added it to Git with git add params.yaml + git commit, because the file was not automatically added to .gitignore by dvc exp run.
With the params.yaml file existing, the dvc status and dvc exp run commands worked in my actual project. I only encountered the issue with dvc exp show and when trying to display the experiments in VS Code using the DVC extension. Only when putting the example above together did I notice that even the initial dvc exp run would not work with this dvc.yaml.
Top GitHub Comments
After adding the params.yaml back with git add and git commit, I can see now that the Git commits which are not associated with a particular DVC experiment are no longer red and the rows in the Experiments table show the entries from the params.yaml at these commits. Also the “DVC Tracked” section in the file explorer works.
In summary, the issue was caused by me removing the params.yaml from Git and not being clear that this file, which is generated by dvc exp run, should indeed be tracked with Git.
@aschuh-hf yes, please create an issue in the VS Code extension repo to make it more flexible (adjust which commits to show/analyze).
Should we also improve the message - explain that it’s in the HEAD commit, not in the workspace?
On a side note - we should also do something with [31m and 39m]
cc @skshetry - should we skip adding colors in --json mode? (it makes sense, since most likely this is expected to be consumed by a machine)