Hydra composition and use of variables from composed params.yaml in stage breaks commands

See original GitHub issue

Bug Report

Description

I am using Hydra composition to allow me to split up a configuration into separate files, and compose them together as needed for a given experiment. Thanks for adding this feature!

However, I encountered problems with my experiment setup because I am using variables from the composed configuration in the deps and outs keys of stages. The variables in the composed params.yaml file define the directory paths of different intermediate pipeline artifacts (e.g., prepared training input data). These are variables for two reasons:

  • The paths depend on a dataset name variable to make it easy to switch between datasets.
  • The paths depend on a data.store prefix, which may be either a relative local path (e.g., data) or an S3 URI prefix (e.g., s3://bucket/key/prefix/). In the latter case, I have set up cache.s3 in .dvc/config to store prepared training data as an external output / dependency of the main train stage. The idea is that I want to be able to run initial experiments on a local GPU server and, at a later stage, run experiments in AWS EC2 (Ray Cluster).
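
To make the two dependencies above concrete, here is a minimal sketch of how such a path might be assembled from a storage prefix and a dataset name. It is only an illustration: in the setup described here the composition is done by Hydra in params.yaml, not in Python, and the names artifact_path, my_dataset, and prepared are hypothetical.

```python
def artifact_path(store: str, dataset: str, artifact: str) -> str:
    """Join a storage prefix, a dataset name, and an artifact name.

    `store` plays the role of the data.store prefix from the report;
    `dataset` and `artifact` are hypothetical placeholder names.
    """
    # Drop any trailing slash so local and S3 prefixes join the same way.
    return "/".join([store.rstrip("/"), dataset, artifact])


# Relative local prefix, as used on a local GPU server.
local_path = artifact_path("data", "my_dataset", "prepared")

# S3 URI prefix, as used for external outputs / dependencies.
s3_path = artifact_path("s3://bucket/key/prefix/", "my_dataset", "prepared")
```

Switching between the local and the S3 case then only requires changing the prefix, which is the motivation for keeping these paths as variables.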

Reproduce

  1. dvc config hydra.enabled True
  2. Create conf/config.yaml with key path: input.txt
  3. Create dvc.yaml with stages:
stages:
  stage_1:
    cmd: echo "Example output path" > ${path}
    outs:
    - ${path}
  stage_2:
    cmd: cat ${path}
    deps:
    - ${path}
  4. Run dvc status dvc.yaml, which displays the error:
ERROR: failed to parse 'stages.stage_1.cmd' in 'dvc.yaml': Could not find 'path'
  5. Try dvc exp run stage_2, which also shows the error:
ERROR: failed to parse 'stages.stage_2.cmd' in 'dvc.yaml': Could not find 'path'

CORRECTION: dvc exp run does work; I had failed to enable Hydra in the DVC config when running this test. The issue with dvc status when params.yaml does not exist remains, though.
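
The failure mode can be illustrated with a small stand-alone sketch of ${...} substitution. This is not DVC's actual implementation; the resolve function and the way the error text is produced are assumptions made only to show why a missing params.yaml leaves ${path} unresolvable.

```python
import re

# Matches ${name} references such as ${path}.
_REF = re.compile(r"\$\{([^}]+)\}")


def resolve(template: str, params: dict) -> str:
    """Replace each ${key} in `template` with its value from `params`."""

    def lookup(match: re.Match) -> str:
        key = match.group(1).strip()
        if key not in params:
            # Wording mirrors the error in the report; this is a simulation.
            raise KeyError(f"Could not find '{key}'")
        return str(params[key])

    return _REF.sub(lookup, template)


# Stage commands taken from the dvc.yaml in the reproduction steps.
stages = {
    "stage_1": 'echo "Example output path" > ${path}',
    "stage_2": "cat ${path}",
}

# Case 1: the composed parameters are available (params.yaml exists).
composed = {"path": "input.txt"}
resolved = {name: resolve(cmd, composed) for name, cmd in stages.items()}

# Case 2: no parameters are available (params.yaml was never generated).
try:
    resolve(stages["stage_1"], {})
    error = None
except KeyError as exc:
    error = exc.args[0]
```

In the first case both commands resolve against input.txt. In the second the lookup fails with the same kind of message that dvc status reports.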

Expected

dvc status and dvc exp show produce expected output rather than a failure to parse the dvc.yaml stages.

Environment information

Output of dvc doctor:

$ dvc doctor
DVC version: 2.30.0 (rpm)
---------------------------------
Platform: Python 3.8.3 on Linux-3.10.0-1160.15.2.el7.x86_64-x86_64-with-glibc2.14
Subprojects:

Supports:
        azure (adlfs = None, knack = 0.10.0, azure-identity = None),
        gdrive (pydrive2 = 1.14.0),
        gs (gcsfs = None),
        hdfs (fsspec = None, pyarrow = 9.0.0),
        http (aiohttp = None, aiohttp-retry = 2.8.3),
        https (aiohttp = None, aiohttp-retry = 2.8.3),
        oss (ossfs = 2021.8.0),
        s3 (s3fs = None, boto3 = None),
        ssh (sshfs = 2022.6.0),
        webdav (webdav4 = 0.9.7),
        webdavs (webdav4 = 0.9.7),
        webhdfs (fsspec = None)
Cache types: hardlink, symlink
Cache directory: xfs on /dev/md124
Caches: local, s3
Remotes: s3, s3
Workspace directory: xfs on /dev/md124
Repo: dvc (subdir), git

Additional Information (if any):

I’ve installed DVC in a Docker image directly from the YUM repository.

RUN wget https://dvc.org/rpm/dvc.repo -O /etc/yum.repos.d/dvc.repo \
    && rpm --import https://dvc.org/rpm/iterative.asc \
    && yum update -y \
    && yum install -y dvc-2.28.0-1 \
    && yum clean all \
    && rm -rf /var/cache/yum

(The DVC version differs from the dvc doctor output because the command above is what I used when building the Docker image; I later upgraded DVC inside the container with yum to a more recent version.)

In my actual project, I had produced the params.yaml with a first stage that I had used to just test the Hydra composition:

stages:
  print_params:
    cmd: cat params.yaml
    outs:
    - params.yaml:
        cache: false
        persist: true

After this, the params.yaml file already existed, and I had added it to Git with git add params.yaml + git commit because dvc exp run did not automatically add the file to .gitignore.

With the params.yaml file in place, the dvc status and dvc exp run commands worked in my actual project. I only encountered the issue with dvc exp show and when trying to display the experiments in VS Code using the DVC extension. Only when putting together the example above did I notice that even the initial dvc exp run would not work with this dvc.yaml.

Issue Analytics

  • State: open
  • Created a year ago
  • Comments: 13 (5 by maintainers)

Top GitHub Comments

1 reaction
aschuh-hf commented, Nov 2, 2022

After adding the params.yaml back with git add and git commit, I can now see that the Git commits which are not associated with a particular DVC experiment are no longer red, and the rows in the Experiments table show the entries from the params.yaml at these commits. The “DVC Tracked” section in the file explorer also works.

In summary, the issue was caused by me removing the params.yaml from Git and not realizing that this file, which is generated by dvc exp run, should indeed be tracked with Git.
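
Based on that summary, a pre-flight check along these lines could catch the situation before running dvc status or dvc exp show. This is a sketch only: the helper names are hypothetical, and it assumes git is on the PATH and that params.yaml lives in the directory passed in.

```python
import subprocess
from pathlib import Path


def params_file_exists(repo_dir, name="params.yaml"):
    """Return True if the generated parameters file is present on disk."""
    return (Path(repo_dir) / name).is_file()


def is_tracked_by_git(repo_dir, name="params.yaml", run=subprocess.run):
    """Return True if Git tracks `name` in `repo_dir`.

    `git ls-files --error-unmatch` exits with a non-zero status when the
    file is not tracked. `run` is injectable so the check can be exercised
    without a real repository.
    """
    result = run(
        ["git", "ls-files", "--error-unmatch", name],
        cwd=repo_dir,
        capture_output=True,
    )
    return result.returncode == 0
```

If either check returns False, the file was likely removed or never committed, which matches the cause described above.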

1 reaction
shcheklein commented, Nov 1, 2022

@aschuh-hf yes, please create an issue in VS Code extension repo to make it more flexible (adjust which commits to show/analyze).

Should we also improve the message - explain that it’s in the HEAD commit, not in the workspace?

On a side note, we should also do something with [31m and 39m] (cc @skshetry). Should we skip adding colors in --json mode? That would make sense, since the output is most likely meant to be consumed by a machine.
