`dvc list -R`: listing contents of data registry fails, when using recursive flag
Bug Report
I set up a sample data registry containing the data-generating code using a dvc.yaml pipeline. When trying to list the registry’s content, `dvc list` works as intended and shows the top-level files and dirs of the repo. When using `dvc list -R`, it fails with a TreeError. This seems to be similar to issue #7871, and a comment regarding TreeError can be found in the Discord channel as well.
Description
dvc list works as intended
$ dvc list -vv https://github.com/hfrechen/data-registry-test
2022-08-09 13:29:13,399 TRACE: Namespace(cprofile=False, yappi=False, viztracer=False, viztracer_depth=None, cprofile_dump=None, pdb=False, instrument=False, instrument_open=False, quiet=0, verbose=2, version=None, cd='.', cmd='list', url='https://github.com/hfrechen/data-registry-test', recursive=False, dvc_only=False, json=False, rev=None, path=None, func=<class 'dvc.commands.ls.CmdList'>, parser=DvcParser(prog='dvc', usage=None, description='Data Version Control', formatter_class=<class 'argparse.RawTextHelpFormatter'>, conflict_handler='error', add_help=False))
2022-08-09 13:29:13,651 DEBUG: Creating external repo https://github.com/hfrechen/data-registry-test@None
2022-08-09 13:29:13,651 DEBUG: erepo: git clone 'https://github.com/hfrechen/data-registry-test' to a temporary dir
2022-08-09 13:29:14,895 TRACE: Context during resolution of stage data_import:
{'data': {'path': {'raw': './data/raw', 'interim': './data/interim'}}}
2022-08-09 13:29:14,898 TRACE: 50.41 ms in collecting stages from /
2022-08-09 13:29:14,898 TRACE: 2.26 mks in collecting stages from /.dvc
2022-08-09 13:29:14,899 TRACE: 6.07 mks in collecting stages from /data
2022-08-09 13:29:14,899 TRACE: 4.54 mks in collecting stages from /data/interim
2022-08-09 13:29:14,899 TRACE: 3.82 mks in collecting stages from /data/raw
2022-08-09 13:29:14,899 TRACE: 3.96 mks in collecting stages from /src
.dvcignore
.gitignore
README.md
data
dvc.lock
dvc.yaml
params.yaml
src
2022-08-09 13:29:14,905 DEBUG: Analytics is enabled.
2022-08-09 13:29:14,958 DEBUG: Trying to spawn '['daemon', '-q', 'analytics', '/tmp/tmpai6phy17']'
2022-08-09 13:29:14,960 DEBUG: Spawned '['daemon', '-q', 'analytics', '/tmp/tmpai6phy17']'
dvc list -R fails with TreeError
$ dvc list -R -vv https://github.com/hfrechen/data-registry-test
2022-08-09 13:29:37,067 TRACE: Namespace(cprofile=False, yappi=False, viztracer=False, viztracer_depth=None, cprofile_dump=None, pdb=False, instrument=False, instrument_open=False, quiet=0, verbose=2, version=None, cd='.', cmd='list', url='https://github.com/hfrechen/data-registry-test', recursive=True, dvc_only=False, json=False, rev=None, path=None, func=<class 'dvc.commands.ls.CmdList'>, parser=DvcParser(prog='dvc', usage=None, description='Data Version Control', formatter_class=<class 'argparse.RawTextHelpFormatter'>, conflict_handler='error', add_help=False))
2022-08-09 13:29:37,180 DEBUG: Creating external repo https://github.com/hfrechen/data-registry-test@None
2022-08-09 13:29:37,181 DEBUG: erepo: git clone 'https://github.com/hfrechen/data-registry-test' to a temporary dir
2022-08-09 13:29:38,134 TRACE: Context during resolution of stage data_import:
{'data': {'path': {'raw': './data/raw', 'interim': './data/interim'}}}
2022-08-09 13:29:38,137 TRACE: 32.35 ms in collecting stages from /
2022-08-09 13:29:38,137 TRACE: 2.09 mks in collecting stages from /.dvc
2022-08-09 13:29:38,137 TRACE: 6.68 mks in collecting stages from /data
2022-08-09 13:29:38,138 TRACE: 4.53 mks in collecting stages from /data/interim
2022-08-09 13:29:38,138 TRACE: 3.66 mks in collecting stages from /data/raw
2022-08-09 13:29:38,138 TRACE: 3.51 mks in collecting stages from /src
2022-08-09 13:29:38,156 ERROR: unexpected error
------------------------------------------------------------
Traceback (most recent call last):
File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/dvc/cli/__init__.py", line 185, in main
ret = cmd.do_run()
File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/dvc/cli/command.py", line 36, in do_run
return self.run()
File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/dvc/commands/ls/__init__.py", line 31, in run
entries = Repo.ls(
File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/dvc/repo/ls.py", line 46, in ls
ret = _ls(repo, path, recursive, dvc_only)
File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/dvc/repo/ls.py", line 68, in _ls
for root, dirs, files in fs.walk(
File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/fsspec/spec.py", line 421, in walk
yield from self.walk(d, maxdepth=maxdepth, detail=detail, **kwargs)
File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/fsspec/spec.py", line 421, in walk
yield from self.walk(d, maxdepth=maxdepth, detail=detail, **kwargs)
File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/fsspec/spec.py", line 389, in walk
listing = self.ls(path, detail=True, **kwargs)
File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/dvc/fs/dvc.py", line 335, in ls
for entry in dvc_fs.ls(dvc_path, detail=False):
File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/dvc_objects/fs/base.py", line 318, in ls
return self.fs.ls(path, detail=detail)
File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/dvc_data/fs.py", line 82, in ls
for name in self.index.ls(prefix=root_key)
File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/dvc_data/index.py", line 130, in ls
raise TreeError
dvc_data.objects.tree.TreeError
------------------------------------------------------------
2022-08-09 13:29:38,528 DEBUG: Version info for developers:
DVC version: 2.17.0 (conda)
---------------------------------
Platform: Python 3.9.13 on Linux-3.10.0-1160.45.1.el7.x86_64-x86_64-with-glibc2.31
Supports:
hdfs (fsspec = 2022.7.1, pyarrow = 8.0.1),
webhdfs (fsspec = 2022.7.1),
http (aiohttp = 3.8.1, aiohttp-retry = 2.7.0),
https (aiohttp = 3.8.1, aiohttp-retry = 2.7.0)
Cache types: <https://error.dvc.org/no-dvc-cache>
Caches: local
Remotes: None
Workspace directory: xfs on /dev/mapper/centos-root
Repo: dvc, git
Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!
2022-08-09 13:29:38,529 DEBUG: Analytics is enabled.
2022-08-09 13:29:38,591 DEBUG: Trying to spawn '['daemon', '-q', 'analytics', '/tmp/tmp59k200wt']'
2022-08-09 13:29:38,595 DEBUG: Spawned '['daemon', '-q', 'analytics', '/tmp/tmp59k200wt']'
This seems to affect `dvc import` as well:
$ dvc import -vv https://github.com/hfrechen/data-registry-test -o data/interim data/interim
2022-08-09 13:30:18,116 TRACE: Namespace(cprofile=False, yappi=False, viztracer=False, viztracer_depth=None, cprofile_dump=None, pdb=False, instrument=False, instrument_open=False, quiet=0, verbose=2, version=None, cd='.', cmd='import', url='https://github.com/hfrechen/data-registry-test', path='data/interim', out='data/interim', rev=None, file=None, no_exec=False, desc=None, jobs=None, func=<class 'dvc.commands.imp.CmdImport'>, parser=DvcParser(prog='dvc', usage=None, description='Data Version Control', formatter_class=<class 'argparse.RawTextHelpFormatter'>, conflict_handler='error', add_help=False))
2022-08-09 13:30:18,402 TRACE: 66.82 mks in collecting stages from /home/dev/projects/data-consumer
2022-08-09 13:30:18,402 TRACE: 73.89 mks in collecting stages from /home/dev/projects/data-consumer/data
2022-08-09 13:30:18,402 TRACE: 1.97 mks in collecting stages from /home/dev/projects/data-consumer/data/interim
2022-08-09 13:30:18,677 DEBUG: Removing output 'data/interim/interim' of stage: 'data/interim/interim.dvc'.
2022-08-09 13:30:18,677 DEBUG: Removing '/home/dev/projects/data-consumer/data/interim/interim'
Importing 'data/interim (https://github.com/hfrechen/data-registry-test)' -> 'data/interim/interim'
2022-08-09 13:30:18,678 DEBUG: Computed stage: 'data/interim/interim.dvc' md5: '1e3c54ddb027f31952ea5d2c65f3ed8e'
2022-08-09 13:30:18,678 DEBUG: 'md5' of stage: 'data/interim/interim.dvc' changed.
2022-08-09 13:30:18,679 DEBUG: Creating external repo https://github.com/hfrechen/data-registry-test@None
2022-08-09 13:30:18,679 DEBUG: erepo: git clone 'https://github.com/hfrechen/data-registry-test' to a temporary dir
2022-08-09 13:30:19,651 DEBUG: Checking if stage '/data/interim' is in 'dvc.yaml'
2022-08-09 13:30:19,687 TRACE: Context during resolution of stage data_import:
{'data': {'path': {'raw': './data/raw', 'interim': './data/interim'}}}
2022-08-09 13:30:19,690 TRACE: 37.28 ms in collecting stages from /
2022-08-09 13:30:19,690 TRACE: 2.45 mks in collecting stages from /.dvc
2022-08-09 13:30:19,690 TRACE: 7.00 mks in collecting stages from /data
2022-08-09 13:30:19,690 TRACE: 5.64 mks in collecting stages from /data/interim
2022-08-09 13:30:19,690 TRACE: 3.54 mks in collecting stages from /data/raw
2022-08-09 13:30:19,691 TRACE: 3.58 mks in collecting stages from /src
2022-08-09 13:30:19,697 ERROR: failed to import 'data/interim' from 'https://github.com/hfrechen/data-registry-test'. - The path 'data/interim' does not exist in the target repository 'https://github.com/hfrechen/data-registry-test' neither as a DVC output nor as a Git-tracked file.:
------------------------------------------------------------
Traceback (most recent call last):
File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/dvc/dependency/repo.py", line 134, in _get_used_and_obj
object_store, _, obj = build(
File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/dvc_data/build.py", line 245, in build
meta, obj = _build_tree(
File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/dvc_data/build.py", line 123, in _build_tree
for root, _, fnames in walk_iter:
File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/fsspec/spec.py", line 389, in walk
listing = self.ls(path, detail=True, **kwargs)
File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/dvc/fs/dvc.py", line 335, in ls
for entry in dvc_fs.ls(dvc_path, detail=False):
File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/dvc_objects/fs/base.py", line 318, in ls
return self.fs.ls(path, detail=detail)
File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/dvc_data/fs.py", line 82, in ls
for name in self.index.ls(prefix=root_key)
File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/dvc_data/index.py", line 130, in ls
raise TreeError
dvc_data.objects.tree.TreeError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/dvc/commands/imp.py", line 15, in run
self.repo.imp(
File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/dvc/repo/imp.py", line 6, in imp
return self.imp_url(
File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/dvc/repo/__init__.py", line 48, in wrapper
return f(repo, *args, **kwargs)
File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/dvc/repo/scm_context.py", line 156, in run
return method(repo, *args, **kw)
File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/dvc/repo/imp_url.py", line 83, in imp_url
stage.run(jobs=jobs)
File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/funcy/decorators.py", line 45, in wrapper
return deco(call, *dargs, **dkwargs)
File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/dvc/stage/decorators.py", line 38, in rwlocked
return call()
File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/funcy/decorators.py", line 66, in __call__
return self._func(*self._args, **self._kwargs)
File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/dvc/stage/__init__.py", line 535, in run
self._sync_import(dry, force, jobs)
File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/funcy/decorators.py", line 45, in wrapper
return deco(call, *dargs, **dkwargs)
File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/dvc/stage/decorators.py", line 38, in rwlocked
return call()
File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/funcy/decorators.py", line 66, in __call__
return self._func(*self._args, **self._kwargs)
File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/dvc/stage/__init__.py", line 559, in _sync_import
sync_import(self, dry, force, jobs)
File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/dvc/stage/imports.py", line 43, in sync_import
stage.deps[0].download(stage.outs[0], jobs=jobs)
File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/dvc/dependency/repo.py", line 68, in download
for odb, objs in self.get_used_objs().items():
File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/dvc/dependency/repo.py", line 97, in get_used_objs
used, _ = self._get_used_and_obj(**kwargs)
File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/dvc/dependency/repo.py", line 141, in _get_used_and_obj
raise PathMissingError(
dvc.exceptions.PathMissingError: The path 'data/interim' does not exist in the target repository 'https://github.com/hfrechen/data-registry-test' neither as a DVC output nor as a Git-tracked file.
------------------------------------------------------------
2022-08-09 13:30:19,706 DEBUG: Analytics is enabled.
2022-08-09 13:30:19,914 DEBUG: Trying to spawn '['daemon', '-q', 'analytics', '/tmp/tmp3p8liv51']'
2022-08-09 13:30:19,917 DEBUG: Spawned '['daemon', '-q', 'analytics', '/tmp/tmp3p8liv51']'
Reproduce
I created a sample repo, https://github.com/hfrechen/data-registry-test, to try it out. Both `dvc list -R` and `dvc import` fail for me on different combinations of Ubuntu and Windows clients with DVC 2.15, 2.16 and 2.17.
- dvc list -vv https://github.com/hfrechen/data-registry-test
- dvc list -R -vv https://github.com/hfrechen/data-registry-test
- dvc import https://github.com/hfrechen/data-registry-test -o data/interim data/interim -vv
Expected
`dvc list -R` should list all subdirectories and files contained in the data registry.
Environment information
I could reproduce this in a clean conda environment created with just `conda create -n dvc -c conda-forge dvc`.
Output of `dvc doctor`:
$ dvc doctor
DVC version: 2.17.0 (conda)
---------------------------------
Platform: Python 3.9.13 on Linux-3.10.0-1160.45.1.el7.x86_64-x86_64-with-glibc2.31
Supports:
hdfs (fsspec = 2022.7.1, pyarrow = 8.0.1),
webhdfs (fsspec = 2022.7.1),
http (aiohttp = 3.8.1, aiohttp-retry = 2.7.0),
https (aiohttp = 3.8.1, aiohttp-retry = 2.7.0)
Cache types: <https://error.dvc.org/no-dvc-cache>
Caches: local
Remotes: None
Workspace directory: xfs on /dev/mapper/centos-root
Repo: dvc, git
Additional Information (if any):
I tried to debug and tracked it down to this line: https://github.com/iterative/dvc-data/blob/main/src/dvc_data/index.py#L138. Looking at the state of the local objects, entry.obj seems to be None:
(Pdb) prefix
('data', 'interim')
(Pdb) self._trie
Trie([(('data', 'interim'), DataIndexEntry(meta=<dvc_data.hashfile.meta.Meta object at 0x7f7e7b30a840>, obj=None, hash_info=<dvc_data.hashfile.hash_info.HashInfo object at 0x7f7e7b23c0c0>, odb=<dvc_data.db.local.LocalHashFileDB object at 0x7f7e7b2cf3a0>, remote=None, loaded=None))])
(Pdb) entry
DataIndexEntry(meta=<dvc_data.hashfile.meta.Meta object at 0x7f7e7b30a840>, obj=None, hash_info=<dvc_data.hashfile.hash_info.HashInfo object at 0x7f7e7b23c0c0>, odb=<dvc_data.db.local.LocalHashFileDB object at 0x7f7e7b2cf3a0>, remote=None, loaded=None)
(Pdb) entry.obj
(Pdb) entry.obj is None
True
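To illustrate the failure mode seen in the debugger session above, here is a minimal sketch. It is not the dvc-data implementation: `ToyIndex`, `Entry`, and the local `TreeError` are hypothetical stand-ins, and the hash value is a placeholder. It only models the idea that an index entry can know a directory's hash while its listing (`obj`) was never loaded, in which case listing that prefix cannot succeed.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Key = Tuple[str, ...]


class TreeError(Exception):
    """Hypothetical stand-in for dvc_data.objects.tree.TreeError."""


@dataclass
class Entry:
    # Hypothetical, simplified stand-in for DataIndexEntry.
    hash_value: str
    # Child names of the directory; None means the listing was never loaded.
    obj: Optional[List[str]] = None


class ToyIndex:
    """Toy index keyed by path tuples such as ('data', 'interim')."""

    def __init__(self) -> None:
        self._entries: Dict[Key, Entry] = {}

    def add(self, key: Key, entry: Entry) -> None:
        self._entries[key] = entry

    def ls(self, prefix: Key) -> List[Key]:
        entry = self._entries.get(prefix)
        if entry is None:
            raise KeyError(prefix)
        if entry.obj is None:
            # The hash is known, but there is no directory listing to walk.
            raise TreeError(f"no tree object loaded for {'/'.join(prefix)}")
        return [prefix + (name,) for name in entry.obj]


index = ToyIndex()
index.add(("data", "interim"), Entry(hash_value="placeholder.dir"))
try:
    index.ls(("data", "interim"))
except TreeError as exc:
    print(f"TreeError: {exc}")
```

In this toy model the error carries the affected path, which is the kind of context the bare `raise TreeError` in the traceback does not provide.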
Top GitHub Comments
It doesn’t look like there is any default remote set in https://github.com/hfrechen/data-registry-test, without which DVC can’t granularly list the contents of directories it’s tracking or import data. In DVC<=2.9.4, I get the error
dvc.exceptions.NoRemoteInExternalRepoError: No DVC remote is specified in target repository 'https://github.com/hfrechen/data-registry-test'.
It’s probably best to keep showing an error like this.

Just for your information: the TreeError appeared another time for me, again with the not so meaningful error message. It took me a while to figure out what the reason was, because the default remote was set.
- `dvc repro`, which was executed successfully
- `dvc push` and tried to import the new stage of the data registry in another repo
- `dvc list -R` failed with TreeError, due to the non-existent new stage in the DVC remote

Maybe you could improve the error messages here as well. Something like this?
Issue 1: “Default remote not set. Please configure in .dvc/config”
Issue 2: “Files tracked in your registry could not be found on remote”
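The two suggested messages could be chosen roughly as in the sketch below. This is only an illustration under stated assumptions, not DVC code: `default_remote` and `explain_failure` are hypothetical helpers, the config text is assumed to be plain INI that Python's `configparser` can read, and whether the data exists on the remote is passed in as a flag instead of being checked.

```python
import configparser
from typing import Optional


def default_remote(config_text: str) -> Optional[str]:
    """Return the default remote name from DVC-style config text, or None."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    if parser.has_option("core", "remote"):
        return parser.get("core", "remote")
    return None


def explain_failure(config_text: str, data_on_remote: bool) -> Optional[str]:
    """Pick one of the two proposed messages, or None if neither applies."""
    if default_remote(config_text) is None:
        return "Default remote not set. Please configure in .dvc/config"
    if not data_on_remote:
        return "Files tracked in your registry could not be found on remote"
    return None


# Hypothetical config with a default remote named "storage".
CONFIG_WITH_REMOTE = """
[core]
remote = storage
['remote "storage"']
url = /tmp/dvc-storage
"""

print(explain_failure("", data_on_remote=False))
print(explain_failure(CONFIG_WITH_REMOTE, data_on_remote=False))
```

The first call corresponds to the original report (no default remote), the second to the follow-up case where a remote is configured but the new stage was never pushed.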