dvc.api.open (python): fails with an absolute path
Description
When calling dvc.api.open on a file that is present on a remote but not locally (and not in the cache either), the call fails with the following exception if the provided path is absolute (it works fine if the provided path is relative to the DVC repository):
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
File /usr/local/lib/python3.9/site-packages/dvc/fs/data.py:125, in _DataFileSystem.info(self, path, **kwargs)
124 try:
--> 125 outs = list(self.repo.index.tree.iteritems(key)) # noqa: B301
126 except KeyError as exc:
File /usr/local/lib/python3.9/site-packages/dvc_data/objects/tree.py:106, in Tree.iteritems(self, prefix)
104 self._load(key, meta, hash_info)
--> 106 for key, (meta, hash_info) in self._trie.iteritems(**kwargs):
107 self._load(key, meta, hash_info)
File /usr/local/lib/python3.9/site-packages/pygtrie.py:718, in Trie.iteritems(self, prefix, shallow)
678 """Yields all nodes with associated values with given prefix.
679
680 Only nodes with values are output. For example::
(...)
716 KeyError: If ``prefix`` does not match any node.
717 """
--> 718 node, _ = self._get_node(prefix)
719 for path, value in node.iterate(list(self.__path_from_key(prefix)),
720 shallow, self._iteritems):
File /usr/local/lib/python3.9/site-packages/pygtrie.py:630, in Trie._get_node(self, key)
629 if node is None:
--> 630 raise KeyError(key)
631 trace.append((step, node))
KeyError: ('local', '/', 'repo', 'path', 'to', 'target.ext')
The above exception was the direct cause of the following exception:
FileNotFoundError Traceback (most recent call last)
File /usr/local/lib/python3.9/site-packages/dvc/repo/__init__.py:505, in Repo.open_by_relpath(self, path, remote, mode, encoding)
504 try:
--> 505 with fs.open(
506 fs_path,
507 mode=mode,
508 encoding=encoding,
509 remote=remote,
510 ) as fobj:
511 yield fobj
File /usr/local/lib/python3.9/site-packages/dvc_objects/fs/base.py:191, in FileSystem.open(self, path, mode, **kwargs)
190 kwargs.pop("encoding", None)
--> 191 return self.fs.open(path, mode=mode, **kwargs)
File /usr/local/lib/python3.9/site-packages/dvc/fs/data.py:88, in _DataFileSystem.open(self, path, mode, encoding, **kwargs)
85 def open( # type: ignore
86 self, path: str, mode="r", encoding=None, **kwargs
87 ): # pylint: disable=arguments-renamed, arguments-differ
---> 88 fs, fspath = self._get_fs_path(path, **kwargs)
89 return fs.open(fspath, mode=mode, encoding=encoding)
File /usr/local/lib/python3.9/site-packages/dvc/fs/data.py:65, in _DataFileSystem._get_fs_path(self, path, remote)
63 from dvc.config import NoRemoteError
---> 65 info = self.info(path)
66 if info["type"] == "directory":
File /usr/local/lib/python3.9/site-packages/dvc/fs/data.py:127, in _DataFileSystem.info(self, path, **kwargs)
126 except KeyError as exc:
--> 127 raise FileNotFoundError from exc
129 ret = {
130 "type": "file",
131 "size": 0,
(...)
135 "name": path,
136 }
FileNotFoundError:
The above exception was the direct cause of the following exception:
FileMissingError Traceback (most recent call last)
Input In [7], in <cell line: 1>()
----> 1 with open('/repo/path/to/target.ext') as file:
2 print(file.read(10))
File /usr/local/lib/python3.9/contextlib.py:119, in _GeneratorContextManager.__enter__(self)
117 del self.args, self.kwds, self.func
118 try:
--> 119 return next(self.gen)
120 except StopIteration:
121 raise RuntimeError("generator didn't yield") from None
File /usr/local/lib/python3.9/site-packages/dvc/api/data.py:198, in _open(path, repo, rev, remote, mode, encoding)
196 def _open(path, repo=None, rev=None, remote=None, mode="r", encoding=None):
197 with Repo.open(repo, rev=rev, subrepos=True, uninitialized=True) as _repo:
--> 198 with _repo.open_by_relpath(
199 path, remote=remote, mode=mode, encoding=encoding
200 ) as fd:
201 yield fd
File /usr/local/lib/python3.9/contextlib.py:119, in _GeneratorContextManager.__enter__(self)
117 del self.args, self.kwds, self.func
118 try:
--> 119 return next(self.gen)
120 except StopIteration:
121 raise RuntimeError("generator didn't yield") from None
File /usr/local/lib/python3.9/site-packages/dvc/repo/__init__.py:513, in Repo.open_by_relpath(self, path, remote, mode, encoding)
511 yield fobj
512 except FileNotFoundError as exc:
--> 513 raise FileMissingError(path) from exc
514 except IsADirectoryError as exc:
515 raise DvcIsADirectoryError(f"'{path}' is a directory") from exc
FileMissingError: Can't find '/repo/path/to/target.ext' neither locally nor on remote
Reproduce
Create a repository (e.g. at /repo) with an s3 remote; add a file (/repo/path/to/target.ext) with some lorem ipsum content; add it to DVC; push it to the remote; delete it locally (including from the cache); then
from dvc.api import open
with open('/repo/path/to/target.ext') as file:
    pass
raises the exception above, while
from dvc.api import open
with open('path/to/target.ext') as file:
    pass
works as expected (assuming the current working directory is /repo).
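Until absolute paths are supported, one possible workaround is to split the absolute path into a repository root and a repo-relative path before calling dvc.api.open. The sketch below is only an illustration: split_dvc_path is a hypothetical helper (not part of DVC), and it assumes the repository root is the nearest ancestor directory that contains a .dvc directory.

```python
from pathlib import Path
from typing import Tuple


def split_dvc_path(abs_path: str) -> Tuple[str, str]:
    """Return (repo_root, path relative to repo_root) for an absolute path.

    Hypothetical helper: the repo root is assumed to be the nearest
    ancestor directory that contains a ``.dvc`` directory. The target
    file itself does not need to exist locally.
    """
    target = Path(abs_path)
    if not target.is_absolute():
        raise ValueError(f"expected an absolute path, got {abs_path!r}")
    # Walk up from the file's directory towards the filesystem root.
    for parent in target.parents:
        if (parent / ".dvc").is_dir():
            return str(parent), target.relative_to(parent).as_posix()
    raise FileNotFoundError(f"no DVC repository found above {abs_path!r}")


# Usage (needs DVC and a configured remote, so it is left commented out):
# import dvc.api
# repo_root, rel_path = split_dvc_path("/repo/path/to/target.ext")
# with dvc.api.open(rel_path, repo=repo_root) as file:
#     print(file.read(10))
```

Passing the root explicitly through the repo argument keeps the call independent of the current working directory.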
Expected
Relative paths should be interpreted as relative to the current working directory, and absolute paths as the paths they describe.
Environment information
Output of dvc doctor:
$ dvc doctor
DVC version: 2.13.0 (pip)
---------------------------------
Platform: Python 3.9.13 on Linux-5.10.104-linuxkit-x86_64-with-glibc2.31
Supports:
webhdfs (fsspec = 2022.5.0),
http (aiohttp = 3.8.1, aiohttp-retry = 2.5.2),
https (aiohttp = 3.8.1, aiohttp-retry = 2.5.2),
s3 (s3fs = 2022.5.0, boto3 = 1.21.21),
ssh (sshfs = 2022.6.0)
Cache types: <https://error.dvc.org/no-dvc-cache>
Caches: local
Remotes: s3
Workspace directory: overlay on overlay
Repo: dvc (no_scm)
Issue Analytics
- Created a year ago
- Comments: 8 (4 by maintainers)
Top GitHub Comments
@hugo-ricateau Absolute paths in open() are considered to be external local outputs (a legacy thing). If you already have an absolute path to a file, you should be able to just open it yourself, as dvc.api.open is really meant to work with DVC repositories, with paths that are detached from where your repo is located.
Closing as expected behaviour.
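The suggestion above (open the file yourself when you already have an absolute path) only helps when the file is still in the workspace. A minimal sketch of combining both cases follows; open_local_or_dvc is a hypothetical helper, not a DVC API, and it assumes the caller knows the repository root.

```python
import os


def open_local_or_dvc(abs_path, repo_root, mode="r", encoding=None):
    """Open abs_path directly if it exists, else stream it through DVC.

    Hypothetical helper; repo_root is assumed to be the root of the
    DVC project that tracks abs_path.
    """
    if os.path.exists(abs_path):
        # The file is present in the workspace: the built-in open() suffices.
        return open(abs_path, mode=mode, encoding=encoding)

    # Imported lazily so the local branch works without DVC installed.
    import dvc.api

    # dvc.api.open expects a path relative to the root of the project.
    rel_path = os.path.relpath(abs_path, repo_root)
    return dvc.api.open(rel_path, repo=repo_root, mode=mode, encoding=encoding)
```

Both branches return an object usable in a with statement, so callers do not need to know which one was taken.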
@efiop Let’s keep it open as a feature request?
It makes sense to avoid having to pass a repo URL and a relative path separately. @efiop and I have discussed that we would both prefer to get rid of this distinction.
I can’t promise that we will get to it any time soon, though. I think the current behavior is expected and explained in the docstring:
https://github.com/iterative/dvc/blob/44a4fb59a7ef812ff84671fc68d1ae7843b9ac03/dvc/api/data.py#L73-L77
The same is explained in https://dvc.org/doc/api-reference/open#parameters: