dvc.api.open (python): fails with an absolute path

Description

When calling dvc.api.open on a file that is present on a remote but not locally (neither in the workspace nor in the cache), the call fails with the following exception if the provided path is absolute. It works fine if the provided path is relative to the DVC repository:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File /usr/local/lib/python3.9/site-packages/dvc/fs/data.py:125, in _DataFileSystem.info(self, path, **kwargs)
    124 try:
--> 125     outs = list(self.repo.index.tree.iteritems(key))  # noqa: B301
    126 except KeyError as exc:

File /usr/local/lib/python3.9/site-packages/dvc_data/objects/tree.py:106, in Tree.iteritems(self, prefix)
    104         self._load(key, meta, hash_info)
--> 106 for key, (meta, hash_info) in self._trie.iteritems(**kwargs):
    107     self._load(key, meta, hash_info)

File /usr/local/lib/python3.9/site-packages/pygtrie.py:718, in Trie.iteritems(self, prefix, shallow)
    678 """Yields all nodes with associated values with given prefix.
    679
    680 Only nodes with values are output.  For example::
   (...)
    716     KeyError: If ``prefix`` does not match any node.
    717 """
--> 718 node, _ = self._get_node(prefix)
    719 for path, value in node.iterate(list(self.__path_from_key(prefix)),
    720                                 shallow, self._iteritems):

File /usr/local/lib/python3.9/site-packages/pygtrie.py:630, in Trie._get_node(self, key)
    629 if node is None:
--> 630     raise KeyError(key)
    631 trace.append((step, node))

KeyError: ('local', '/', 'repo', 'path', 'to', 'target.ext')

The above exception was the direct cause of the following exception:

FileNotFoundError                         Traceback (most recent call last)
File /usr/local/lib/python3.9/site-packages/dvc/repo/__init__.py:505, in Repo.open_by_relpath(self, path, remote, mode, encoding)
    504 try:
--> 505     with fs.open(
    506         fs_path,
    507         mode=mode,
    508         encoding=encoding,
    509         remote=remote,
    510     ) as fobj:
    511         yield fobj

File /usr/local/lib/python3.9/site-packages/dvc_objects/fs/base.py:191, in FileSystem.open(self, path, mode, **kwargs)
    190     kwargs.pop("encoding", None)
--> 191 return self.fs.open(path, mode=mode, **kwargs)

File /usr/local/lib/python3.9/site-packages/dvc/fs/data.py:88, in _DataFileSystem.open(self, path, mode, encoding, **kwargs)
     85 def open(  # type: ignore
     86     self, path: str, mode="r", encoding=None, **kwargs
     87 ):  # pylint: disable=arguments-renamed, arguments-differ
---> 88     fs, fspath = self._get_fs_path(path, **kwargs)
     89     return fs.open(fspath, mode=mode, encoding=encoding)

File /usr/local/lib/python3.9/site-packages/dvc/fs/data.py:65, in _DataFileSystem._get_fs_path(self, path, remote)
     63 from dvc.config import NoRemoteError
---> 65 info = self.info(path)
     66 if info["type"] == "directory":

File /usr/local/lib/python3.9/site-packages/dvc/fs/data.py:127, in _DataFileSystem.info(self, path, **kwargs)
    126 except KeyError as exc:
--> 127     raise FileNotFoundError from exc
    129 ret = {
    130     "type": "file",
    131     "size": 0,
   (...)
    135     "name": path,
    136 }

FileNotFoundError:

The above exception was the direct cause of the following exception:

FileMissingError                          Traceback (most recent call last)
Input In [7], in <cell line: 1>()
----> 1 with open('/repo/path/to/target.ext') as file:
      2     print(file.read(10))

File /usr/local/lib/python3.9/contextlib.py:119, in _GeneratorContextManager.__enter__(self)
    117 del self.args, self.kwds, self.func
    118 try:
--> 119     return next(self.gen)
    120 except StopIteration:
    121     raise RuntimeError("generator didn't yield") from None

File /usr/local/lib/python3.9/site-packages/dvc/api/data.py:198, in _open(path, repo, rev, remote, mode, encoding)
    196 def _open(path, repo=None, rev=None, remote=None, mode="r", encoding=None):
    197     with Repo.open(repo, rev=rev, subrepos=True, uninitialized=True) as _repo:
--> 198         with _repo.open_by_relpath(
    199             path, remote=remote, mode=mode, encoding=encoding
    200         ) as fd:
    201             yield fd

File /usr/local/lib/python3.9/contextlib.py:119, in _GeneratorContextManager.__enter__(self)
    117 del self.args, self.kwds, self.func
    118 try:
--> 119     return next(self.gen)
    120 except StopIteration:
    121     raise RuntimeError("generator didn't yield") from None

File /usr/local/lib/python3.9/site-packages/dvc/repo/__init__.py:513, in Repo.open_by_relpath(self, path, remote, mode, encoding)
    511         yield fobj
    512 except FileNotFoundError as exc:
--> 513     raise FileMissingError(path) from exc
    514 except IsADirectoryError as exc:
    515     raise DvcIsADirectoryError(f"'{path}' is a directory") from exc

FileMissingError: Can't find '/repo/path/to/target.ext' neither locally nor on remote

Reproduce

Create a repository (e.g. at /repo) with an S3 remote; add a file (/repo/path/to/target.ext) with some lorem ipsum content; track it with DVC; push it to the remote; delete it locally (including from the cache); then

from dvc.api import open

with open('/repo/path/to/target.ext') as file:
    pass

should raise, while

from dvc.api import open

with open('path/to/target.ext') as file:
    pass

should work as expected (assuming the current working directory is /repo).

Expected

A relative path should be interpreted as relative to the current working directory, and an absolute path as the exact location it describes.

Environment information

Output of dvc doctor:

$ dvc doctor
DVC version: 2.13.0 (pip)
---------------------------------
Platform: Python 3.9.13 on Linux-5.10.104-linuxkit-x86_64-with-glibc2.31
Supports:
	webhdfs (fsspec = 2022.5.0),
	http (aiohttp = 3.8.1, aiohttp-retry = 2.5.2),
	https (aiohttp = 3.8.1, aiohttp-retry = 2.5.2),
	s3 (s3fs = 2022.5.0, boto3 = 1.21.21),
	ssh (sshfs = 2022.6.0)
Cache types: <https://error.dvc.org/no-dvc-cache>
Caches: local
Remotes: s3
Workspace directory: overlay on overlay
Repo: dvc (no_scm)

Issue Analytics

  • State:open
  • Created a year ago
  • Comments:8 (4 by maintainers)

Top GitHub Comments

efiop commented, Jul 21, 2022 (3 reactions)

@hugo-ricateau Absolute paths in open() are considered to be external local outputs (legacy thing). If you have an absolute path already to a file, you should be able to just open it yourself, as dvc.api.open is really meant to work with dvc repositories with paths that are detached from where your repo is located.

Closing as expected behaviour.

dberenbaum commented, Jul 27, 2022 (2 reactions)

@efiop Let’s keep it open as a feature request?

It makes sense to want to avoid passing a repo URL and a separate relative path. @efiop and I have discussed that we both would prefer to get rid of this distinction.

I can’t promise that we will get to it any time soon, though. I think the current behavior is expected and explained in the docstring:

https://github.com/iterative/dvc/blob/44a4fb59a7ef812ff84671fc68d1ae7843b9ac03/dvc/api/data.py#L73-L77

The same is explained in https://dvc.org/doc/api-reference/open#parameters:

path (required) - location and file name of the target to open, relative to the root of the project (repo).

repo - specifies the location of the DVC project. It can be a URL or a file system path. Both HTTP and SSH protocols are supported for online Git repos (e.g. [user@]server:project.git). Default: The current project is used (the current working directory tree is walked up to find it).
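
Given those two parameters, one possible workaround for callers who only have an absolute path is to split it into a project root and a repo-relative path before calling dvc.api.open. The sketch below is only an illustration, not something proposed in the thread: split_repo_path and open_tracked are hypothetical helper names, and the sketch assumes the project root is the nearest ancestor directory that contains a .dvc directory.

```python
from pathlib import Path


def split_repo_path(abs_path):
    """Return (repo_root, path relative to repo_root) for a path inside a DVC project.

    Hypothetical helper, not part of dvc. Assumption: the project root is the
    nearest ancestor directory that contains a ".dvc" directory.
    """
    path = Path(abs_path)
    if not path.is_absolute():
        raise ValueError(f"expected an absolute path, got {abs_path!r}")
    # Path.parents yields the closest ancestor first, so the nearest root wins.
    for parent in path.parents:
        if (parent / ".dvc").is_dir():
            return str(parent), path.relative_to(parent).as_posix()
    raise FileNotFoundError(f"no DVC project found above {abs_path!r}")


def open_tracked(abs_path, **kwargs):
    """Open a DVC-tracked file given by absolute path (hypothetical wrapper)."""
    import dvc.api  # imported lazily so split_repo_path works without dvc installed

    repo, relpath = split_repo_path(abs_path)
    # path is passed relative to the project root, as the documentation requires.
    return dvc.api.open(relpath, repo=repo, **kwargs)
```

With the repository from the reproduction steps, open_tracked('/repo/path/to/target.ext') would call dvc.api.open('path/to/target.ext', repo='/repo'), and the result can be used in a with statement like dvc.api.open itself. The target file does not need to exist locally for the split to work, since only the .dvc directory is looked up.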
