dvc.api.open: fails to open a file within a dvc tracked directory from a remote storage
Bug Report
dvc.api.open fails with a FileMissingError when trying to open a file within a DVC-tracked directory from remote storage (s3).
Description
Reproduce
- Within an initialized and configured git and dvc repository, create a test directory containing the file test/foo
- Run dvc add test && dvc push
- Delete the test directory
- Run:
from dvc.api import open
with open('test/foo', mode='r') as file:
    pass
Environment information
Output of dvc doctor:
DVC version: 2.21.0 (pip)
---------------------------------
Platform: Python 3.9.13 on macOS-12.2.1-x86_64-i386-64bit
Supports:
http (aiohttp = 3.8.1, aiohttp-retry = 2.8.3),
https (aiohttp = 3.8.1, aiohttp-retry = 2.8.3),
s3 (s3fs = 2022.7.1, boto3 = 1.21.21),
ssh (sshfs = 2022.6.0)
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk1s5s1
Caches: local
Remotes: s3
Workspace directory: apfs on /dev/disk1s5s1
Repo: dvc, git
Generated error
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
File ~/.pyenv/versions/data-science3.9/lib/python3.9/site-packages/dvc/repo/__init__.py:559, in Repo.open_by_relpath(self, path, remote, mode, encoding)
557 fs_path = remote_odb.oid_to_path(oid)
--> 559 with fs.open(
560 fs_path,
561 mode=mode,
562 encoding=encoding,
563 ) as fobj:
564 yield fobj
File ~/.pyenv/versions/data-science3.9/lib/python3.9/site-packages/dvc_objects/fs/base.py:191, in FileSystem.open(self, path, mode, **kwargs)
190 kwargs.pop("encoding", None)
--> 191 return self.fs.open(path, mode=mode, **kwargs)
File ~/.pyenv/versions/data-science3.9/lib/python3.9/site-packages/dvc/fs/dvc.py:328, in _DvcFileSystem.open(self, path, mode, encoding, **kwargs)
326 raise
--> 328 return dvc_fs.open(dvc_path, mode=mode, encoding=encoding, **kwargs)
File ~/.pyenv/versions/data-science3.9/lib/python3.9/site-packages/dvc_objects/fs/base.py:191, in FileSystem.open(self, path, mode, **kwargs)
190 kwargs.pop("encoding", None)
--> 191 return self.fs.open(path, mode=mode, **kwargs)
File ~/.pyenv/versions/data-science3.9/lib/python3.9/site-packages/dvc_data/fs.py:70, in DataFileSystem.open(self, path, mode, encoding, **kwargs)
67 def open( # type: ignore
68 self, path: str, mode="r", encoding=None, **kwargs
69 ): # pylint: disable=arguments-renamed, arguments-differ
---> 70 fs, fspath = self._get_fs_path(path, **kwargs)
71 return fs.open(fspath, mode=mode, encoding=encoding)
File ~/.pyenv/versions/data-science3.9/lib/python3.9/site-packages/dvc_data/fs.py:52, in DataFileSystem._get_fs_path(self, path)
51 if not value:
---> 52 raise FileNotFoundError
54 entry = info["entry"]
FileNotFoundError:
The above exception was the direct cause of the following exception:
FileMissingError Traceback (most recent call last)
Input In [2], in <cell line: 1>()
----> 1 with open('test/foo', mode='r') as f:
2 print(f.read())
File ~/.pyenv/versions/3.9.13/lib/python3.9/contextlib.py:119, in _GeneratorContextManager.__enter__(self)
117 del self.args, self.kwds, self.func
118 try:
--> 119 return next(self.gen)
120 except StopIteration:
121 raise RuntimeError("generator didn't yield") from None
File ~/.pyenv/versions/data-science3.9/lib/python3.9/site-packages/dvc/api/data.py:198, in _open(path, repo, rev, remote, mode, encoding)
196 def _open(path, repo=None, rev=None, remote=None, mode="r", encoding=None):
197 with Repo.open(repo, rev=rev, subrepos=True, uninitialized=True) as _repo:
--> 198 with _repo.open_by_relpath(
199 path, remote=remote, mode=mode, encoding=encoding
200 ) as fd:
201 yield fd
File ~/.pyenv/versions/3.9.13/lib/python3.9/contextlib.py:119, in _GeneratorContextManager.__enter__(self)
117 del self.args, self.kwds, self.func
118 try:
--> 119 return next(self.gen)
120 except StopIteration:
121 raise RuntimeError("generator didn't yield") from None
File ~/.pyenv/versions/data-science3.9/lib/python3.9/site-packages/dvc/repo/__init__.py:566, in Repo.open_by_relpath(self, path, remote, mode, encoding)
564 yield fobj
565 except FileNotFoundError as exc:
--> 566 raise FileMissingError(path) from exc
567 except IsADirectoryError as exc:
568 raise DvcIsADirectoryError(f"'{path}' is a directory") from exc
FileMissingError: Can't find 'test/foo' neither locally nor on remote
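The two stacked tracebacks above are the result of Python exception chaining: a low-level FileNotFoundError is re-raised as FileMissingError with `raise ... from exc`. The sketch below reproduces only that pattern in isolation. The `FileMissingError` class, `lookup_hash` and `open_by_relpath` here are simplified stand-ins written for illustration, not DVC's actual code, and the md5-only lookup is an assumption based on the description in this issue.

```python
class FileMissingError(Exception):
    """Stand-in for DVC's exception; the message format is copied from the traceback."""

    def __init__(self, path):
        self.path = path
        super().__init__(f"Can't find '{path}' neither locally nor on remote")


def lookup_hash(info):
    # Assumption: the failing code consults only the "md5" key.
    value = info.get("md5")
    if not value:
        raise FileNotFoundError
    return value


def open_by_relpath(path, info):
    # Hypothetical simplification: returns the hash instead of a file object.
    try:
        return lookup_hash(info)
    except FileNotFoundError as exc:
        # "from exc" is what produces "The above exception was the direct cause..."
        raise FileMissingError(path) from exc


caught = None
try:
    # The hash is stored under "etag", so the md5-only lookup fails.
    open_by_relpath("test/foo", {"etag": "0123abcd"})
except FileMissingError as err:
    caught = err
    print(err)
    print(type(err.__cause__).__name__)
```

Inspecting `__cause__` on the caught exception is a quick way to tell whether the file is really absent or whether the lookup failed earlier, as it does here.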
Potential Fix
According to the error, the problem occurs when trying to get the file's hash value through the info object in the _get_fs_path method of dvc_data/fs.py. It works fine with a single file, because in that case the info object generates a HashInfo object with an md5 key. However, when trying to load the file from within a DVC-tracked directory, the HashInfo object in info associates the file's hash with the etag key. Changing how value is computed in _get_fs_path this way solved the problem for me:
def _get_fs_path(self, path: "AnyFSPath"):
    info = self.info(path)
    if info["type"] == "directory":
        raise IsADirectoryError

    # Changed line: fall back to the etag key when md5 is absent
    value = info.get("md5") or info.get("etag")
    if not value:
        raise FileNotFoundError

    entry = info["entry"]

    # Prefer the local cache if the object is already there
    cache_path = entry.odb.oid_to_path(value)
    if entry.odb.fs.exists(cache_path):
        return entry.odb.fs, cache_path

    if not entry.remote:
        raise FileNotFoundError

    # Otherwise resolve the object's path on the remote
    remote_fs_path = entry.remote.oid_to_path(value)
    return entry.remote.fs, remote_fs_path
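To make the lookup order in the fix above easier to follow without a DVC installation, here is a minimal sketch that models it with plain Python objects. Everything in it is hypothetical: `resolve_location`, the `local_cache` set and the `remote` set are illustrative stand-ins for the object databases, not part of dvc_data.

```python
def resolve_location(info, local_cache, remote=None):
    """Return ("cache" | "remote", hash) following the same order as the fix.

    info        -- dict with a "type" key and an optional "md5" or "etag" hash
    local_cache -- set of hashes present locally (stand-in for entry.odb)
    remote      -- set of hashes on the remote, or None (stand-in for entry.remote)
    """
    if info["type"] == "directory":
        raise IsADirectoryError

    # Accept the hash under either key, preferring md5.
    value = info.get("md5") or info.get("etag")
    if not value:
        raise FileNotFoundError

    if value in local_cache:
        return "cache", value

    if remote is None:
        raise FileNotFoundError

    # As in the fix, the remote path is returned without an existence check.
    return "remote", value


# A file whose hash is stored under "etag" and is not cached locally.
print(resolve_location({"type": "file", "etag": "0123abcd"}, set(), {"0123abcd"}))
```

With an md5-only lookup, the same call would raise FileNotFoundError before the remote is ever consulted, which matches the behavior reported in this issue.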
Issue Analytics
- State:
- Created a year ago
- Comments:12 (4 by maintainers)
Top GitHub Comments
I think I know what caused the problem in the first place, but I can't explain it precisely, @efiop. I pushed my data into an old bucket some time ago, and was trying to open it from another bucket where this data had been copied. It seems the copy modified something that prevents me from opening it. It's really weird because I could dvc pull my data but couldn't dvc.api.open it. I haven't found the origin of the problem yet, closing for now.

Looks like the problem in your case was introduced in https://github.com/iterative/dvc/pull/7353 but I am still unable to reproduce