dvc.api.open: fails to open a file within a dvc tracked directory from a remote storage

Bug Report

dvc.api.open fails with a FileMissingError when trying to open a file within a dvc tracked directory from a remote storage (s3).

Description

Reproduce

  • Within an initialized and configured git and dvc repository, create a test directory containing the file test/foo
  • dvc add test && dvc push
  • Delete the test directory
  • Run:
from dvc.api import open

with open('test/foo', mode='r') as file:
    pass

Environment information

Output of dvc doctor:

DVC version: 2.21.0 (pip)
---------------------------------
Platform: Python 3.9.13 on macOS-12.2.1-x86_64-i386-64bit
Supports:
	http (aiohttp = 3.8.1, aiohttp-retry = 2.8.3),
	https (aiohttp = 3.8.1, aiohttp-retry = 2.8.3),
	s3 (s3fs = 2022.7.1, boto3 = 1.21.21),
	ssh (sshfs = 2022.6.0)
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk1s5s1
Caches: local
Remotes: s3
Workspace directory: apfs on /dev/disk1s5s1
Repo: dvc, git

Generated error

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
File ~/.pyenv/versions/data-science3.9/lib/python3.9/site-packages/dvc/repo/__init__.py:559, in Repo.open_by_relpath(self, path, remote, mode, encoding)
    557     fs_path = remote_odb.oid_to_path(oid)
--> 559 with fs.open(
    560     fs_path,
    561     mode=mode,
    562     encoding=encoding,
    563 ) as fobj:
    564     yield fobj

File ~/.pyenv/versions/data-science3.9/lib/python3.9/site-packages/dvc_objects/fs/base.py:191, in FileSystem.open(self, path, mode, **kwargs)
    190     kwargs.pop("encoding", None)
--> 191 return self.fs.open(path, mode=mode, **kwargs)

File ~/.pyenv/versions/data-science3.9/lib/python3.9/site-packages/dvc/fs/dvc.py:328, in _DvcFileSystem.open(self, path, mode, encoding, **kwargs)
    326         raise
--> 328 return dvc_fs.open(dvc_path, mode=mode, encoding=encoding, **kwargs)

File ~/.pyenv/versions/data-science3.9/lib/python3.9/site-packages/dvc_objects/fs/base.py:191, in FileSystem.open(self, path, mode, **kwargs)
    190     kwargs.pop("encoding", None)
--> 191 return self.fs.open(path, mode=mode, **kwargs)

File ~/.pyenv/versions/data-science3.9/lib/python3.9/site-packages/dvc_data/fs.py:70, in DataFileSystem.open(self, path, mode, encoding, **kwargs)
     67 def open(  # type: ignore
     68     self, path: str, mode="r", encoding=None, **kwargs
     69 ):  # pylint: disable=arguments-renamed, arguments-differ
---> 70     fs, fspath = self._get_fs_path(path, **kwargs)
     71     return fs.open(fspath, mode=mode, encoding=encoding)

File ~/.pyenv/versions/data-science3.9/lib/python3.9/site-packages/dvc_data/fs.py:52, in DataFileSystem._get_fs_path(self, path)
     51 if not value:
---> 52     raise FileNotFoundError
     54 entry = info["entry"]

FileNotFoundError: 

The above exception was the direct cause of the following exception:

FileMissingError                          Traceback (most recent call last)
Input In [2], in <cell line: 1>()
----> 1 with open('test/foo', mode='r') as f:
      2     print(f.read())

File ~/.pyenv/versions/3.9.13/lib/python3.9/contextlib.py:119, in _GeneratorContextManager.__enter__(self)
    117 del self.args, self.kwds, self.func
    118 try:
--> 119     return next(self.gen)
    120 except StopIteration:
    121     raise RuntimeError("generator didn't yield") from None

File ~/.pyenv/versions/data-science3.9/lib/python3.9/site-packages/dvc/api/data.py:198, in _open(path, repo, rev, remote, mode, encoding)
    196 def _open(path, repo=None, rev=None, remote=None, mode="r", encoding=None):
    197     with Repo.open(repo, rev=rev, subrepos=True, uninitialized=True) as _repo:
--> 198         with _repo.open_by_relpath(
    199             path, remote=remote, mode=mode, encoding=encoding
    200         ) as fd:
    201             yield fd

File ~/.pyenv/versions/3.9.13/lib/python3.9/contextlib.py:119, in _GeneratorContextManager.__enter__(self)
    117 del self.args, self.kwds, self.func
    118 try:
--> 119     return next(self.gen)
    120 except StopIteration:
    121     raise RuntimeError("generator didn't yield") from None

File ~/.pyenv/versions/data-science3.9/lib/python3.9/site-packages/dvc/repo/__init__.py:566, in Repo.open_by_relpath(self, path, remote, mode, encoding)
    564         yield fobj
    565 except FileNotFoundError as exc:
--> 566     raise FileMissingError(path) from exc
    567 except IsADirectoryError as exc:
    568     raise DvcIsADirectoryError(f"'{path}' is a directory") from exc

FileMissingError: Can't find 'test/foo' neither locally nor on remote
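
For anyone reading the two tracebacks above: they are linked by Python exception chaining, so the bare FileNotFoundError raised in dvc_data/fs.py is the root cause and FileMissingError is only the wrapper. Below is a minimal standalone sketch of that pattern. It is not DVC's code: FileMissingError here is a local stand-in class, and lookup and root_cause are hypothetical helpers added for illustration.

```python
class FileMissingError(Exception):
    """Local stand-in for DVC's exception of the same name (illustration only)."""

    def __init__(self, path):
        super().__init__(f"Can't find '{path}' neither locally nor on remote")
        self.path = path


def lookup(path):
    # Hypothetical low-level lookup that finds no hash value for the path.
    raise FileNotFoundError


def open_by_relpath(path):
    # Same `raise ... from exc` pattern that is visible in the traceback.
    try:
        return lookup(path)
    except FileNotFoundError as exc:
        raise FileMissingError(path) from exc


def root_cause(path):
    # Catch the wrapper and return it together with the chained original.
    try:
        open_by_relpath(path)
    except FileMissingError as err:
        return err, err.__cause__
    return None, None
```

Inspecting `__cause__` on the caught exception gives the original error, which is why the first traceback is the one to debug.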

Potential Fix

According to the traceback, the failure occurs in the _get_fs_path method of dvc_data/fs.py, when it reads the file’s hash value from the info object. This works for a single tracked file, because in that case the info object holds a HashInfo object with an md5 key. For a file inside a dvc tracked directory, however, the HashInfo object in info associates the file’s hash with the etag key. Changing how _get_fs_path computes value, as follows, solved the problem for me:

    def _get_fs_path(self, path: "AnyFSPath"):
        info = self.info(path)
        if info["type"] == "directory":
            raise IsADirectoryError

        value = info.get("md5") or info.get("etag")
        if not value:
            raise FileNotFoundError

        entry = info["entry"]

        cache_path = entry.odb.oid_to_path(value)

        if entry.odb.fs.exists(cache_path):
            return entry.odb.fs, cache_path

        if not entry.remote:
            raise FileNotFoundError

        remote_fs_path = entry.remote.oid_to_path(value)
        return entry.remote.fs, remote_fs_path
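
The key change in the patch above is the fallback from the md5 key to the etag key. Here is a minimal standalone sketch of just that lookup, using plain dicts in place of DVC's info object. The names pick_hash and resolve, the dict layouts, and the hash strings are all hypothetical placeholders, and that directory entries carry an etag key is the reporter's observation, not something verified here.

```python
def pick_hash(info):
    """Return the first non-empty value under 'md5' or 'etag', else None."""
    return info.get("md5") or info.get("etag")


def resolve(info):
    # Same order of checks as the patched method: directory first, then hash.
    if info.get("type") == "directory":
        raise IsADirectoryError
    value = pick_hash(info)
    if not value:
        raise FileNotFoundError
    return value


# Hypothetical info dicts; the hash strings are placeholders, not real hashes.
single_file = {"type": "file", "md5": "aaa111"}
file_in_dir = {"type": "file", "etag": "bbb222"}
no_hash = {"type": "file"}
```

With these inputs, resolve returns the md5 value for single_file, falls back to the etag value for file_in_dir, and raises FileNotFoundError for no_hash, which is the case reported in this issue.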

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 12 (4 by maintainers)

Top GitHub Comments

francoispichot1 commented, Nov 24, 2022

I think I know what caused the problem in the first place, but I can’t explain it precisely, @efiop.

I pushed my data to an old bucket some time ago and was trying to open it from another bucket to which the data had been copied. It seems the copy modified something that prevents me from opening it. It’s really weird, because I could dvc pull my data but couldn’t dvc.api.open it. I haven’t found the origin of the problem yet; closing for now.

daavoo commented, Oct 11, 2022

It looks like the problem in your case was introduced in https://github.com/iterative/dvc/pull/7353, but I am still unable to reproduce it.
