import-url from a google bucket crashes with file not found
Bug Report
import-url fails to import from bucket
Description
When using import-url to import data from a bucket, the command fails with a file not found error. DVC downloads all files correctly but crashes afterwards.
The stack trace I get while running dvc import-url gs://my-bucket data/raw -v is the following (I had to anonymize the names of the files):
Traceback (most recent call last):
  File "/home/user/.local/share/virtualenvs/projectlib/python3.10/site-packages/dvc/cli/__init__.py", line 185, in main
    ret = cmd.do_run()
  File "/home/user/.local/share/virtualenvs/projectlib/python3.10/site-packages/dvc/cli/command.py", line 22, in do_run
    return self.run()
  File "/home/user/.local/share/virtualenvs/projectlib/python3.10/site-packages/dvc/commands/imp_url.py", line 15, in run
    self.repo.imp_url(
  File "/home/user/.local/share/virtualenvs/projectlib/python3.10/site-packages/dvc/repo/__init__.py", line 48, in wrapper
    return f(repo, *args, **kwargs)
  File "/home/user/.local/share/virtualenvs/projectlib/python3.10/site-packages/dvc/repo/scm_context.py", line 156, in run
    return method(repo, *args, **kw)
  File "/home/user/.local/share/virtualenvs/projectlib/python3.10/site-packages/dvc/repo/imp_url.py", line 83, in imp_url
    stage.run(jobs=jobs)
  File "/home/user/.local/share/virtualenvs/projectlib/python3.10/site-packages/funcy/decorators.py", line 45, in wrapper
    return deco(call, *dargs, **dkwargs)
  File "/home/user/.local/share/virtualenvs/projectlib/python3.10/site-packages/dvc/stage/decorators.py", line 36, in rwlocked
    return call()
  File "/home/user/.local/share/virtualenvs/projectlib/python3.10/site-packages/funcy/decorators.py", line 66, in __call__
    return self._func(*self._args, **self._kwargs)
  File "/home/user/.local/share/virtualenvs/projectlib/python3.10/site-packages/dvc/stage/__init__.py", line 549, in run
    self.save(allow_missing=allow_missing)
  File "/home/user/.local/share/virtualenvs/projectlib/python3.10/site-packages/dvc/stage/__init__.py", line 459, in save
    self.save_deps(allow_missing=allow_missing)
  File "/home/user/.local/share/virtualenvs/projectlib/python3.10/site-packages/dvc/stage/__init__.py", line 470, in save_deps
    dep.save()
  File "/home/user/.local/share/virtualenvs/projectlib/python3.10/site-packages/dvc/output.py", line 552, in save
    _, self.meta, obj = build(
  File "/home/user/.local/share/virtualenvs/projectlib/python3.10/site-packages/dvc_data/build.py", line 240, in build
    meta, obj = _build_tree(
  File "/home/user/.local/share/virtualenvs/projectlib/python3.10/site-packages/dvc_data/build.py", line 137, in _build_tree
    meta, obj = _build_file(
  File "/home/user/.local/share/virtualenvs/projectlib/python3.10/site-packages/dvc_data/build.py", line 64, in _build_file
    meta, hash_info = hash_file(path, fs, name, state=state)
  File "/home/user/.local/share/virtualenvs/projectlib/python3.10/site-packages/dvc_data/hashfile/hash.py", line 178, in hash_file
    hash_value, info = _hash_file(path, fs, name, callback=cb)
  File "/home/user/.local/share/virtualenvs/projectlib/python3.10/site-packages/dvc_data/hashfile/hash.py", line 123, in _hash_file
    info = _adapt_info(fs.info(path), fs.protocol)
  File "/home/user/.local/share/virtualenvs/projectlib/python3.10/site-packages/dvc_objects/fs/base.py", line 346, in info
    return self.fs.info(path)
  File "/home/user/.local/share/virtualenvs/projectlib/python3.10/site-packages/fsspec/asyn.py", line 86, in wrapper
    return sync(self.loop, func, *args, **kwargs)
  File "/home/user/.local/share/virtualenvs/projectlib/python3.10/site-packages/fsspec/asyn.py", line 66, in sync
    raise return_result
  File "/home/user/.local/share/virtualenvs/projectlib/python3.10/site-packages/fsspec/asyn.py", line 26, in _runner
    result[0] = await coro
  File "/home/user/.local/share/virtualenvs/projectlib/python3.10/site-packages/gcsfs/core.py", line 706, in _info
    if self._ls_from_cache(path):
  File "/home/user/.local/share/virtualenvs/projectlib/python3.10/site-packages/fsspec/spec.py", line 362, in _ls_from_cache
    raise FileNotFoundError(path)
FileNotFoundError: some_file_that_doesnt_exist.xlsx
In case it helps: the file that DVC claims is missing did exist in the bucket at some point in the past, and some files inside subfolders of the bucket have the same name. Downgrading to dvc 2.12.0 makes the issue go away.
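The last frames of the traceback show the FileNotFoundError being raised from a listing-cache lookup (`_ls_from_cache`) rather than from a direct request to the bucket. As a rough illustration of how a lookup of that kind can report a file as missing, here is a simplified sketch. It is a hypothetical model, not the fsspec or gcsfs implementation, and it does not establish the root cause of this issue; the function name `ls_from_cache` and the `dircache` layout are assumptions made for the example.

```python
# Simplified, hypothetical model of a directory-listing cache lookup.
# This is NOT the actual fsspec/gcsfs code; names and layout are assumed.


def ls_from_cache(dircache, path):
    """Look up `path` in a cache of directory listings.

    Returns the cached entry if found, or None if the parent directory
    was never listed (the cache cannot answer). Raises FileNotFoundError
    when the parent listing is cached but does not contain `path`.
    """
    parent = path.rsplit("/", 1)[0] if "/" in path else ""
    if parent not in dircache:
        return None  # no cached listing; a real client would query the backend
    for entry in dircache[parent]:
        if entry["name"] == path:
            return entry
    # The parent was listed, but this name is not in the cached listing.
    raise FileNotFoundError(path)


# One cached listing for the bucket root, containing a single file.
dircache = {"my-bucket": [{"name": "my-bucket/a.csv", "size": 10}]}

print(ls_from_cache(dircache, "my-bucket/a.csv"))
print(ls_from_cache(dircache, "other-bucket/x.csv"))
try:
    ls_from_cache(dircache, "my-bucket/missing.xlsx")
except FileNotFoundError as exc:
    print("FileNotFoundError:", exc)
```

In this model, whether a file is reported as missing depends entirely on what the cached listing of its parent contains, so a listing that disagrees with the path being checked is enough to trigger the error.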
Reproduce
dvc import-url gs://my-bucket data/raw
Expected
The command completes successfully.
Environment information
Output of dvc doctor:
$ dvc doctor
DVC version: 2.13.0 (pip)
---------------------------------
Platform: Python 3.10.4 on Linux-5.15.0-1010-gcp-x86_64-with-glibc2.35
Supports:
gs (gcsfs = 2022.5.0),
webhdfs (fsspec = 2022.5.0),
http (aiohttp = 3.8.1, aiohttp-retry = 2.5.2),
https (aiohttp = 3.8.1, aiohttp-retry = 2.5.2)
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/sda1
Caches: local
Remotes: None
Workspace directory: ext4 on /dev/sda1
Repo: dvc, git
Additional Information (if any):
Top GitHub Comments
The fix was included in dvc-gs==2.19.1, which is included starting with dvc==2.22.0.
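To check whether an environment already has the fixed plugin, something like the following sketch could be used. It assumes the plugin is installed under the distribution name `dvc-gs` (as the comment above implies) and uses a deliberately naive version parser; the helper names `parse`, `has_fix`, and `installed_has_fix` are hypothetical and not part of DVC.

```python
from importlib.metadata import PackageNotFoundError, version

# First dvc-gs release reported to contain the fix (from the comment above).
FIXED_DVC_GS = (2, 19, 1)


def parse(v):
    """Turn '2.19.1' into (2, 19, 1).

    Naive on purpose: stops at the first non-numeric component and ignores
    suffixes, so a pre-release such as '2.19.1rc1' compares as 2.19.1.
    """
    parts = []
    for piece in v.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def has_fix(dvc_gs_version):
    """True if the given dvc-gs version string is at least 2.19.1."""
    return parse(dvc_gs_version) >= FIXED_DVC_GS


def installed_has_fix():
    """Check the installed dvc-gs; returns None if it is not installed."""
    try:
        return has_fix(version("dvc-gs"))
    except PackageNotFoundError:
        return None


print("dvc-gs has the fix:", installed_has_fix())
```

A result of None means the `dvc-gs` distribution was not found in the current environment, which says nothing either way about whether the fix is present.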
Waiting on https://github.com/fsspec/gcsfs/pull/488