import-url from a google bucket crashes with file not found

Bug Report

import-url fails to import from bucket

Description

When using import-url to import data from a bucket, the command fails with a file not found error. DVC manages to correctly download all files but crashes afterwards. The stack trace I get while running dvc import-url gs://my-bucket data/raw -v is the following (I had to anonymize the names of the files):

Traceback (most recent call last):
  File "/home/user/.local/share/virtualenvs/projectlib/python3.10/site-packages/dvc/cli/__init__.py", line 185, in main
    ret = cmd.do_run()
  File "/home/user/.local/share/virtualenvs/projectlib/python3.10/site-packages/dvc/cli/command.py", line 22, in do_run
    return self.run()
  File "/home/user/.local/share/virtualenvs/projectlib/python3.10/site-packages/dvc/commands/imp_url.py", line 15, in run
    self.repo.imp_url(
  File "/home/user/.local/share/virtualenvs/projectlib/python3.10/site-packages/dvc/repo/__init__.py", line 48, in wrapper
    return f(repo, *args, **kwargs)
  File "/home/user/.local/share/virtualenvs/projectlib/python3.10/site-packages/dvc/repo/scm_context.py", line 156, in run
    return method(repo, *args, **kw)
  File "/home/user/.local/share/virtualenvs/projectlib/python3.10/site-packages/dvc/repo/imp_url.py", line 83, in imp_url
    stage.run(jobs=jobs)
  File "/home/user/.local/share/virtualenvs/projectlib/python3.10/site-packages/funcy/decorators.py", line 45, in wrapper
    return deco(call, *dargs, **dkwargs)
  File "/home/user/.local/share/virtualenvs/projectlib/python3.10/site-packages/dvc/stage/decorators.py", line 36, in rwlocked
    return call()
  File "/home/user/.local/share/virtualenvs/projectlib/python3.10/site-packages/funcy/decorators.py", line 66, in __call__
    return self._func(*self._args, **self._kwargs)
  File "/home/user/.local/share/virtualenvs/projectlib/python3.10/site-packages/dvc/stage/__init__.py", line 549, in run
    self.save(allow_missing=allow_missing)
  File "/home/user/.local/share/virtualenvs/projectlib/python3.10/site-packages/dvc/stage/__init__.py", line 459, in save
    self.save_deps(allow_missing=allow_missing)
  File "/home/user/.local/share/virtualenvs/projectlib/python3.10/site-packages/dvc/stage/__init__.py", line 470, in save_deps
    dep.save()
  File "/home/user/.local/share/virtualenvs/projectlib/python3.10/site-packages/dvc/output.py", line 552, in save
    _, self.meta, obj = build(
  File "/home/user/.local/share/virtualenvs/projectlib/python3.10/site-packages/dvc_data/build.py", line 240, in build
    meta, obj = _build_tree(
  File "/home/user/.local/share/virtualenvs/projectlib/python3.10/site-packages/dvc_data/build.py", line 137, in _build_tree
    meta, obj = _build_file(
  File "/home/user/.local/share/virtualenvs/projectlib/python3.10/site-packages/dvc_data/build.py", line 64, in _build_file
    meta, hash_info = hash_file(path, fs, name, state=state)
  File "/home/user/.local/share/virtualenvs/projectlib/python3.10/site-packages/dvc_data/hashfile/hash.py", line 178, in hash_file
    hash_value, info = _hash_file(path, fs, name, callback=cb)
  File "/home/user/.local/share/virtualenvs/projectlib/python3.10/site-packages/dvc_data/hashfile/hash.py", line 123, in _hash_file
    info = _adapt_info(fs.info(path), fs.protocol)
  File "/home/user/.local/share/virtualenvs/projectlib/python3.10/site-packages/dvc_objects/fs/base.py", line 346, in info
    return self.fs.info(path)
  File "/home/user/.local/share/virtualenvs/projectlib/python3.10/site-packages/fsspec/asyn.py", line 86, in wrapper
    return sync(self.loop, func, *args, **kwargs)
  File "/home/user/.local/share/virtualenvs/projectlib/python3.10/site-packages/fsspec/asyn.py", line 66, in sync
    raise return_result
  File "/home/user/.local/share/virtualenvs/projectlib/python3.10/site-packages/fsspec/asyn.py", line 26, in _runner
    result[0] = await coro
  File "/home/user/.local/share/virtualenvs/projectlib/python3.10/site-packages/gcsfs/core.py", line 706, in _info
    if self._ls_from_cache(path):
  File "/home/user/.local/share/virtualenvs/projectlib/python3.10/site-packages/fsspec/spec.py", line 362, in _ls_from_cache
    raise FileNotFoundError(path)
FileNotFoundError: some_file_that_doesnt_exist.xlsx

In case it helps: the file that DVC claims is missing used to exist in the bucket at some point in the past, and some files inside subfolders of the bucket have the same name. After downgrading to dvc 2.12.0, the issue goes away.
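The last two frames of the traceback show the error being raised from a listing-cache check (_ls_from_cache) rather than from a fresh request to the bucket. The following is a toy model of that kind of check, written from scratch for illustration. It is not gcsfs or fsspec code, and the class, method, and file names are hypothetical. It assumes that a cached parent listing is treated as authoritative, which is one way a lookup can fail without the bucket being queried again. It does not establish the root cause of this issue.

```python
class ListingCache:
    """Toy stand-in for a filesystem object that caches directory listings."""

    def __init__(self):
        # Maps a directory path to the entry names seen when it was listed.
        self.dircache = {}

    def remember_listing(self, directory, names):
        """Store a snapshot of a directory listing."""
        self.dircache[directory] = list(names)

    def info(self, path):
        """Look up a path, trusting a cached parent listing if one exists."""
        parent, _, name = path.rpartition("/")
        if parent in self.dircache:
            if name not in self.dircache[parent]:
                # The parent was listed earlier and did not contain this name,
                # so the lookup fails without contacting the remote.
                raise FileNotFoundError(path)
            return {"name": path, "source": "cache"}
        # No cached listing: a real filesystem would query the remote here.
        return {"name": path, "source": "remote lookup"}


if __name__ == "__main__":
    fs = ListingCache()
    fs.remember_listing("my-bucket", ["a.csv"])  # hypothetical snapshot
    try:
        fs.info("my-bucket/report.xlsx")
    except FileNotFoundError as exc:
        print("lookup failed from cache:", exc)
    fs.dircache.clear()  # dropping the snapshot forces a fresh lookup
    print(fs.info("my-bucket/report.xlsx")["source"])
```

In this model, clearing the cached snapshot is what lets the second lookup go through. Whether anything comparable applies to the affected DVC versions is not confirmed by the report.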

Reproduce

dvc import-url gs://my-bucket data/raw

Expected

The command completes successfully.

Environment information

Output of dvc doctor:

$ dvc doctor
DVC version: 2.13.0 (pip)
---------------------------------
Platform: Python 3.10.4 on Linux-5.15.0-1010-gcp-x86_64-with-glibc2.35
Supports:
        gs (gcsfs = 2022.5.0),
        webhdfs (fsspec = 2022.5.0),
        http (aiohttp = 3.8.1, aiohttp-retry = 2.5.2),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.5.2)
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/sda1
Caches: local
Remotes: None
Workspace directory: ext4 on /dev/sda1
Repo: dvc, git

Additional Information (if any):

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 12 (9 by maintainers)

Top GitHub Comments

dtrifiro commented, Aug 31, 2022 (1 reaction)

The fix was included in dvc-gs==2.19.1, which ships starting with dvc==2.22.0.
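To check whether a given environment already has a release with the fix, a minimal sketch along these lines may help. It assumes plain X.Y.Z version strings and ignores pre-release suffixes. The helper names parse_release and includes_fix are made up for this example. Note that it only tests for dvc 2.22.0 or later. The reporter found 2.12.0 unaffected, and this check does not capture that.

```python
from importlib.metadata import PackageNotFoundError, version
from itertools import takewhile

# First dvc release that ships the fix, per the maintainer comment above.
FIXED_IN = (2, 22, 0)


def parse_release(version_string):
    """Turn '2.13.0' into (2, 13, 0), keeping only leading digits of each part."""
    parts = []
    for piece in version_string.split(".")[:3]:
        digits = "".join(takewhile(str.isdigit, piece))
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)  # pad short versions such as '3.0'
    return tuple(parts)


def includes_fix(version_string):
    """Return True if the version is at or above the first fixed release."""
    return parse_release(version_string) >= FIXED_IN


if __name__ == "__main__":
    try:
        installed = version("dvc")
    except PackageNotFoundError:
        print("dvc is not installed in this environment")
    else:
        status = "includes" if includes_fix(installed) else "predates"
        print(f"dvc {installed} {status} the fix")
```

For the version in the report, includes_fix("2.13.0") is False, so upgrading to dvc 2.22.0 or later would be the suggested path.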

dtrifiro commented, Aug 11, 2022 (1 reaction)

