dvc pull returns "failed to pull data" when the data exists on remote

Bug Report

Description

dvc pull (also tried with the -R option) fails to pull remote data based on .dvc files in sub-directories and returns ERROR: failed to pull data from the cloud - Checkout failed for following targets:... However, when I run the pull command on the failed files individually, it succeeds.

(onboarding_models) radion@MacBook-Pro-Radion anna-datascience % dvc pull
WARNING: Some of the cache files do not exist neither locally nor on remote. Missing cache files:                                                                                                                                                                                                                                                                    
name: document_labelling_utils/annotation_results/1000_recent_documents_20210413.json, md5: 06a0a6ef5b6446a33623a544ede8bbfd
1 file failed                                                                                                                                                                                                                                                                                                                                                        
ERROR: failed to pull data from the cloud - Checkout failed for following targets:
document_labelling_utils/annotation_results/1000_recent_documents_20210413.json
Is your cache up to date?
<https://error.dvc.org/missing-files>
(onboarding_models) radion@MacBook-Pro-Radion anna-datascience % dvc pull -R
WARNING: Some of the cache files do not exist neither locally nor on remote. Missing cache files:                                                                                                                                                                                                                                                                    
name: document_labelling_utils/annotation_results/1000_recent_documents_20210413.json, md5: 06a0a6ef5b6446a33623a544ede8bbfd
1 file failed                                                                                                                                                                                                                                                                                                                                                        
ERROR: failed to pull data from the cloud - Checkout failed for following targets:
document_labelling_utils/annotation_results/1000_recent_documents_20210413.json
Is your cache up to date?
<https://error.dvc.org/missing-files>
(onboarding_models) radion@MacBook-Pro-Radion anna-datascience % dvc pull document_labelling_utils/annotation_results/1000_recent_documents_20210413.json.dvc 
A       document_labelling_utils/annotation_results/1000_recent_documents_20210413.json                                                                                                                                                                                                                                                                              
1 file added and 1 file fetched                                                                                                                                                                                                                                                                                                                                      
(onboarding_models) radion@MacBook-Pro-Radion anna-datascience % dvc pull                                                                                    
Everything is up to date.                                                                                                                                                                                                                                                                                                                                            
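
To check by hand whether the object named in the warning really exists on the remote, it helps to know where DVC would store it. The sketch below assumes the DVC 2.x layout, in which an object sits in a directory named after the first two hex characters of its md5. The remote URL gs://my-bucket/dvc-store is a placeholder and does not come from the report.

```python
def remote_object_path(remote_url: str, md5: str) -> str:
    """Return the expected location of a cache object on a DVC 2.x remote.

    Assumes the <first two hex chars>/<remaining chars> layout.
    """
    if len(md5) != 32 or any(c not in "0123456789abcdef" for c in md5):
        raise ValueError(f"not a lowercase md5 hex digest: {md5!r}")
    return f"{remote_url.rstrip('/')}/{md5[:2]}/{md5[2:]}"


if __name__ == "__main__":
    # Placeholder remote URL; substitute the value shown by `dvc remote list`.
    print(remote_object_path(
        "gs://my-bucket/dvc-store",
        "06a0a6ef5b6446a33623a544ede8bbfd",
    ))
```

The printed path can then be looked up with a storage tool such as gsutil ls to see whether the remote really lacks the file.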

Expected

I expect dvc pull to download missing files from sub-directories without the need to run it on each .dvc file.
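
Until the bulk pull works, the per-file behaviour described above can be scripted. This is a minimal workaround sketch, assuming dvc is on PATH and the script is run from the repository root; the helper names find_dvc_files and pull_each are made up for illustration.

```python
import subprocess
import sys
from pathlib import Path


def find_dvc_files(root: Path) -> list[Path]:
    """Collect every *.dvc file under root, skipping the .dvc and .git directories."""
    skipped = {".dvc", ".git"}
    return sorted(
        path
        for path in root.rglob("*.dvc")
        if path.is_file()
        and not skipped.intersection(path.relative_to(root).parts)
    )


def pull_each(root: Path) -> list[Path]:
    """Run `dvc pull <file>` once per .dvc file and return the files that failed."""
    failed = []
    for dvc_file in find_dvc_files(root):
        target = dvc_file.relative_to(root).as_posix()
        result = subprocess.run(["dvc", "pull", target], cwd=root)
        if result.returncode != 0:
            failed.append(dvc_file)
    return failed


if __name__ == "__main__":
    failures = pull_each(Path.cwd())
    for path in failures:
        print(f"failed: {path}", file=sys.stderr)
    sys.exit(1 if failures else 0)
```

Running one pull per file is slower than a single dvc pull, so this is only a stopgap, not a replacement for the expected behaviour.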

Environment information

Output of dvc doctor:

DVC version: 2.7.2 (brew)
---------------------------------
Platform: Python 3.9.7 on macOS-11.2.1-x86_64-i386-64bit
Supports:
        azure (adlfs = 2021.8.2, knack = 0.8.2, azure-identity = 1.6.1),
        gdrive (pydrive2 = 1.9.3),
        gs (gcsfs = 2021.8.1),
        http (aiohttp = 3.7.4.post0, aiohttp-retry = 2.4.5),
        https (aiohttp = 3.7.4.post0, aiohttp-retry = 2.4.5),
        s3 (s3fs = 2021.8.1, boto3 = 1.17.106),
        webdav (webdav4 = 0.9.1),
        webdavs (webdav4 = 0.9.1)
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk1s1s1
Caches: local
Remotes: gs
Workspace directory: apfs on /dev/disk1s1s1
Repo: dvc, git

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 14 (2 by maintainers)

Top GitHub Comments

2 reactions
clementperon commented, Oct 5, 2021

The issue was introduced in 2.5.0.

$> pip install 'dvc[gs]==2.4.3'
$> dvc pull -f toto1.dvc toto2.dvc

This works.

$> pip install 'dvc[gs]==2.5.0'
$> dvc pull -f toto1.dvc toto2.dvc

This fails.
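
To see whether an installation is at or past the first release reported as failing, the version line from dvc doctor can be compared against 2.5.0. This is a rough sketch based only on the two data points in this comment: the function names are hypothetical, and no upper bound is applied because the thread does not name a fixed release.

```python
import re

# First release reported as failing in this thread (2.4.3 was reported as working).
FIRST_REPORTED_BAD = (2, 5, 0)


def parse_dvc_version(doctor_output: str) -> tuple[int, ...]:
    """Extract X.Y.Z from the `DVC version:` line of `dvc doctor` output."""
    match = re.search(r"DVC version:\s*(\d+)\.(\d+)\.(\d+)", doctor_output)
    if match is None:
        raise ValueError("no 'DVC version: X.Y.Z' line found")
    return tuple(int(part) for part in match.groups())


def may_be_affected(doctor_output: str) -> bool:
    """True if the version is at or above the first release reported as failing."""
    return parse_dvc_version(doctor_output) >= FIRST_REPORTED_BAD


if __name__ == "__main__":
    print(may_be_affected("DVC version: 2.7.2 (brew)"))
```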

1 reaction
clementperon commented, Oct 7, 2021

It looks like removing TRAVERSE_PREFIX_LEN fixes my issue.

diff --git a/dvc/fs/gs.py b/dvc/fs/gs.py
index 6ee6d735..2d73f063 100644
--- a/dvc/fs/gs.py
+++ b/dvc/fs/gs.py
@@ -16,7 +16,6 @@ class GSFileSystem(CallbackMixin, ObjectFSWrapper):
     REQUIRES = {"gcsfs": "gcsfs"}
     PARAM_CHECKSUM = "etag"
     DETAIL_FIELDS = frozenset(("etag", "size"))
-    TRAVERSE_PREFIX_LEN = 2
 
     def _prepare_credentials(self, **config):
         login_info = {"consistency": None}

Tested on master.
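
The thread does not explain what TRAVERSE_PREFIX_LEN does. Going by its name, and as an assumption not confirmed here, it sets how many leading hex characters of a hash are used to split a remote listing into separate prefix queries. The sketch below illustrates that general idea only; the function names are hypothetical and this is not DVC's implementation.

```python
from itertools import product

HEX_DIGITS = "0123456789abcdef"


def traverse_prefixes(prefix_len: int) -> list[str]:
    """All hex prefixes of the given length, in ascending order."""
    return ["".join(chars) for chars in product(HEX_DIGITS, repeat=prefix_len)]


def group_by_prefix(hashes: list[str], prefix_len: int) -> dict[str, list[str]]:
    """Bucket hashes by their leading prefix_len characters."""
    groups: dict[str, list[str]] = {}
    for value in hashes:
        groups.setdefault(value[:prefix_len], []).append(value)
    return groups


if __name__ == "__main__":
    # A longer prefix means more, smaller listing queries.
    print(len(traverse_prefixes(2)), len(traverse_prefixes(3)))
    print(group_by_prefix(["06a0a6ef5b6446a33623a544ede8bbfd"], 2))
```

Under this reading, a hash is only found if the prefix used for listing lines up with how object paths are laid out on the remote, which would be consistent with the attribute change affecting lookups.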
