dvc pull not fetching all data (cache file not found)

Please provide information about your setup

dvc --version 0.23.2

uname -a Linux arachne-postgres 4.15.0-36-generic #39-Ubuntu SMP Mon Sep 24 16:19:09 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

On server-1, the data pushed to the remote is present in both the local cache and S3:

$ ls -l .dvc/cache/9c/
total 63360
-rw-rw-r-- 1 ubuntu ubuntu   783548 Jan  4 20:53 01b58cb0faab4ee28a9228552ffd8d
-rw-rw-r-- 1 ubuntu ubuntu  3719779 Jan  4 20:53 2861308e6110dc7f4850cbe331e63a
-rw-rw-r-- 2 ubuntu ubuntu    14722 Jan  4 20:51 2fa17d3b0c9486c5af435329f62151
-rw-rw-r-- 1 ubuntu ubuntu   849013 Jan  4 20:52 416b598605a8fcd6fc04c2edab4edc
-rw-rw-r-- 2 ubuntu ubuntu    22852 Jan  4 20:52 55bb368600627e7e14ad7648d8f26b
-rw-rw-r-- 2 ubuntu ubuntu    39899 Jan  4 20:52 5ffd6a38f12b14c6a5e6aafacf133c
-rw-rw-r-- 1 ubuntu ubuntu   614053 Jan  4 20:51 711b4315305a21dabfa44e74740ff7
-rw-rw-r-- 2 ubuntu ubuntu    23825 Jan  4 20:52 765b898179e0c88af48a638dfe6586
-rw-rw-r-- 1 ubuntu ubuntu   148555 Jan  7 07:36 7b5eca364544bf04f63c390dde7f6e.dir
-rw-rw-r-- 1 ubuntu ubuntu   865287 Jan  4 20:53 7f3444015a68ee039d84986d5f9a98
-rw-rw-r-- 2 ubuntu ubuntu    18086 Jan  4 20:52 8bc9b783a8e0568e71d350e4a1fc37
-rw-rw-r-- 2 ubuntu ubuntu     9823 Jan  4 20:52 95e0f6ac3455b66b69974c3936eba7
-rw-rw-r-- 1 ubuntu ubuntu   673028 Jan  4 20:52 997353cce6f0997dc97e311893f3fb
-rw-rw-r-- 1 ubuntu ubuntu 54562699 Jan  4 22:43 9d7b83b536edc6666c76d16e9bfc6b
-rw-rw-r-- 1 ubuntu ubuntu   436909 Jan  4 20:52 a485af08514391eaf58f7a607a1aaa
-rw-rw-r-- 1 ubuntu ubuntu   365795 Jan  4 20:51 a76a77aa1343aba087211e42f6d2b7
-rw-rw-r-- 1 ubuntu ubuntu  1232517 Jan  4 20:53 ae610c9a0246025b573888e84d766e
-rw-rw-r-- 1 ubuntu ubuntu   364246 Jan  4 20:51 bec0f4fc35ecf1377e62c3859362c4
-rw-rw-r-- 1 ubuntu ubuntu    59057 Nov  9 02:47 cf83393276d56191a88c3d54ef6a5d
-rw-rw-r-- 2 ubuntu ubuntu    33169 Jan  4 20:52 e00d88888f810720aee5b46c3f0772

aws  --endpoint=https://ceph.acc.ohsu.edu s3 ls s3://bmeg/dvc/9c/
2019-01-08 04:21:07     783548 01b58cb0faab4ee28a9228552ffd8d
2019-01-08 04:20:50    3719779 2861308e6110dc7f4850cbe331e63a
2019-01-08 04:01:19      14722 2fa17d3b0c9486c5af435329f62151
2019-01-08 04:20:47     849013 416b598605a8fcd6fc04c2edab4edc
2019-01-08 04:02:18      22852 55bb368600627e7e14ad7648d8f26b
2018-12-18 21:41:50      62535 5b6377d120103a5a8e841a7b94ff4c
2019-01-08 04:02:14      39899 5ffd6a38f12b14c6a5e6aafacf133c
2019-01-08 04:19:52     614053 711b4315305a21dabfa44e74740ff7
2019-01-08 04:01:46      23825 765b898179e0c88af48a638dfe6586
2019-01-08 04:01:17     148555 7b5eca364544bf04f63c390dde7f6e.dir
2019-01-08 04:20:18     865287 7f3444015a68ee039d84986d5f9a98
2019-01-08 04:01:38      18086 8bc9b783a8e0568e71d350e4a1fc37
2019-01-08 04:01:44       9823 95e0f6ac3455b66b69974c3936eba7
2019-01-08 04:19:13     673028 997353cce6f0997dc97e311893f3fb
2019-01-04 22:41:27   54562699 9d7b83b536edc6666c76d16e9bfc6b
2019-01-08 04:20:23     436909 a485af08514391eaf58f7a607a1aaa
2019-01-08 04:19:44     365795 a76a77aa1343aba087211e42f6d2b7
2019-01-08 04:20:40    1232517 ae610c9a0246025b573888e84d766e
2018-12-18 21:41:50     799324 b818767f237d4c9647c1208ce8c28b
2019-01-08 04:20:53     364246 bec0f4fc35ecf1377e62c3859362c4
2019-01-07 07:07:32      59057 cf83393276d56191a88c3d54ef6a5d
2019-01-08 04:02:17      33169 e00d88888f810720aee5b46c3f0772

On server-2, dvc never loads all of the files:

aws  --endpoint=https://ceph.acc.ohsu.edu s3 ls s3://bmeg/dvc/9c/

2019-01-07 20:21:07     783548 01b58cb0faab4ee28a9228552ffd8d
2019-01-07 20:20:50    3719779 2861308e6110dc7f4850cbe331e63a
2019-01-07 20:01:19      14722 2fa17d3b0c9486c5af435329f62151
2019-01-07 20:20:47     849013 416b598605a8fcd6fc04c2edab4edc
2019-01-07 20:02:18      22852 55bb368600627e7e14ad7648d8f26b
2018-12-18 13:41:50      62535 5b6377d120103a5a8e841a7b94ff4c
2019-01-07 20:02:14      39899 5ffd6a38f12b14c6a5e6aafacf133c
2019-01-07 20:19:52     614053 711b4315305a21dabfa44e74740ff7
2019-01-07 20:01:46      23825 765b898179e0c88af48a638dfe6586
2019-01-07 20:01:17     148555 7b5eca364544bf04f63c390dde7f6e.dir
2019-01-07 20:20:18     865287 7f3444015a68ee039d84986d5f9a98
2019-01-07 20:01:38      18086 8bc9b783a8e0568e71d350e4a1fc37
2019-01-07 20:01:44       9823 95e0f6ac3455b66b69974c3936eba7
2019-01-07 20:19:13     673028 997353cce6f0997dc97e311893f3fb
2019-01-04 14:41:27   54562699 9d7b83b536edc6666c76d16e9bfc6b
2019-01-07 20:20:23     436909 a485af08514391eaf58f7a607a1aaa
2019-01-07 20:19:44     365795 a76a77aa1343aba087211e42f6d2b7
2019-01-07 20:20:40    1232517 ae610c9a0246025b573888e84d766e
2018-12-18 13:41:50     799324 b818767f237d4c9647c1208ce8c28b
2019-01-07 20:20:53     364246 bec0f4fc35ecf1377e62c3859362c4
2019-01-06 23:07:32      59057 cf83393276d56191a88c3d54ef6a5d
2019-01-07 20:02:17      33169 e00d88888f810720aee5b46c3f0772

ls -l .dvc/cache/9c/
total 908
-rw-rw-r-- 1 ubuntu ubuntu  62535 Nov  9 16:09 5b6377d120103a5a8e841a7b94ff4c
-rw-rw-r-- 1 ubuntu ubuntu 799324 Nov 11 16:54 b818767f237d4c9647c1208ce8c28b
-rw-rw-r-- 1 ubuntu ubuntu  59057 Nov  9 16:22 cf83393276d56191a88c3d54ef6a5d
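
One way to make the gap between the two server-2 listings explicit is a set difference between the remote object names and the local cache file names. The sketch below is only illustrative: `missing_from_cache` is a hypothetical helper, not part of dvc, and the sample lists hold just a few names copied from the listings above rather than the full set.

```python
def missing_from_cache(remote_names, local_names):
    """Return remote object names that have no local cache file, sorted."""
    return sorted(set(remote_names) - set(local_names))


# A few names copied from the server-2 listings (not the full set).
remote = [
    "5b6377d120103a5a8e841a7b94ff4c",
    "7b5eca364544bf04f63c390dde7f6e.dir",
    "b818767f237d4c9647c1208ce8c28b",
    "cf83393276d56191a88c3d54ef6a5d",
]
local = [
    "5b6377d120103a5a8e841a7b94ff4c",
    "b818767f237d4c9647c1208ce8c28b",
    "cf83393276d56191a88c3d54ef6a5d",
]

print(missing_from_cache(remote, local))
# → ['7b5eca364544bf04f63c390dde7f6e.dir']
```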

dvc fetch runs without errors (although it constantly recalculates md5 checksums), but dvc pull always returns the following:

Warning: Cache '9c7b5eca364544bf04f63c390dde7f6e.dir' not found. File '{'path': '/mnt/bmeg/bmeg-etl/source/ccle/vcfs', 'scheme': 'local'}' won't be created.
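
The checksum in the warning lines up with the listings above: the first two characters (`9c`) are the directory name and the remainder is the file name. A minimal sketch of that mapping follows, assuming the layout seen in this issue (`.dvc/cache/` locally, the `dvc/` prefix in the bucket); `cache_relpath` is a hypothetical helper, not a dvc function.

```python
import posixpath


def cache_relpath(checksum):
    """Split a checksum into '<first two chars>/<rest>', as in the listings."""
    return posixpath.join(checksum[:2], checksum[2:])


checksum = "9c7b5eca364544bf04f63c390dde7f6e.dir"

# Local cache file and remote object key that the warning refers to.
print(posixpath.join(".dvc/cache", cache_relpath(checksum)))
print(posixpath.join("dvc", cache_relpath(checksum)))
```

The second path is the key that shows up in the `aws s3 ls` output, while the first is the file missing from the server-2 cache.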

Both servers are on the same git branch and commit.

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 28 (28 by maintainers)

Top GitHub Comments

bwalsh commented, Mar 2, 2019

Wrote a quick test to confirm:

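    import boto3

    # credentials, bucket_name and prefix are assumed to be defined earlier in list.py
    s3 = boto3.client('s3',
                      endpoint_url=credentials['endpoint_url'],
                      aws_access_key_id=credentials['access_key'],
                      aws_secret_access_key=credentials['secret_key'])

    # a single list_objects_v2 call returns at most one page of keys
    response = s3.list_objects_v2(Bucket=bucket_name, Prefix=prefix)
    for item in response['Contents']:
        print(item['Key'])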

# retrieves 1000 items (1 page)
./list.py  | awk -F"/" '{print $1 "/" $2}' | sort | uniq
dvc/00
dvc/01
dvc/02
dvc/03
dvc/04
dvc/05
dvc/06
dvc/07
dvc/08
dvc/09
dvc/0a
dvc/0b
dvc/0c
dvc/0d
dvc/0e
dvc/0f
dvc/10
dvc/11
dvc/12
dvc/13
dvc/14
dvc/15
dvc/16
dvc/17
dvc/18
dvc/19
dvc/1a

# same as the results of fetch

Hypothesis: dvc is only fetching the first page of the listing:

https://github.com/iterative/dvc/blob/9528ad6a1dbe205644431bfb0e02b1e2ae8449bb/dvc/remote/s3.py#L221

Confirmed

# always None
print(response.get("NextContinuationToken", None))
>>> None
print(response.keys())
>>> dict_keys(['ResponseMetadata', 'IsTruncated', 'Contents', 'Name', 'Prefix', 'MaxKeys', 'EncodingType'])
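
The symptom above can be turned into a small check: a listing that reports itself as truncated but carries no continuation token cannot be paged with `list_objects_v2`. This is only a sketch. `pagination_looks_broken` is a hypothetical helper, and it assumes `IsTruncated` was true in the response above, which the output shows only as a key, not as a value.

```python
def pagination_looks_broken(response):
    """True when a listing says it is truncated but gives no token to continue."""
    truncated = bool(response.get("IsTruncated"))
    has_token = bool(response.get("NextContinuationToken"))
    return truncated and not has_token


# Shape of the response seen against this endpoint (IsTruncated value assumed).
print(pagination_looks_broken({"IsTruncated": True, "Contents": []}))
# → True
```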

Could we use list_objects instead?

    paginator = s3.get_paginator('list_objects')
    for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix):
        for item in page['Contents']:
            print(item['Key'])

# returns all the items in the bucket
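
For reference, here is roughly what a marker-based loop over `list_objects` could look like when written by hand. It is a sketch under assumptions, not dvc's implementation: `iter_keys` is a hypothetical helper, `client` is any object exposing boto3's `list_objects(Bucket=..., Prefix=..., Marker=...)` call, and the last key of each page is reused as the next `Marker`.

```python
def iter_keys(client, bucket, prefix=""):
    """Yield every key under prefix, following truncated listings via Marker."""
    kwargs = {"Bucket": bucket, "Prefix": prefix}
    while True:
        response = client.list_objects(**kwargs)
        contents = response.get("Contents", [])
        for item in contents:
            yield item["Key"]
        # Stop when the server reports the listing is complete,
        # or when an empty page leaves nothing to resume from.
        if not response.get("IsTruncated") or not contents:
            break
        # Resume after the last key returned on this page.
        kwargs["Marker"] = contents[-1]["Key"]
```

With a real client this would be called as `iter_keys(s3, bucket_name, prefix)`. Because it does not rely on `NextContinuationToken`, it avoids the field that was missing from the responses above.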
bwalsh commented, Feb 22, 2019

Thank you, we will try it out next week.
