
DVC is not able to pull files from a public Backblaze S3 remote

See original GitHub issue

Bug Report

Description

Set up a public Backblaze remote and push committed files with DVC. Clone the Git repository with only the public access information for Backblaze (no secret_access_key) and try a dvc pull; it fails.

Reproduce

  1. git init
  2. dvc init
  3. Configure the Backblaze remote:

# .dvc/config
[core]
    remote = b2
['remote "b2"']
    url = s3://<BUCKET>/
    endpointurl = https://s3.us-west-000.backblazeb2.com

# .dvc/config.local
['remote "b2"']
    access_key_id = <ACCESS_KEY>
    secret_access_key = <SECRET_KEY>

  4. Set the B2 bucket as public.
  5. Copy file.txt into the repository
  6. dvc add file.txt
  7. git add .
  8. git commit -m "Initial"
  9. dvc push
  10. Move to another directory (cd /tmp)
  11. git clone the original repository (only the public B2 information is copied)
  12. dvc pull
  13. The exception occurs: ERROR: failed to pull data from the cloud - Unable to find AWS credentials. <https://error.dvc.org/no-credentials>: Unable to locate credentials

Expected

dvc pull retrieves the committed file without problems

Environment information

Output of dvc doctor:

$ dvc doctor
DVC version: 2.1.0 (pip)
---------------------------------
Platform: Python 3.9.5 on Linux-5.12.7-200.fc33.x86_64-x86_64-with-glibc2.32
Supports: gdrive, http, https, s3, ssh
Cache types: <https://error.dvc.org/no-dvc-cache>
Caches: local
Remotes: s3
Workspace directory: tmpfs on tmpfs
Repo: dvc, git

Additional Information (if any): As reported in the Backblaze docs, listing files always requires authorization:

Access controls are simple. Uploads into a bucket always require authorization. Listing files in a bucket always requires authorization, and deleting files always requires authorization. For downloading files, though, you have the option of requiring authorization, or making all of the files in a bucket visible to the public.

Maybe this is the cause of the error, since it differs from the AWS S3 default. On the other hand, the md5 hashes of the outputs are written explicitly inside dvc.yaml, so there should be no need to list the bucket at all.
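To illustrate why listing should not be strictly necessary: in DVC 2.x the remote cache addresses each object by its hash, under a `<first two hex chars>/<remaining chars>` key, so a known md5 is enough to fetch the object directly. A minimal sketch (the hash below is made up; `pl-experiments` is the bucket name from the debug log above):

```python
def cache_key(md5: str) -> str:
    """Map a DVC 2.x output hash to its object key in the remote cache.

    DVC stores each cached object under <md5[:2]>/<md5[2:]>, so the md5
    recorded in the .dvc file fully determines the object's location --
    no bucket listing is needed to construct the download URL.
    """
    return f"{md5[:2]}/{md5[2:]}"

# Hypothetical md5 as it would appear in a .dvc file's `md5:` field.
md5 = "d3b07384d113edec49eaa6238ad5ff00"
print(cache_key(md5))                          # d3/b07384d113edec49eaa6238ad5ff00
print(f"s3://pl-experiments/{cache_key(md5)}")  # full object URL on the remote
```

The traceback shows the failure actually happens in `_list_paths` / `hashes_exist`, i.e. during the status check that enumerates remote hashes, which is where the listing (and hence the credentials requirement) comes in.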

$ dvc pull --debug
2021-06-22 11:39:43,383 DEBUG: Preparing to download data from 's3://pl-experiments/'
2021-06-22 11:39:43,383 DEBUG: Preparing to collect status from s3://pl-experiments/
2021-06-22 11:39:43,384 DEBUG: Collecting information from local cache...
2021-06-22 11:39:43,387 DEBUG: Collecting information from remote cache...                                                                                                                                                                   
2021-06-22 11:39:43,388 DEBUG: Matched '0' indexed hashes
Everything is up to date.                                                                                                                                                                                                                    
2021-06-22 11:39:45,639 ERROR: failed to pull data from the cloud - Unable to find AWS credentials. <https://error.dvc.org/no-credentials>: Unable to locate credentials
------------------------------------------------------------
Traceback (most recent call last):
  File "/home/trenta3/.local/lib/python3.9/site-packages/dvc/fs/s3.py", line 155, in _get_s3
    yield self.s3
  File "/home/trenta3/.local/lib/python3.9/site-packages/dvc/fs/s3.py", line 172, in _get_bucket
    yield s3.Bucket(bucket)
  File "/home/trenta3/.local/lib/python3.9/site-packages/dvc/fs/s3.py", line 284, in _list_paths
    for obj_summary in obj_summaries:
  File "/home/trenta3/.local/lib/python3.9/site-packages/boto3/resources/collection.py", line 83, in __iter__
    for page in self.pages():
  File "/home/trenta3/.local/lib/python3.9/site-packages/boto3/resources/collection.py", line 166, in pages
    for page in pages:
  File "/home/trenta3/.local/lib/python3.9/site-packages/botocore/paginate.py", line 255, in __iter__
    response = self._make_request(current_kwargs)
  File "/home/trenta3/.local/lib/python3.9/site-packages/botocore/paginate.py", line 332, in _make_request
    return self._method(**current_kwargs)
  File "/home/trenta3/.local/lib/python3.9/site-packages/botocore/client.py", line 386, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/home/trenta3/.local/lib/python3.9/site-packages/botocore/client.py", line 691, in _make_api_call
    http, parsed_response = self._make_request(
  File "/home/trenta3/.local/lib/python3.9/site-packages/botocore/client.py", line 711, in _make_request
    return self._endpoint.make_request(operation_model, request_dict)
  File "/home/trenta3/.local/lib/python3.9/site-packages/botocore/endpoint.py", line 102, in make_request
    return self._send_request(request_dict, operation_model)
  File "/home/trenta3/.local/lib/python3.9/site-packages/botocore/endpoint.py", line 132, in _send_request
    request = self.create_request(request_dict, operation_model)
  File "/home/trenta3/.local/lib/python3.9/site-packages/botocore/endpoint.py", line 115, in create_request
    self._event_emitter.emit(event_name, request=request,
  File "/home/trenta3/.local/lib/python3.9/site-packages/botocore/hooks.py", line 356, in emit
    return self._emitter.emit(aliased_event_name, **kwargs)
  File "/home/trenta3/.local/lib/python3.9/site-packages/botocore/hooks.py", line 228, in emit
    return self._emit(event_name, kwargs)
  File "/home/trenta3/.local/lib/python3.9/site-packages/botocore/hooks.py", line 211, in _emit
    response = handler(**kwargs)
  File "/home/trenta3/.local/lib/python3.9/site-packages/botocore/signers.py", line 90, in handler
    return self.sign(operation_name, request)
  File "/home/trenta3/.local/lib/python3.9/site-packages/botocore/signers.py", line 162, in sign
    auth.add_auth(request)
  File "/home/trenta3/.local/lib/python3.9/site-packages/botocore/auth.py", line 373, in add_auth
    raise NoCredentialsError()
botocore.exceptions.NoCredentialsError: Unable to locate credentials

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/trenta3/.local/lib/python3.9/site-packages/dvc/command/data_sync.py", line 29, in run
    stats = self.repo.pull(
  File "/home/trenta3/.local/lib/python3.9/site-packages/dvc/repo/__init__.py", line 49, in wrapper
    return f(repo, *args, **kwargs)
  File "/home/trenta3/.local/lib/python3.9/site-packages/dvc/repo/pull.py", line 29, in pull
    processed_files_count = self.fetch(
  File "/home/trenta3/.local/lib/python3.9/site-packages/dvc/repo/__init__.py", line 49, in wrapper
    return f(repo, *args, **kwargs)
  File "/home/trenta3/.local/lib/python3.9/site-packages/dvc/repo/fetch.py", line 62, in fetch
    downloaded += self.cloud.pull(
  File "/home/trenta3/.local/lib/python3.9/site-packages/dvc/data_cloud.py", line 88, in pull
    return remote.pull(
  File "/home/trenta3/.local/lib/python3.9/site-packages/dvc/remote/base.py", line 56, in wrapper
    return f(obj, *args, **kwargs)
  File "/home/trenta3/.local/lib/python3.9/site-packages/dvc/remote/base.py", line 486, in pull
    ret = self._process(
  File "/home/trenta3/.local/lib/python3.9/site-packages/dvc/remote/base.py", line 323, in _process
    dir_status, file_status, dir_contents = self._status(
  File "/home/trenta3/.local/lib/python3.9/site-packages/dvc/remote/base.py", line 175, in _status
    self.hashes_exist(
  File "/home/trenta3/.local/lib/python3.9/site-packages/dvc/remote/base.py", line 132, in hashes_exist
    return indexed_hashes + self.odb.hashes_exist(list(hashes), **kwargs)
  File "/home/trenta3/.local/lib/python3.9/site-packages/dvc/objects/db/base.py", line 408, in hashes_exist
    remote_size, remote_hashes = self._estimate_remote_size(hashes, name)
  File "/home/trenta3/.local/lib/python3.9/site-packages/dvc/objects/db/base.py", line 230, in _estimate_remote_size
    remote_hashes = set(hashes)
  File "/home/trenta3/.local/lib/python3.9/site-packages/dvc/objects/db/base.py", line 184, in _hashes_with_limit
    for hash_ in self.list_hashes(prefix, progress_callback):
  File "/home/trenta3/.local/lib/python3.9/site-packages/dvc/objects/db/base.py", line 174, in list_hashes
    for path in self._list_paths(prefix, progress_callback):
  File "/home/trenta3/.local/lib/python3.9/site-packages/dvc/objects/db/base.py", line 154, in _list_paths
    for file_info in self.fs.walk_files(path_info, prefix=prefix):
  File "/home/trenta3/.local/lib/python3.9/site-packages/dvc/fs/s3.py", line 290, in walk_files
    for fname in self._list_paths(path_info, **kwargs):
  File "/home/trenta3/.local/lib/python3.9/site-packages/dvc/fs/s3.py", line 285, in _list_paths
    yield obj_summary.key
  File "/usr/lib64/python3.9/contextlib.py", line 135, in __exit__
    self.gen.throw(type, value, traceback)
  File "/home/trenta3/.local/lib/python3.9/site-packages/dvc/fs/s3.py", line 175, in _get_bucket
    raise DvcException(
  File "/usr/lib64/python3.9/contextlib.py", line 135, in __exit__
    self.gen.throw(type, value, traceback)
  File "/home/trenta3/.local/lib/python3.9/site-packages/dvc/fs/s3.py", line 158, in _get_s3
    raise DvcException(
dvc.exceptions.DvcException: Unable to find AWS credentials. <https://error.dvc.org/no-credentials>
------------------------------------------------------------
2021-06-22 11:39:45,658 DEBUG: Analytics is enabled.
2021-06-22 11:39:45,794 DEBUG: Trying to spawn '['daemon', '-q', 'analytics', '/tmp/tmpb1sj_5dk']'
2021-06-22 11:39:45,797 DEBUG: Spawned '['daemon', '-q', 'analytics', '/tmp/tmpb1sj_5dk']'

Issue Analytics

  • State: open
  • Created 2 years ago
  • Comments: 11 (4 by maintainers)

Top GitHub Comments

2 reactions
shcheklein commented, Jun 22, 2021

Would it be possible to use an HTTP remote for reads? S3 can provide an HTTP endpoint for a bucket. Does Backblaze have something like this?

We use this setup in the example-get-started repo. When we push, we use the S3 remote (e.g. with the -r flag), but the default remote is set to HTTP, which points to the endpoint that S3 provides.

With HTTP I think we don't need any special permissions, but it can be slower in certain scenarios.
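Applied to this issue, the two-remote setup shcheklein describes might look like the following sketch. The download host is an assumption: Backblaze assigns a per-account friendly-URL host (of the form f<NNN>.backblazeb2.com), so the actual value should be taken from the bucket's download URL in the B2 console:

# .dvc/config
[core]
    remote = b2-http
['remote "b2-http"']
    url = https://f000.backblazeb2.com/file/<BUCKET>
['remote "b2"']
    url = s3://<BUCKET>/
    endpointurl = https://s3.us-west-000.backblazeb2.com

With this layout, a plain dvc pull reads anonymously over HTTP from the public bucket, while dvc push -r b2 writes through the S3-compatible API using the credentials in .dvc/config.local.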

1 reaction
pared commented, Jun 23, 2021

@jdonzallaz OK, I found our conversation: my comment was regarding https://github.com/iterative/dvc/issues/5797, which is still a work in progress and might help in this use case.
