
new command: put-url OR rsync/rclone


Summary

An upload equivalent of dvc get-url.

We currently use get-url as a cross-platform replacement for wget. Adding put-url alongside it would turn DVC into a replacement for rsync/rclone.
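
For example, fetching a single file from any supported storage today looks like this (the URL is illustrative):

$ dvc get-url https://data.example.com/dataset.csv dataset.csv

put-url would be the mirror image of this, uploading instead of downloading.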

Motivation

  • we already have get-url so adding put-url seems natural for the same reasons
  • put-url will be used by
    • CML internally to sync data
    • LDB internally to sync data
    • the rest of the world
  • uses existing functionality of DVC so should be fairly quick to expose
  • cross-platform multi-cloud replacement for rsync/rclone. What’s not to love?
    • could even create a spin-off thin wrapper (or even abstract the functionality) in a separate Python package

Detailed Design

usage: dvc put-url [-h] [-q | -v] [-j <number>] url targets [targets ...]

Upload or copy files to URL.
Documentation: <https://man.dvc.org/put-url>

positional arguments:
  url                   Destination path to put data to.
                        See `dvc import-url -h` for full list of supported
                        URLs.
  targets               Files/directories to upload.

optional arguments:
  -h, --help            show this help message and exit
  -q, --quiet           Be quiet.
  -v, --verbose         Be verbose.
  -j <number>, --jobs <number>
                        Number of jobs to run simultaneously. The default
                        value is 4 * cpu_count(). For SSH remotes, the default
                        is 4.
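
For illustration, uploading local files to a bucket with the proposed command might look like this (the command does not exist yet; bucket and file names are made up):

$ dvc put-url s3://mybucket/backup/ data.csv models/
$ dvc put-url -j 8 gs://mybucket/images/ images/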

How We Teach This

Drawbacks

  • can’t think of any

Alternatives

  • would have to re-implement per-cloud sync options for CML & other products

Unresolved Questions

  • minor implementation details
    • CLI naming (put-url)?
    • CLI argument order (url targets [targets...])?
    • Python API (dvc.api.put_url())?
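
On the Python API question, here is a minimal sketch of what dvc.api.put_url() could look like, written directly on top of fsspec (the filesystem layer DVC already builds on). The name and signature come from the bullet above and are assumptions, not an existing API:

import fsspec

def put_url(target: str, url: str) -> None:
    # Hypothetical sketch: upload a local file or directory to any
    # fsspec-supported URL, mirroring the proposed `dvc put-url` CLI.
    fs, remote_path = fsspec.core.url_to_fs(url)
    # recursive=True also covers the directory case from the proposal.
    fs.put(target, remote_path, recursive=True)

# e.g. put_url("out/model.h5", "s3://mybucket/ml/prod/my-model.h5")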

Please do assign me if you’re happy with the proposal.

(dvc get-url + put-url = dvc rsync 😃)


Top GitHub Comments

dmpetrov commented on Mar 9, 2022 (4 reactions):

I’m trying to aggregate our discussions here and in person into action points:

  1. [Must-have] dvc export, which should upload a local file to the cloud and preserve a link (.dvc file), similar to the result of dvc import-url.
  2. [Nice-to-have] dvc put-url. It is not part of the use cases (see below), but something like it needs to work under the hood of dvc export anyway, and it might be handy for other scenarios.
  3. [Nice-to-have] dvc import-url --etags-only (like --no-exec, but it gets etags from the cloud) and/or dvc update --etags-only. This is needed to track file status when the file is not downloaded locally.

Important:

  • All these commands have to work in a non-DVC environment, and even a non-Git environment.
  • All these commands have to support directories, since a model might be a directory (this might be postponed to a later iteration).

The user-facing use cases below should help in understanding the scenarios.

From local to Cloud/S3

A model out/model.h5 is saved in a local directory, whether on a local machine, in cloud/TPI, or in CML; it might be a DVC/Git repo or just a directory like ~/. The model needs to be uploaded to a specified place/URL in a cloud/S3. The user needs to keep the pointer file (.dvc) for future use.

Why the user needs the pointer file:

  • for a record / lineage
  • for a 3rd-party tool (deployment, for example) or dvc get to download the file
  • to check status, i.e. whether the file has changed

Uploading

$ dvc export out/model.h5 s3://mybucket/ml/prod/my-model.h5
To track the changes with git, run:

    git add out/model.h5.dvc .gitignore
$ git add out/model.h5.dvc
$ git commit -m 'exporting a file'

Note: this command is equivalent to aws s3 cp file s3://path && dvc import-url s3://path file, as spelled out below. We can consider introducing a separate command, dvc put-url, to cover the copy part in a cross-cloud way. However, its priority is not high in the context of this scenario.
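
With the paths from the Uploading example above, the manual two-step equivalent today would be:

$ aws s3 cp out/model.h5 s3://mybucket/ml/prod/my-model.h5
$ dvc import-url s3://mybucket/ml/prod/my-model.h5 out/model.h5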

Updating

A model file was changed (as a result of re-training, for example):

$ dvc update out/model.h5.dvc # It should work now if the Uploading part is based on `import-url`
To track the changes with git, run:

    git add out/model.h5.dvc .gitignore
$ git add out/model.h5.dvc
$ git commit -m 'File was changed in S3'

From cloud to workspace

Users write models/data to the cloud from their own code (or the data is updated by an external tool). Saving a pointer to the model file might still be useful. Why:

  • for a record / lineage
  • for a 3rd-party tool (deployment, for example) or dvc get to download the file
  • to know how to update it if the model changes

Tracking a cloud file

After training is done and a file is saved to s3://mybucket/ml/prod/2022-03-07-model.h5:

$ dvc import-url s3://mybucket/ml/prod/2022-03-07-model.h5 my-model.h5
To track the changes with git, run:

    git add my-model.h5.dvc .gitignore
$ git add my-model.h5.dvc
$ git commit -m 'importing a file'

Tracking a cloud file without a local copy

In some cases, the user writes a file to storage and does not need a copy in the workspace. dvc import-url --no-exec seems like a good option to cover this case.

$ dvc import-url --no-exec s3://mybucket/ml/prod/2022-03-07-model.h5 my-model.h5
To track the changes with git, run:

    git add my-model.h5.dvc .gitignore
$ git add my-model.h5.dvc
$ git commit -m 'importing a file'

Technically, the file will still have a virtual representation in the workspace as my-model.h5. However, it won’t be materialized until dvc update my-model.h5.dvc is called.

Pros/Cons:

  • [Pro] It is consistent with the existing dvc commands.
  • [Pro] GitOps can reference a “virtual” model file. CC @aguschin
  • [Con] The .dvc file does not have checksums or etags, so the user cannot tell whether the file has changed in the cloud since the last time import-url was called.

To address the last con, we can consider introducing dvc import-url --etags-only (--no-exec, but getting etags from the cloud) and/or dvc update --etags-only.
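
With the proposed flags (neither exists yet, so this is only a sketch of the intended UX), tracking a cloud file without ever downloading it could look like:

$ dvc import-url --etags-only s3://mybucket/ml/prod/2022-03-07-model.h5 my-model.h5
$ dvc update --etags-only my-model.h5.dvc  # refresh etags without downloading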

dberenbaum commented on Mar 11, 2022 (2 reactions):

From local to Cloud/S3

In this scenario, the user has their own local model.h5 file already. It may or may not be tracked by DVC. If it is tracked by DVC, it might be tracked in model.h5.dvc or within dvc.lock (if it’s generated by a DVC stage).

If they want to upload to the cloud and keep a pointer locally, dvc export can be equivalent to dvc run --external -n upload_data -d model.h5 -o s3://testproject/model.h5 aws s3 cp model.h5 s3://testproject/model.h5. This is the inverse of import-url, as shown in the example in https://dvc.org/doc/command-reference/import-url#description.

As @shcheklein noted, the workflow here assumes the user saves updates locally, so it makes sense for update to go in the upload direction and enforce a canonical workflow of save locally -> upload new version.

Similar to how import-url records the external path as a dependency and the local path as an output, export can record the local path as a dependency and the external path as an output. Since a model.h5.dvc file may already exist from a previous dvc add (with model.h5 as an output), it might make more sense to save the export info with some other file extension, like model.h5.export.dvc (this avoids conflicts between the dependencies and outputs of each).

I’ll follow up on the other scenarios in another comment to keep this from being too convoluted 😅

Edit: On second thought, maybe it’s better to resolve this scenario first 😄 . The others might require a separate discussion.
