cloud versioning: pull non-DVC cloud updates

In a project with a cloud-versioned remote, manual or automated updates can happen on the remote outside of any DVC process. For example, a daily ETL process might upload new data directly to the cloud remote, or someone unfamiliar with DVC might upload some new data.

A DVC user should be able to not only pull the versions of data tracked in their project, but also pull the latest version of data available on the cloud. For example, dvc update could be used to get the latest data the same way it does today for imports.

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

1 reaction
dberenbaum commented, Nov 29, 2022

One thing to clarify that may simplify this: For now, it’s fine to keep the current syntax for dvc update where it requires a target. We may need to support dvc update without a target eventually, but for now being able to update a dataset at a time is enough.

@dberenbaum one thing we didn't discuss was how to handle deletion for a standalone file output, where a tracked file no longer exists in the bucket (and has a DELETE marker set).

If you import data with import/import-url, then delete the source and run dvc update, it will fail. Let’s stay consistent with that for now.

1 reaction
pmrowla commented, Nov 29, 2022

After discussion with @dberenbaum, the scope for this issue in the initial release will be:

  • dvc update will only be applicable in worktree = true scenarios (and not version_aware = true, worktree = false)
  • update should pull the latest versions of outs we are aware of (and have .dvc files for locally).
    • For a file output, update would get latest modifications to the file
    • For a dir output, update would get latest modifications to files in the dir, newly added files in the dir, and deletions of files within the dir
  • update will ignore files/dirs in the remote bucket that we are not aware of already (i.e. new files/dirs that are in the bucket but do not appear in any of our local .dvc/dvc.yaml files)
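
The scoping rules above can be illustrated with a small, self-contained Python model. This is only a sketch of the described behavior, not DVC's implementation: `RemoteObject`, `plan_update`, and the dictionary shapes are hypothetical names invented for illustration, and the "remote" is a plain in-memory listing rather than a real bucket. It assumes a tracked standalone file that is missing or delete-marked should fail, consistent with the import/import-url behavior mentioned in the earlier comment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RemoteObject:
    """Latest state of one key in a hypothetical versioned bucket."""
    version_id: str
    deleted: bool = False  # True when the latest version is a delete marker


def plan_update(tracked_files, tracked_dirs, remote):
    """Return {path: version_id or None} describing what an update would change.

    tracked_files: {path: version_id} for standalone file outputs
    tracked_dirs:  {dir_prefix: {path: version_id}} for directory outputs
    remote:        {path: RemoteObject} listing of the bucket
    A value of None means the path should be removed locally.
    """
    plan = {}

    # File outputs: pick up the latest modification, fail if the file is gone.
    for path, known in tracked_files.items():
        obj = remote.get(path)
        if obj is None or obj.deleted:
            raise FileNotFoundError(f"tracked file missing from remote: {path}")
        if obj.version_id != known:
            plan[path] = obj.version_id

    # Dir outputs: modifications, newly added files, and deletions in the dir.
    for prefix, members in tracked_dirs.items():
        root = prefix.rstrip("/") + "/"
        live = {
            p: o for p, o in remote.items()
            if p.startswith(root) and not o.deleted
        }
        for path, obj in live.items():
            if members.get(path) != obj.version_id:
                plan[path] = obj.version_id  # modified or newly added
        for path in members:
            if path not in live:
                plan[path] = None  # deleted within the dir

    # Anything outside tracked files and tracked dir prefixes is never visited,
    # so unknown files/dirs in the bucket are ignored.
    return plan


remote = {
    "data.csv": RemoteObject("v2"),
    "images/a.png": RemoteObject("v1"),
    "images/b.png": RemoteObject("v3", deleted=True),
    "images/c.png": RemoteObject("v1"),
    "untracked/new.bin": RemoteObject("v1"),
}
plan = plan_update(
    {"data.csv": "v1"},
    {"images": {"images/a.png": "v1", "images/b.png": "v2"}},
    remote,
)
print(plan)
```

In this toy listing, `data.csv` is refreshed, `images/c.png` is picked up as a new file, `images/b.png` is marked for removal, and `untracked/new.bin` is left alone because no local .dvc/dvc.yaml entry refers to it.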

