perf: remote indexes

As we discussed, making dvc fast should be a high priority, as poor performance can easily drive people away. A big part of today's slowness comes from working with remotes, which almost always involves collecting file statuses, and that can be slow for bigger remotes. All of this points to some form of index.

However, our remotes don’t give us the luxury of atomic group writes, atomic group reads, or read-modify-write operations. We can still use the following strategy:

  • make an index directory on the remote, say index,
  • write a list of files (with names, checksums, mtimes and/or file sizes) to an index file in that directory, say 1.idx,
  • later, when some client needs to update it, e.g. after pushing some new files, it:
    • reads the current index,
    • updates it,
    • writes the new list to 2.<uuid>.idx,
    • removes 1.idx. This way, in case of a race, we will simply have several index files.
  • if a client needs to read the index, it downloads all index files and combines them.
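
The read and update steps above can be sketched roughly as follows. This is only an illustration, not dvc code: a local directory stands in for the remote, the function names `read_index` and `update_index` are hypothetical, and every index file is assumed to be named `<generation>.<uuid>.idx` (with a plain `1.idx` also accepted). The combine step here is a naive union that only handles adds.

```python
import json
import uuid
from pathlib import Path


def read_index(index_dir: Path) -> tuple:
    """Combine every *.idx file in index_dir; return (entries, files that were read)."""
    entries: dict = {}
    seen = sorted(index_dir.glob("*.idx"))
    for path in seen:
        # Naive combine: a plain union keyed by file name (adds only).
        entries.update(json.loads(path.read_text()))
    return entries, seen


def update_index(index_dir: Path, new_entries: dict) -> Path:
    """Read the current index, update it, write a new file, remove the files read."""
    index_dir.mkdir(parents=True, exist_ok=True)
    entries, seen = read_index(index_dir)
    entries.update(new_entries)
    # The new generation is one past the highest generation that was read.
    generation = 1 + max((int(p.name.split(".")[0]) for p in seen), default=0)
    target = index_dir / f"{generation}.{uuid.uuid4().hex}.idx"
    target.write_text(json.dumps(entries))
    for path in seen:
        # A racing client may already have removed this file; that is fine.
        path.unlink(missing_ok=True)
    return target
```

Because each writer removes only the files it actually read, two clients racing on the same generation leave two index files behind, and the next reader merges both.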

Since we will have not only adds but also deletes, we will need a smart combine procedure, like the ones used in CRDTs.
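
One possible shape for such a combine procedure is a last-writer-wins map with tombstones, a common CRDT construction. The sketch below is an assumption about how it could look, not a design agreed in this issue: the record fields `ts`, `deleted` and `checksum`, and the names `merge_indexes` and `live_files`, are all hypothetical.

```python
def merge_indexes(*indexes: dict) -> dict:
    """Last-writer-wins merge: per path, keep the record with the newest timestamp."""

    def rank(record: dict) -> tuple:
        # Timestamp ties resolve in favour of the tombstone, then by checksum, so the
        # result does not depend on the order in which index files are read.
        return (record["ts"], record["deleted"], record.get("checksum") or "")

    merged: dict = {}
    for index in indexes:
        for path, record in index.items():
            if path not in merged or rank(record) > rank(merged[path]):
                merged[path] = record
    return merged


def live_files(index: dict) -> dict:
    """Drop tombstones to get the files that are currently present."""
    return {path: rec for path, rec in index.items() if not rec["deleted"]}
```

A delete is recorded as a tombstone (`deleted` set to true with a newer `ts`) instead of dropping the entry, so that merging with an older index file cannot bring the deleted file back.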

The file format is to be discussed, though simple JSON or gzipped JSON with a list of files may do the job.
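
As a starting point for that discussion, a gzipped JSON index could be read and written along these lines. The layout is an assumption made for illustration: the `version` key, the `files` key, and the per-file field names are hypothetical and only mirror the attributes listed above.

```python
import gzip
import json


def dump_index(files: list) -> bytes:
    """Serialize a list of file records to gzipped JSON."""
    payload = json.dumps({"version": 1, "files": files}, sort_keys=True)
    return gzip.compress(payload.encode("utf-8"))


def load_index(blob: bytes) -> list:
    """Inverse of dump_index: return the list of file records."""
    return json.loads(gzip.decompress(blob).decode("utf-8"))["files"]


# Hypothetical record with the attributes mentioned in the proposal.
example = [{"name": "data/train.csv", "checksum": "abc123", "mtime": 1566259200, "size": 1024}]
blob = dump_index(example)
restored = load_index(blob)
```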

What do you guys think? @shcheklein @dmpetrov @efiop @pared @mroutis

Issue Analytics

  • State: open
  • Created 4 years ago
  • Reactions: 6
  • Comments: 13 (13 by maintainers)

Top GitHub Comments

2 reactions
Suor commented, Aug 20, 2019

A note on garbage collection.

Since git is distributed and gc collects everything that is not referenced, we may remove something that was pushed recently and is referenced only in git commits that are not available locally yet, either because they have not been pulled or because the author has not even pushed them.

The sane way around this is a grace period: do not gc anything newer than N days. If someone has just pushed something they shouldn’t have and wants to remove it, that should still be possible:

dvc gc -c --grace-period=0

This is the “I know what I am doing even though I just messed up” flag 😃

The grace period might simplify some of the conflict resolution above. E.g. we might use a normal listing in gc instead of the index listing to collect orphaned files. This should be discussed in #2325.
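
The grace-period rule itself is simple to express. The following is a minimal sketch of the filtering step only, assuming gc already has a mapping from each unreferenced remote path to its modification time; the function name `collectable` and its parameters are hypothetical and not part of dvc.

```python
from datetime import datetime, timedelta, timezone


def collectable(unreferenced: dict, grace_period_days: float, now=None) -> list:
    """Return the unreferenced paths that are older than the grace period.

    `unreferenced` maps a remote path to its modification time (aware datetime).
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=grace_period_days)
    # Anything modified after the cutoff may belong to a commit we have not seen yet.
    return sorted(path for path, mtime in unreferenced.items() if mtime <= cutoff)
```

With a grace period of 0 the cutoff equals the current time, so every unreferenced file becomes collectable, which matches the intent of the override described above.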

1 reaction
efiop commented, Feb 20, 2022

We’ll create a ticket in dvc-objects/data for this kind of odb indexing, with a summary, in the mid term, and will close this one.
