perf: remote indexes

As we discussed, making dvc fast should be a high priority, as poor performance can easily drive people away. A big part of today's slowness is working with remotes, which almost always involves collecting file statuses, and that can be slow for bigger remotes. All this points to some form of index.

However, our remotes don't provide the luxury of atomic group writes, atomic reads, or read-modify-write operations. We can still use the following strategy:

- make an index dir on the remote, say index,
- write a list of files (with names, checksums, mtimes and/or file sizes) to an index file in that dir, say 1.idx,
- later, when some client needs to update it, e.g. after pushing some new files, it:
  - reads the current index,
  - updates it,
  - writes the new list to 2.<uuid>.idx,
  - removes 1.idx.

  This way, in the case of a race, we will have several index files.
- if a client needs to read the index, it downloads all index files and combines them.
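The steps above can be sketched in Python, with a local directory standing in for the remote. This is only an illustration of the proposed protocol, not dvc code: `read_index` and `update_index` are hypothetical names, the index is assumed to be plain JSON mapping file names to metadata, and combining is a simple union (deletes are not handled here).

```python
import json
import uuid
from pathlib import Path


def read_index(index_dir: Path) -> tuple[dict, list[Path]]:
    """Read every index file in the dir and combine them by union."""
    files = sorted(index_dir.glob("*.idx"))
    combined: dict = {}
    for f in files:
        combined.update(json.loads(f.read_text()))
    return combined, files


def update_index(index_dir: Path, new_entries: dict) -> Path:
    """Non-atomic read-modify-write: write a new, uniquely named index
    file first, then remove the index files that were read."""
    index_dir.mkdir(parents=True, exist_ok=True)
    combined, old_files = read_index(index_dir)
    combined.update(new_entries)

    # Next generation number = highest generation seen + 1.
    # File names look like "1.idx" or "2.<uuid>.idx".
    gen = 1 + max((int(f.name.split(".")[0]) for f in old_files), default=0)
    new_file = index_dir / f"{gen}.{uuid.uuid4().hex}.idx"
    new_file.write_text(json.dumps(combined))

    for f in old_files:
        # A racing client may have removed the file already.
        f.unlink(missing_ok=True)
    return new_file
```

Because the new file is written before the old ones are removed, a reader that arrives mid-update sees either the old files, the new one, or both, and the union of whatever it finds still contains every entry.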

Since we will have not only adds but also deletes, we will need a smart combine procedure, like in CRDTs.
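One possible shape for such a combine procedure is a last-writer-wins map with tombstones, a common CRDT pattern. The sketch below is an assumption for illustration, not a design agreed on in this issue: each entry carries a timestamp, a checksum of `None` marks a delete, and `merge` and `live_files` are hypothetical helpers. It also assumes client clocks are reasonably in sync.

```python
from typing import Optional

# path -> (timestamp, checksum); a checksum of None is a tombstone (delete)
Index = dict[str, tuple[float, Optional[str]]]


def _key(ts: float, checksum: Optional[str]) -> tuple[float, str]:
    # Total order used to pick a winner; on equal timestamps an add
    # (non-empty checksum) beats a tombstone, deterministically.
    return (ts, checksum or "")


def merge(*indexes: Index) -> Index:
    """Combine index files; the entry with the largest key wins per path."""
    merged: Index = {}
    for index in indexes:
        for path, (ts, checksum) in index.items():
            current = merged.get(path)
            if current is None or _key(ts, checksum) > _key(*current):
                merged[path] = (ts, checksum)
    return merged


def live_files(index: Index) -> dict[str, str]:
    """Drop tombstones and timestamps, leaving path -> checksum."""
    return {p: c for p, (_, c) in index.items() if c is not None}
```

Taking a per-path maximum makes the merge independent of the order in which index files are downloaded and safe to repeat, which is what the several-files-after-a-race scenario needs.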

The file format is still to be discussed, though simple JSON or gzipped JSON with a list of files may do the job.
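As a rough idea of what the gzipped JSON option could look like, here is a minimal sketch using only the standard library. The top-level `"files"` key and the field names in each entry are assumptions, since the format was left open.

```python
import gzip
import json


def dump_index(entries: list[dict]) -> bytes:
    """Serialize a list of file entries to gzipped JSON."""
    payload = json.dumps({"files": entries}, sort_keys=True).encode("utf-8")
    return gzip.compress(payload)


def load_index(blob: bytes) -> list[dict]:
    """Inverse of dump_index."""
    return json.loads(gzip.decompress(blob).decode("utf-8"))["files"]


# Hypothetical entry layout: name, checksum, mtime, size.
example = [
    {"name": "data/a.csv", "checksum": "0123abcd", "mtime": 1566259200, "size": 42},
]
blob = dump_index(example)
restored = load_index(blob)
```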

What do you guys think? @shcheklein @dmpetrov @efiop @pared @mroutis

Issue Analytics

Created: 4 years ago
Reactions: 6
Comments: 13 (13 by maintainers)

Top GitHub Comments

Suor commented, Aug 20, 2019

A note on garbage collection.

Since git is distributed and we collect everything that is not referenced, we may remove something that was recently pushed and is only referenced in git commits not available locally yet: commits not pulled, or not even pushed yet by their author.

The sane way to avoid this is to provide a grace period: do not gc anything newer than N days. If someone has just pushed something they shouldn't have and wants to remove it, that should still be possible:

dvc gc -c --grace-period=0

This is the “I know what I am doing even though I just messed up” flag 😃

The grace period might simplify some of the conflict resolution above. E.g. we might use a normal listing in gc instead of the index listing to collect orphaned files. This should be discussed in #2325.
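The grace period rule itself is small enough to sketch. The helper below is hypothetical and not part of dvc: it assumes gc already has a mapping from each orphaned object to its upload or modification time in epoch seconds, and it only decides which of those are old enough to remove.

```python
import time
from typing import Optional

SECONDS_PER_DAY = 86400


def collectable(
    orphans: dict[str, float],
    grace_days: float,
    now: Optional[float] = None,
) -> list[str]:
    """Return orphaned paths whose timestamp is at least grace_days old.

    With grace_days=0 every orphan is returned, matching the
    "I know what I am doing" case.
    """
    now = time.time() if now is None else now
    cutoff = now - grace_days * SECONDS_PER_DAY
    return sorted(path for path, mtime in orphans.items() if mtime <= cutoff)
```

For example, with a 7-day grace period an object pushed an hour ago is kept even if nothing visible locally references it, while the same call with `grace_days=0` would return it for removal.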

efiop commented, Feb 20, 2022

We’ll create a ticket in dvc-objects/data for this kind of odb indexing, with a summary, in the mid term, and will close this one.
