perf: remote indexes
As we discussed, making dvc fast should be a high priority, as poor performance can easily drive people away. A big part of today’s slowness is working with remotes, which almost always involves collecting file statuses, and that can be slow for bigger remotes. All this points to some form of index.
However, our remotes don’t provide the luxury of atomic group writes or reads, or of read-modify-write operations. We can still use the following strategy:
- make an index dir on remote, say `index`,
- write a list of files (with names, checksums, mtimes and/or file sizes) to an index file in that dir, say `1.idx`,
- later, when some client needs to update that, e.g. after pushing some new files, it:
  - reads the current index,
  - updates it,
  - writes the new list to `2.<uuid>.idx`,
  - removes `1.idx`.

  This way, in case of a race, we will have several index files.
- if a client needs to read an index it downloads all index files and combines them.
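
The update steps above could look roughly like the sketch below. It is only an illustration: a local directory stands in for the remote `index` dir, the function names `read_index` and `update_index` are made up, and it assumes a client removes every index file it has read (not just `1.idx`). The combine step here is a naive union that ignores deletes.

```python
import json
import uuid
from pathlib import Path


def read_index(index_dir: Path):
    """Read every *.idx file and combine them (naive union, no delete handling)."""
    files = sorted(index_dir.glob("*.idx"))
    merged = {}
    for f in files:
        merged.update(json.loads(f.read_text()))
    return merged, files


def update_index(index_dir: Path, new_entries: dict) -> Path:
    """Read, update, write `<generation>.<uuid>.idx`, then remove the files we read."""
    merged, old_files = read_index(index_dir)
    merged.update(new_entries)

    # Next generation number: one more than the highest prefix we have seen.
    generation = 1 + max((int(f.name.split(".")[0]) for f in old_files), default=0)

    # The uuid keeps two racing clients from overwriting each other's file.
    new_file = index_dir / f"{generation}.{uuid.uuid4().hex}.idx"
    new_file.write_text(json.dumps(merged))

    # Only remove what we actually read; files written concurrently by other
    # clients stay in place and get combined by the next reader.
    for f in old_files:
        f.unlink(missing_ok=True)
    return new_file
```

If two clients race, both new files survive, and the next reader sees and combines both, which is the "several index files" case described above.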
Since we will have not only adds but also deletes, we will need a smart combine procedure, like in CRDTs.
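
One possible shape for such a combine procedure is a last-writer-wins map with tombstones, sketched below. This is an assumption, not a decided design: the entry fields `ts`, `deleted` and `checksum` are hypothetical, and it relies on client timestamps being roughly comparable, which clock skew can break.

```python
def _rank(entry):
    # Newer timestamp wins; on a tie a delete beats an add, and the checksum
    # string is a last tie-breaker so the result never depends on merge order.
    return (entry["ts"], entry["deleted"], entry.get("checksum") or "")


def merge_indexes(*indexes):
    """Combine index dicts of the form path -> {"ts", "deleted", "checksum"}."""
    merged = {}
    for index in indexes:
        for path, entry in index.items():
            current = merged.get(path)
            if current is None or _rank(entry) > _rank(current):
                merged[path] = entry
    return merged


def live_files(index):
    """Drop tombstones, leaving only the files that currently exist."""
    return {path: e for path, e in index.items() if not e["deleted"]}
```

Because each path is resolved by taking a maximum over a total order, merging is commutative, associative and idempotent, so clients can combine index files in any order and arrive at the same result.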
The file format is to be discussed, though simple JSON or gzipped JSON with a list of files may do the job.
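
For the gzipped JSON option, a minimal round trip with the standard library could look like this. The helper names `dump_index` and `load_index` and the entry fields are placeholders, since the actual format is still open.

```python
import gzip
import json


def dump_index(entries: dict) -> bytes:
    """Serialize an index as gzipped JSON; sorted keys keep the JSON text stable."""
    return gzip.compress(json.dumps(entries, sort_keys=True).encode("utf-8"))


def load_index(blob: bytes) -> dict:
    """Inverse of dump_index."""
    return json.loads(gzip.decompress(blob).decode("utf-8"))
```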
What do you guys think? @shcheklein @dmpetrov @efiop @pared @mroutis
Issue Analytics
- State:
- Created 4 years ago
- Reactions:6
- Comments:13 (13 by maintainers)
A note on garbage collection.
Since git is distributed and we collect everything that is not referenced, we may remove something that was recently pushed and is only referenced in git commits not yet available locally: not pulled yet, or not even pushed yet by the author.
A sane way around this is to provide a grace period: do not gc anything newer than N days. If someone has just pushed something they shouldn’t have and wants to remove it, that should still be possible:
This is an “I know what I am doing, even though I just messed up” flag 😃
The grace period might simplify some of the conflict resolution above. E.g. we might use a normal listing in gc instead of the index listing to collect orphaned files. This should be discussed in #2325.
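
The grace period rule could be sketched as below. Everything here is illustrative: `gc_candidates`, `grace_days` and `force` are made-up names (`force` stands in for the override flag mentioned above), and remote files are assumed to be available as a checksum-to-mtime mapping.

```python
import time

DAY = 24 * 60 * 60  # seconds


def gc_candidates(remote_files, referenced, grace_days=7, now=None, force=False):
    """Return checksums that are safe to collect.

    remote_files: mapping of checksum -> mtime (seconds since the epoch)
    referenced:   set of checksums still referenced by known commits
    force:        hypothetical override that ignores the grace period
    """
    now = time.time() if now is None else now
    cutoff = now - grace_days * DAY
    return sorted(
        checksum
        for checksum, mtime in remote_files.items()
        # Unreferenced files newer than the cutoff are kept unless forced.
        if checksum not in referenced and (force or mtime <= cutoff)
    )
```

Anything unreferenced but newer than the cutoff is left alone, which covers files pushed by someone whose commits have not reached us yet.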
We’ll create a ticket in dvc-objects/data for this kind of odb indexing, with a summary, in the mid term, and will close this one.